diff --git a/evaluation/with_skills/PER_SKILL_REVIEW_REPORT.md b/evaluation/with_skills/PER_SKILL_REVIEW_REPORT.md new file mode 100644 index 00000000..3698d689 --- /dev/null +++ b/evaluation/with_skills/PER_SKILL_REVIEW_REPORT.md @@ -0,0 +1,556 @@ +# Per-Skill Evaluation Review Report + +Review of each task under `tasks/per_skill_eval/` covering instructions, tests, skills, docs, and mock MCP. Criteria: instructions clear/realistic/fair/not overfitting; tests fair/not overfitting; mock MCP proper and realistic. + +--- + +## ocp-admin__cluster-report + +**Instructions:** Clear and realistic. Asks for cluster health and inventory report with version, nodes, projects, pods. Explicitly asks to document methodology. Does not mention skills. Fair scope. + +**Tests:** Conceptual checks (cluster version, node status, resource utilization, projects, workload stats, context awareness). No exact tool or field names. Fair and not overfitting. + +**Mock MCP:** mock-ocp-mcp provides multiple contexts (prod-us-east, prod-eu-west, staging-central, dev-k8s, legacy-dc), ClusterVersion, nodes, projects, pods. Realistic. Supports both OpenShift and non-OpenShift contexts. Good. + +**Remarks:** None. + +--- + +## rh-ai-engineer__ai-observability + +**Instructions:** Clear. Set up monitoring for AI/ML models: metrics, GPU utilization, right-sizing. Does not mention skills. Realistic. + +**Tests:** Conceptual (GPU monitoring, model metrics, right-sizing, Prometheus/Grafana, alerting). No tool-name matching. Fair. + +**Mock MCP:** Uses rhoai and openshift mocks; not ai-observability MCP. Skill expects get_gpu_info, analyze_vllm, etc. Agent can still describe methodology from the skill. Rhoai mock has inference services, projects; openshift has resources. Adequate for methodology documentation. + +**Remarks:** Mock does not implement ai-observability MCP tools. Agent relies on skill/docs and available rhoai/openshift tools. Acceptable for report-based evaluation. 
+ +--- + +## rh-ai-engineer__debug-inference + +**Instructions:** Clear. Debug failing InferenceService: readiness, pod scheduling, resources, recommend fix. Does not mention skills. Realistic. + +**Tests:** Conceptual (readiness, scheduling, logs, resources, events, fix recommendation). Fair. + +**Mock MCP:** Rhoai mock has broken deployments (text-gen-legacy OOMKilled, nim-llama-prod failing). Openshift mock has pods, events, logs. Good for debugging. + +**Remarks:** None. + +--- + +## rh-ai-engineer__ds-project-setup + +**Instructions:** Clear. Set up data science project with storage, model serving, data connections. Does not mention skills. Realistic. + +**Tests:** Conceptual (project creation, data connections, model serving, credentials, dashboard). Fair. + +**Mock MCP:** Rhoai mock has projects, data connections, serving runtimes, inference services. Good coverage. + +**Remarks:** None. + +--- + +## rh-ai-engineer__model-deploy + +**Instructions:** Clear. Deploy ML model: serving runtime, InferenceService, GPU, common issues. Does not mention skills. Realistic. + +**Tests:** Conceptual (serving runtime, InferenceService, storage, GPU/resource, verification). Fair. + +**Mock MCP:** Rhoai mock has serving runtimes, inference services, deploy_model. Good. + +**Remarks:** None. + +--- + +## rh-ai-engineer__nim-setup + +**Instructions:** Clear. Set up NVIDIA NIM: prerequisites (GPU Operator, NFD), NGC auth, NIM Account. Does not mention skills. Realistic. + +**Tests:** Conceptual (GPU Operator, NFD, NGC auth, image pull secret, NIM Account). Fair. + +**Mock MCP:** Rhoai and openshift. NIM Account is a CR; agent can describe setup. Adequate. + +**Remarks:** None. + +--- + +## rh-ai-engineer__serving-runtime-config + +**Instructions:** Clear. Configure ServingRuntime: model format, container, platform integration. Does not mention skills. Realistic. + +**Tests:** Conceptual (API group, model format, multi-model, container config, platform integration). Fair. 
+ +**Mock MCP:** Rhoai mock has list_serving_runtimes, serving runtime templates. Good. + +**Remarks:** None. + +--- + +## rh-ai-engineer__workbench-manage + +**Instructions:** Clear. Manage workbench: notebook image, resources, storage, lifecycle. Does not mention skills. Realistic. + +**Tests:** Conceptual (notebook image, resources, storage, lifecycle, data loss warning). Fair. + +**Mock MCP:** Rhoai mock may not expose workbench-specific tools (list_workbenches, create_workbench, etc.). Agent documents methodology from skill. Adequate for report-based eval. + +**Remarks:** Verify mock has workbench tools if agent is expected to call them. Otherwise methodology-only is acceptable. + +--- + +## rh-developer__containerize-deploy + +**Instructions:** Clear. Plan containerization (S2I, Dockerfile, Helm) and deployment. Does not mention skills. Realistic. + +**Tests:** Conceptual (strategy evaluation, deployment config). Fair. + +**Mock MCP:** Openshift mock with deployments, builds, projects. Good. + +**Remarks:** None. + +--- + +## rh-developer__debug-build + +**Instructions:** Clear. S2I build failing; examine config/logs, identify phase, recommend fix. Does not mention skills. Realistic. + +**Tests:** Conceptual (build config, phase, fix). Fair. + +**Mock MCP:** Openshift mock has builds with status Complete; api-service pod crashes at runtime (entry point), not during build. No failing S2I build in mock. Agent documents methodology from skill. Adequate for report-based eval. + +**Remarks:** Mock has no failing build; agent relies on skill/docs for build-debug methodology. Consider adding a failing build (e.g., failed pip install) for richer execution-based eval. + +--- + +## rh-developer__debug-container + +**Instructions:** Clear. Container failing at startup; inspect image/config, find cause, recommend fix. Does not mention skills. Realistic. + +**Tests:** Conceptual (image inspection, root cause, fix). Fair. + +**Mock MCP:** Openshift mock has containers. 
Adequate. + +**Remarks:** None. + +--- + +## rh-developer__debug-network + +**Instructions:** Clear. HTTP 503 via Route; trace Route→Service→Pod, find misconfiguration. Does not mention skills. Realistic. + +**Tests:** Conceptual (request path, misconfiguration, fix). Fair. + +**Mock MCP:** Openshift mock has order-system with 503 (selector mismatch). Good. + +**Remarks:** None. + +--- + +## rh-developer__debug-pipeline + +**Instructions:** Clear. Tekton PipelineRun failed; examine status, find failing task, recommend fix. Does not mention skills. Realistic. + +**Tests:** Conceptual (PipelineRun, task, fix/retry). Fair. + +**Mock MCP:** Openshift mock has pipeline data. Good. + +**Remarks:** None. + +--- + +## rh-developer__debug-pod + +**Instructions:** Clear. Pod in web-frontend namespace crashing; investigate, find cause, recommend fix. Does not mention skills. Aligned with mock (web-frontend has OOMKilled). Realistic. + +**Tests:** Conceptual (OOM/memory, exit code, previous logs, resource limits, events, remediation). Fair. + +**Mock MCP:** Openshift mock has web-frontend with OOMKilled (exit 137, 64Mi limit). Good alignment. + +**Remarks:** None. + +--- + +## rh-developer__debug-rhel + +**Instructions:** Clear. RHEL service failing; check service, SELinux, firewall, recommend fix. Does not mention skills. Realistic. + +**Tests:** Conceptual (service, SELinux, firewall, fix). Fair. + +**Mock MCP:** Uses available tools; RHEL debugging may be more doc/skill-driven. Adequate. + +**Remarks:** None. + +--- + +## rh-developer__deploy + +**Instructions:** Clear. Plan deployment: strategy, Service, Route, image, ports. Does not mention skills. Realistic. + +**Tests:** Conceptual (Deployment, Service, Route, image, ports). Fair. + +**Mock MCP:** Openshift mock has deployments, services, routes. Good. + +**Remarks:** None. + +--- + +## rh-developer__detect-project + +**Instructions:** Clear. Detect project type, language, framework from source. Does not mention skills. 
Realistic. + +**Tests:** Conceptual (language, framework, deployment strategy). Fair. + +**Mock MCP:** May use Read tool for source; MCP for cluster context. Adequate. + +**Remarks:** None. + +--- + +## rh-developer__helm-deploy + +**Instructions:** Clear. Plan Helm deployment: chart, values, OpenShift specifics. Does not mention skills. Realistic. + +**Tests:** Conceptual (Helm chart, values, OpenShift). Fair. + +**Mock MCP:** Openshift mock. Adequate. + +**Remarks:** None. + +--- + +## rh-developer__recommend-image + +**Instructions:** Clear. Recommend base image for project type (UBI, security, size). Does not mention skills. Realistic. + +**Tests:** Conceptual (base image, UBI, selection criteria). Fair. + +**Mock MCP:** May use project metadata from mock. Adequate. + +**Remarks:** None. + +--- + +## rh-developer__rhel-deploy + +**Instructions:** Clear. Plan RHEL deployment: systemd, SELinux, volumes, networking. Does not mention skills. Realistic. + +**Tests:** Conceptual (systemd, SELinux, volumes, networking). Fair. + +**Mock MCP:** Adequate for methodology documentation. + +**Remarks:** None. + +--- + +## rh-developer__s2i-build + +**Instructions:** Clear. Configure S2I for Python app: builder, build process, entry point. Does not mention skills. Realistic. + +**Tests:** Conceptual (builder image, entry point, BuildConfig, dependencies). Fair. + +**Mock MCP:** Openshift mock has builds, api-platform. Good. + +**Remarks:** None. + +--- + +## rh-developer__validate-environment + +**Instructions:** Clear. Validate OpenShift: connectivity, permissions, resources, readiness. Does not mention skills. Realistic. + +**Tests:** Conceptual (connectivity, permissions, resources, readiness). Fair. + +**Mock MCP:** Openshift mock. Adequate. + +**Remarks:** None. + +--- + +## rh-sre__cve-impact + +**Instructions:** Clear. Analyze CVE impact: affected systems, scope, pagination. Does not mention skills. Realistic. 
+ +**Tests:** Conceptual (affected systems count, pagination, environment breakdown, remediation readiness, severity). Fair. + +**Mock MCP:** mock-lightspeed-mcp has 63 systems, 5 CVEs, get_cves, get_cve, get_cve_systems, get_system_cves. Realistic fleet and CVE data. Good. + +**Remarks:** None. + +--- + +## rh-sre__cve-validation + +**Instructions:** Clear. Validate CVEs: identifiers, severity, fixes, remediation status. Does not mention skills. Realistic. + +**Tests:** Conceptual (CVE validation, advisories, classification). Fair. + +**Mock MCP:** Lightspeed mock. Good. + +**Remarks:** None. + +--- + +## rh-sre__execution-summary + +**Instructions:** Minimal. "Complete the execution summary analysis." Vague but does not overfit. Agent discovers scope from skill. + +**Tests:** Conceptual (execution summary concepts). Fair. + +**Mock MCP:** AAP and Lightspeed mocks. Adequate. + +**Remarks:** Instruction could be slightly more specific (e.g., "document tools and steps used in a remediation workflow") without overfitting. + +--- + +## rh-sre__fleet-inventory + +**Instructions:** Minimal. "Complete the fleet inventory analysis." Vague but fair. Agent discovers scope from skill. + +**Tests:** Conceptual (fleet inventory concepts). Fair. + +**Mock MCP:** Lightspeed mock with 63 systems. Good. + +**Remarks:** Same as execution-summary: optional minor clarification. + +--- + +## rh-sre__job-template-creator + +**Instructions:** Minimal. "Create an AAP job template for CVE remediation." Fair. Agent discovers details from skill. + +**Tests:** Conceptual (job template creation). Fair. + +**Mock MCP:** AAP mock with job templates, projects. Good. + +**Remarks:** None. + +--- + +## rh-sre__job-template-remediation-validator + +**Instructions:** Minimal. "Validate an AAP job template for CVE remediation." Fair. + +**Tests:** Conceptual (template validation). Fair. + +**Mock MCP:** AAP mock. Good. + +**Remarks:** None. 
+ +--- + +## rh-sre__mcp-aap-validator + +**Instructions:** Clear. Validate AAP MCP connectivity and functionality. Does not mention skills. Realistic. + +**Tests:** Conceptual (connectivity, auth, tool availability, error diagnostics, structured output). Fair. + +**Mock MCP:** AAP mock. Agent validates by calling tools. Good. + +**Remarks:** None. + +--- + +## rh-sre__mcp-lightspeed-validator + +**Instructions:** Clear. Validate Lightspeed MCP connectivity and functionality. Does not mention skills. Realistic. + +**Tests:** Conceptual (connectivity, auth, tools, diagnostics). Fair. + +**Mock MCP:** Lightspeed mock. Good. + +**Remarks:** None. + +--- + +## rh-sre__playbook-executor + +**Instructions:** Clear. Execute remediation playbook via AAP, pre-flight, monitoring. Does not mention skills. Realistic. + +**Tests:** Conceptual (pre-flight, dry run, monitoring, validation, git/source). Fair. + +**Mock MCP:** AAP mock with job templates, projects, jobs, launch. Good. + +**Remarks:** None. + +--- + +## rh-sre__playbook-generator + +**Instructions:** Minimal. "Generate a CVE remediation playbook using Red Hat Insights/Lightspeed." Fair. + +**Tests:** Conceptual (playbook generation). Fair. + +**Mock MCP:** Lightspeed mock with create_vulnerability_playbook. Good. + +**Remarks:** None. + +--- + +## rh-sre__remediation + +**Instructions:** Minimal. "Orchestrate CVE remediation from validation through execution and verification." Fair. + +**Tests:** Conceptual (remediation orchestration). Fair. + +**Mock MCP:** AAP and Lightspeed. Good. + +**Remarks:** None. + +--- + +## rh-sre__remediation-verifier + +**Instructions:** Minimal. "Verify CVE remediation was applied." Fair. + +**Tests:** Conceptual (verification). Fair. + +**Mock MCP:** Lightspeed mock. Good. + +**Remarks:** None. + +--- + +## rh-sre__system-context + +**Instructions:** Minimal. "Gather system context for remediation decisions." Fair. + +**Tests:** Conceptual (system context). Fair. 
+ +**Mock MCP:** Lightspeed mock with system data. Good. + +**Remarks:** None. + +--- + +## rh-virt__vm-clone + +**Instructions:** Clear. Clone production-db (prod-vms) to test-db-clone (test-env). Does not mention skills. Realistic. + +**Tests:** Conceptual (cloning strategy, storage, independence). Fair. + +**Mock MCP:** mock-virt-mcp has VMs but not production-db in prod-vms. Uses virt-prod-dc1, virt-prod-dc2, etc. Agent documents methodology for the given scenario. Adequate. + +**Remarks:** Instruction VM/namespace (production-db, prod-vms) not in mock. Acceptable for methodology documentation. + +--- + +## rh-virt__vm-create + +**Instructions:** Clear. Plan VM test-vm in vm-testing. Does not mention skills. Realistic. + +**Tests:** Conceptual (VM spec, storage, error handling). Fair. + +**Mock MCP:** Virt mock. Agent can describe creation plan. Good. + +**Remarks:** test-vm and vm-testing not in mock; acceptable for planning task. + +--- + +## rh-virt__vm-delete + +**Instructions:** Clear. Plan deletion of legacy-app in decommission. Does not mention skills. Realistic. + +**Tests:** Conceptual (safety checks, scope, safeguards). Fair. + +**Mock MCP:** Virt mock. Adequate. + +**Remarks:** legacy-app and decommission not in mock; acceptable for planning. + +--- + +## rh-virt__vm-inventory + +**Instructions:** Clear. Produce VM inventory: all namespaces, status, resources, OS, IPs, organization. Does not mention skills. Realistic. + +**Tests:** Conceptual (VM status, CPU/memory, OS, network, storage, node, sort). No tool/field names. Fair. + +**Mock MCP:** mock-virt-mcp has 32 VMs across namespaces, VM/VMI, nodes, PVCs. Good. VMI may lack volumeStatus; agent can still produce inventory from VM and VMI data. + +**Remarks:** None. + +--- + +## rh-virt__vm-lifecycle-manager + +**Instructions:** Clear. Stop web-frontend, restart production-db in prod-vms. Does not mention skills. Realistic. + +**Tests:** Conceptual (lifecycle procedures, sequencing, verification). 
Fair. + +**Mock MCP:** Virt mock. Adequate. + +**Remarks:** web-frontend, production-db, prod-vms not in mock; acceptable for methodology. + +--- + +## rh-virt__vm-rebalance + +**Instructions:** Clear. Migrate production-db from overloaded node. Does not mention skills. Realistic. + +**Tests:** Conceptual (migration feasibility, target node, safety). Fair. + +**Mock MCP:** Virt mock has nodes and utilization. Good. + +**Remarks:** production-db not in mock; acceptable. + +--- + +## rh-virt__vm-snapshot-create + +**Instructions:** Clear. Snapshot production-db in prod-vms; prerequisites, spec, consistency. Does not mention skills. Realistic. + +**Tests:** Conceptual (prerequisites, consistency, spec, monitoring, volume check). Baseline requires "production-db" (from instruction). Fair. + +**Mock MCP:** Virt mock does not implement VirtualMachineSnapshot in resources_list. Agent documents plan from skill. Adequate for methodology documentation. + +**Remarks:** production-db and prod-vms not in mock; test expects "production-db" from instruction. Consistent. Snapshot CRs not in mock; agent works from skill. + +--- + +## rh-virt__vm-snapshot-delete + +**Instructions:** Clear. Delete snapshot production-db-backup-20240215 for production-db in prod-vms. Does not mention skills. Realistic. + +**Tests:** Conceptual (safety, confirmation, verification). Fair. + +**Mock MCP:** Virt mock. Adequate. + +**Remarks:** None. + +--- + +## rh-virt__vm-snapshot-list + +**Instructions:** Clear. List snapshots for production-db in prod-vms. Does not mention skills. Realistic. + +**Tests:** Conceptual (snapshot list, status, timestamps). Fair. + +**Mock MCP:** Virt mock. Check if VirtualMachineSnapshot is supported. Adequate. + +**Remarks:** None. + +--- + +## rh-virt__vm-snapshot-restore + +**Instructions:** Clear. Restore production-db from snapshot production-db-backup-20240301. Does not mention skills. Realistic. + +**Tests:** Conceptual (readiness, VM state, safeguards). Fair. 
+ +**Mock MCP:** Virt mock. Adequate. + +**Remarks:** None. + +--- + +## Summary + +**Overall:** Instructions are clear, realistic, and do not mention skills. Tests use conceptual checks and avoid exact tool/field matching. Mocks are generally appropriate. + +**Notable points:** +- ai-observability: Mock uses rhoai/openshift, not ai-observability MCP; acceptable for methodology documentation. +- workbench-manage: Confirm workbench tools exist in mock if execution is expected. +- debug-build: Mock has no failing build; consider adding one (e.g., failed pip install) for execution-based eval. +- rh-sre minimal instructions (execution-summary, fleet-inventory, etc.): Vague but fair; optional minor clarification. +- rh-virt: Several tasks reference VMs/namespaces (production-db, prod-vms, etc.) not in mock; acceptable for planning/methodology tasks. diff --git a/evaluation/with_skills/SKILL_PATH_FIXES.md b/evaluation/with_skills/SKILL_PATH_FIXES.md new file mode 100644 index 00000000..2daa163e --- /dev/null +++ b/evaluation/with_skills/SKILL_PATH_FIXES.md @@ -0,0 +1,180 @@ +# Per-Skill Evaluation: Skill Path Fixes + +This document records all modifications made to SKILL.md files and environment +directories to ensure paths resolve correctly when the agent runs inside the +Harbor container. + +## Container Layout + +The Dockerfile copies environment contents into: + +``` +/root/ +├── .claude/skills//SKILL.md # from environment/skills/ +├── .claude/docs/... # from environment/docs/ +├── docs/... # second copy of docs +├── .mcp.json # generated or copied +└── .mcp-servers/ # from environment/mcp-servers/ +``` + +From a SKILL.md at `/root/.claude/skills//SKILL.md`: +- `../../docs/` resolves to `/root/.claude/docs/` +- `../../../docs/` resolves to `/root/docs/` (second copy) +- `../references/` resolves to `/root/.claude/skills/references/` +- `./` resolves to `/root/.claude/skills//` + +--- + +## Fixes Applied + +### 1. 
rh-ai-engineer (7 tasks): Added shared `skills/references/` + +**Tasks**: ai-observability, debug-inference, ds-project-setup, model-deploy, +nim-setup, serving-runtime-config, workbench-manage + +**Problem**: SKILL.md files reference `../references/skill-conventions.md`, +`../references/live-doc-lookup.md`, and `../references/common-issues.md`. +These expect `environment/skills/references/` to exist. It was missing. + +**Fix**: Copied `agentic-collections/rh-ai-engineer/skills/references/` into +each task's `environment/skills/references/` directory. + +**Files added** (per task): +- `environment/skills/references/skill-conventions.md` +- `environment/skills/references/live-doc-lookup.md` +- `environment/skills/references/common-issues.md` + +--- + +### 2. ocp-admin__cluster-report: Added `scripts/` and `.mcp.json` + +**Problem**: SKILL.md references `../../scripts/cluster-report/assemble.py`, +`../../scripts/cluster-report/aggregate.py`, +`../../scripts/cluster-report/build-kubeconfig.py`, and `../../.mcp.json`. +None were present in the environment. + +**Fix**: Copied from `agentic-collections/ocp-admin/`: +- `scripts/cluster-report/` (6 files) into `environment/scripts/cluster-report/` +- `.mcp.json` into `environment/.mcp.json` + +--- + +### 3. rh-sre (7 tasks): Added cross-referenced skill directories + +**Problem**: Several SRE skills reference other skills via `../other-skill/SKILL.md`. +In the per-skill evaluation, only the evaluated skill is included, so cross-refs +broke. 
+ +**Fix**: Copied the referenced skill directories from +`agentic-collections/rh-sre/skills/` into each task's `environment/skills/`: + +| Task | Added skills | +|------|-------------| +| rh-sre__cve-impact | mcp-lightspeed-validator | +| rh-sre__cve-validation | mcp-lightspeed-validator | +| rh-sre__fleet-inventory | mcp-lightspeed-validator | +| rh-sre__job-template-creator | mcp-aap-validator, playbook-executor | +| rh-sre__job-template-remediation-validator | mcp-aap-validator, playbook-executor, job-template-creator | +| rh-sre__playbook-executor | mcp-aap-validator | +| rh-sre__remediation | cve-validation | + +--- + +### 4. rh-developer (5 tasks): Added `templates/` + +**Tasks**: containerize-deploy, deploy, detect-project, helm-deploy, rhel-deploy + +**Problem**: SKILL.md files reference `templates/deployment.yaml.template`, +`templates/helm/`, `templates/systemd/`, etc. The templates directory was +not present in the environment. + +**Fix**: Copied `agentic-collections/rh-developer/templates/` into each +task's `environment/templates/` directory. + +--- + +### 5. rh-sre__cve-impact: Fixed dangling doc references (SKILL.md modified) + +**Problem**: SKILL.md referenced `insights-api.md` and `fleet-management.md` +in `../../docs/insights/`. These files do not exist in the source +agentic-collections repository. + +**Fix**: Replaced broken links with references to +`vulnerability-logic.md` (which exists at `../../docs/insights/vulnerability-logic.md` +and covers related content): + +| Original reference | Replaced with | +|-------------------|---------------| +| `../../docs/insights/insights-api.md` | `../../docs/insights/vulnerability-logic.md` | +| `../../docs/insights/fleet-management.md` | `../../docs/insights/vulnerability-logic.md` | + +Lines changed: 221-222, 252-253, 394-395 + +--- + +### 6. 
rh-sre__fleet-inventory: Fixed dangling doc references (SKILL.md modified) + +**Problem**: Same as cve-impact — references to non-existent `insights-api.md` +and `fleet-management.md`. + +**Fix**: Same replacement to `vulnerability-logic.md`. + +Lines changed: 101-102, 127-128, 219-220 + +--- + +### 7. rh-sre__cve-impact: Fixed path depth `../../../docs/` → `../../docs/` + +**Problem**: Two references used `../../../docs/references/` (three levels up) +instead of `../../docs/references/` (two levels up). Both paths work inside +the container (docs is at both `/root/.claude/docs/` and `/root/docs/`), but +`../../docs/` is the canonical path. + +**Fix**: Changed `../../../docs/` to `../../docs/` in two places: +- Line 23: `skill-invocation.md` +- Line 325: `lightspeed-mcp-tool-failures.md` + +--- + +### 8. rh-sre__cve-validation: Fixed path depth `../../../docs/` → `../../docs/` + +**Problem**: Same path depth issue as cve-impact. + +**Fix**: Changed `../../../docs/references/skill-invocation.md` path from +`../../../docs/` to `../../docs/`. + +Line changed: 24 + +--- + +### 9. rh-virt__vm-rebalance: Fixed citation paths (SKILL.md modified) + +**Problem**: SKILL.md uses absolute-style paths +`rh-virt/skills/vm-rebalance/REBALANCE_MANUAL.md` in agent output citation +text. These don't resolve from the skill directory. The actual Read +instructions correctly use `./REBALANCE_MANUAL.md`. + +**Fix**: Changed citation paths to use relative `./` prefix: +- `rh-virt/skills/vm-rebalance/REBALANCE_MANUAL.md` → `./REBALANCE_MANUAL.md` +- `rh-virt/skills/vm-rebalance/REBALANCE_AUTOMATIC.md` → `./REBALANCE_AUTOMATIC.md` + +Lines changed: 94, 103 + +--- + +## Remaining Non-Issues (false positives) + +| Task | Pattern | Explanation | +|------|---------|-------------| +| rh-developer__debug-rhel | `[path](/.*)? ` | SELinux fcontext regex, not a file link | +| rh-developer__rhel-deploy | `[app-name](/.*)? 
` | SELinux fcontext regex, not a file link | + +These appear in `semanage fcontext` shell command examples. The markdown +link syntax parser matches them, but they are regex patterns, not file +references. + +--- + +## Validation Results + +After all fixes: **269 paths OK, 0 real broken references**. diff --git a/evaluation/with_skills/ocp-admin__cluster-report/environment/.mcp.json b/evaluation/with_skills/ocp-admin__cluster-report/environment/.mcp.json new file mode 100644 index 00000000..5cd15768 --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/environment/.mcp.json @@ -0,0 +1,20 @@ +{ + "mcpServers": { + "openshift": { + "command": "bash", + "args": [ + "-c", + "U=(); [ \"$(uname -s)\" = Linux ] && U=(--userns=keep-id:uid=65532,gid=65532); exec podman run \"${U[@]}\" --rm -i --network=host -v \"${KUBECONFIG}:/kubeconfig:ro,Z\" --entrypoint /app/kubernetes-mcp-server quay.io/ecosystem-appeng/openshift-mcp-server:latest --kubeconfig /kubeconfig --read-only --toolsets core,config" + ], + "env": { + "KUBECONFIG": "${KUBECONFIG}" + }, + "description": "Red Hat OpenShift MCP server for multi-cluster administration and reporting", + "security": { + "isolation": "container", + "network": "local", + "credentials": "env-only" + } + } + } +} diff --git a/evaluation/with_skills/ocp-admin__cluster-report/environment/Dockerfile b/evaluation/with_skills/ocp-admin__cluster-report/environment/Dockerfile new file mode 100644 index 00000000..b49ea754 --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ 
+current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-ocp-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/ocp-admin__cluster-report/environment/mcp-servers/mock-ocp-mcp.py b/evaluation/with_skills/ocp-admin__cluster-report/environment/mcp-servers/mock-ocp-mcp.py new file mode 100644 index 00000000..65e0b6b5 --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/environment/mcp-servers/mock-ocp-mcp.py @@ -0,0 +1,304 @@ +#!/usr/bin/env python3 + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +CONTEXTS = [ + ("prod-us-east", "https://api.prod-us-east.example.com:6443", "OpenShift 4.16.3", 6, "high"), + ("prod-eu-west", "https://api.prod-eu-west.example.com:6443", "OpenShift 4.15.12", 4, "moderate"), + ("staging-central", "https://api.staging-central.example.com:6443", "OpenShift 4.16.1", 3, "low"), + ("dev-k8s", "https://dev-k8s.internal.example.com:6443", "Kubernetes", 2, "low"), + 
("legacy-dc", "https://legacy-dc.example.com:6443", "OpenShift 4.14", 5, "unknown"), +] + +UNREACHABLE = {"legacy-dc"} +OPENSHIFT_CONTEXTS = {"prod-us-east", "prod-eu-west", "staging-central", "legacy-dc"} +NON_OPENSHIFT = {"dev-k8s"} + + +def _check_context(context): + ctx = (context or "prod-us-east").strip() + if ctx in UNREACHABLE: + raise ConnectionError(f"Connection refused to {ctx}") + valid = {c[0] for c in CONTEXTS} + if ctx not in valid: + raise ValueError(f"Unknown context: {ctx}") + return ctx + + +def _format_tabular(headers, rows): + if not headers or not rows: + return "" + widths = [len(h) for h in headers] + for row in rows: + for i, h in enumerate(headers): + val = str(row.get(h, "")) + widths[i] = max(widths[i], len(val)) + lines = [] + header_line = "".join(h.ljust(w + 2) for h, w in zip(headers, widths)) + lines.append(header_line.rstrip()) + for row in rows: + line = "".join(str(row.get(h, "")).ljust(w + 2) for h, w in zip(headers, widths)) + lines.append(line.rstrip()) + return "\n".join(lines) + + +# Node data for resources_get (Node kind) +NODE_DATA = { + "prod-us-east": { + "node-us-master-1": { + "metadata": {"name": "node-us-master-1", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-us-master-2": { + "metadata": {"name": "node-us-master-2", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-us-master-3": { + "metadata": {"name": "node-us-master-3", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-us-worker-1": { + "metadata": {"name": "node-us-worker-1", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": { + "allocatable": {"cpu": "32", "memory": "128Gi", "pods": "250", "nvidia.com/gpu": "4"}, + 
"conditions": [], + }, + }, + "node-us-worker-2": { + "metadata": {"name": "node-us-worker-2", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, "conditions": []}, + }, + "node-us-worker-3": { + "metadata": {"name": "node-us-worker-3", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250", "nvidia.com/gpu": "4"}, + "conditions": [], + }, + }, + }, + "prod-eu-west": { + "node-eu-master-1": { + "metadata": {"name": "node-eu-master-1", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-eu-worker-1": { + "metadata": {"name": "node-eu-worker-1", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, "conditions": []}, + }, + "node-eu-worker-2": { + "metadata": {"name": "node-eu-worker-2", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, "conditions": []}, + }, + "node-eu-worker-3": { + "metadata": {"name": "node-eu-worker-3", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, "conditions": []}, + }, + }, + "staging-central": { + "node-staging-master-1": { + "metadata": {"name": "node-staging-master-1", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-staging-worker-1": { + "metadata": {"name": "node-staging-worker-1", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, "conditions": []}, + }, + "node-staging-worker-2": { + "metadata": {"name": "node-staging-worker-2", "labels": {"node-role.kubernetes.io/worker": 
""}}, + "status": {"allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, "conditions": []}, + }, + }, + "dev-k8s": { + "node-dev-1": { + "metadata": {"name": "node-dev-1", "labels": {"node-role.kubernetes.io/control-plane": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "8Gi", "pods": "110"}, "conditions": []}, + }, + "node-dev-2": { + "metadata": {"name": "node-dev-2", "labels": {}}, + "status": {"allocatable": {"cpu": "4", "memory": "8Gi", "pods": "110"}, "conditions": []}, + }, + }, +} + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all kubeconfig contexts with server URLs and cluster info.""" + headers = ["CONTEXT", "SERVER", "VERSION", "NODES", "UTILIZATION"] + rows = [{"CONTEXT": c[0], "SERVER": c[1], "VERSION": c[2], "NODES": str(c[3]), "UTILIZATION": c[4]} for c in CONTEXTS] + return _format_tabular(headers, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str | None = None, + context: str | None = None, +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + ctx = _check_context(context) + + if apiVersion == "config.openshift.io/v1" and kind == "ClusterVersion": + if ctx in NON_OPENSHIFT: + raise ValueError("ClusterVersion not found (non-OpenShift cluster)") + versions = { + "prod-us-east": "4.16.3", + "prod-eu-west": "4.15.12", + "staging-central": "4.16.1", + "legacy-dc": "4.14", + } + ver = versions.get(ctx, "4.16.0") + return f'{{"apiVersion":"config.openshift.io/v1","kind":"ClusterVersion","metadata":{{"name":"version"}},"status":{{"desired":{{"version":"{ver}"}}}}}}' + + if apiVersion == "v1" and kind == "Node": + nodes = NODE_DATA.get(ctx, {}) + if name not in nodes: + raise ValueError(f"Node {name} not found") + return json.dumps(nodes[name]) + + raise ValueError(f"Unsupported resource: {apiVersion}/{kind}") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str | None = None, + context: str | None 
= None, +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + ctx = _check_context(context) + + if apiVersion == "v1" and kind == "Node": + nodes = NODE_DATA.get(ctx, {}) + return json.dumps(list(nodes.values())) + + if apiVersion == "v1" and kind == "Namespace": + return namespaces_list(context=ctx) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def nodes_top(context: str | None = None) -> str: + """Return node CPU and memory usage from Metrics Server.""" + ctx = _check_context(context) + + # prod-us-east: node-us-worker-1 (28.4/32=89%, 112.6/128=88%), node-us-worker-3 (14.2/16=89%, 56.8/64=89%) + if ctx == "prod-us-east": + rows = [ + {"NAME": "node-us-master-1", "CPU(cores)": "1.2", "MEMORY(bytes)": "4Gi"}, + {"NAME": "node-us-master-2", "CPU(cores)": "1.1", "MEMORY(bytes)": "3.8Gi"}, + {"NAME": "node-us-master-3", "CPU(cores)": "1.0", "MEMORY(bytes)": "3.6Gi"}, + {"NAME": "node-us-worker-1", "CPU(cores)": "28.4", "MEMORY(bytes)": "112.6Gi"}, + {"NAME": "node-us-worker-2", "CPU(cores)": "8.2", "MEMORY(bytes)": "32Gi"}, + {"NAME": "node-us-worker-3", "CPU(cores)": "14.2", "MEMORY(bytes)": "56.8Gi"}, + ] + elif ctx == "prod-eu-west": + rows = [ + {"NAME": "node-eu-master-1", "CPU(cores)": "0.8", "MEMORY(bytes)": "3Gi"}, + {"NAME": "node-eu-worker-1", "CPU(cores)": "6.2", "MEMORY(bytes)": "24Gi"}, + {"NAME": "node-eu-worker-2", "CPU(cores)": "5.8", "MEMORY(bytes)": "22Gi"}, + {"NAME": "node-eu-worker-3", "CPU(cores)": "7.1", "MEMORY(bytes)": "28Gi"}, + ] + elif ctx == "staging-central": + rows = [ + {"NAME": "node-staging-master-1", "CPU(cores)": "0.5", "MEMORY(bytes)": "2Gi"}, + {"NAME": "node-staging-worker-1", "CPU(cores)": "2.1", "MEMORY(bytes)": "8Gi"}, + {"NAME": "node-staging-worker-2", "CPU(cores)": "1.8", "MEMORY(bytes)": "7Gi"}, + ] + elif ctx == "dev-k8s": + rows = [ + {"NAME": "node-dev-1", "CPU(cores)": "1.2", "MEMORY(bytes)": "3Gi"}, + {"NAME": "node-dev-2", "CPU(cores)": "2.0", "MEMORY(bytes)": 
"5Gi"}, + ] + else: + rows = [] + + headers = ["NAME", "CPU(cores)", "MEMORY(bytes)"] + return _format_tabular(headers, rows) + + +@mcp.tool() +def pods_list(namespace: str | None = None, context: str | None = None) -> str: + """List pods across namespaces.""" + ctx = _check_context(context) + + if ctx == "prod-us-east": + rows = [ + {"NAMESPACE": "batch-jobs", "NAME": "data-pipeline-batch-abc", "STATUS": "Failed"}, + {"NAMESPACE": "batch-jobs", "NAME": "data-pipeline-batch-def", "STATUS": "Failed"}, + {"NAMESPACE": "ci-cd", "NAME": "image-builder", "STATUS": "CrashLoopBackOff"}, + {"NAMESPACE": "app-platform", "NAME": "deploy-canary", "STATUS": "Pending"}, + {"NAMESPACE": "default", "NAME": "api-server", "STATUS": "Running"}, + {"NAMESPACE": "default", "NAME": "web-frontend", "STATUS": "Running"}, + {"NAMESPACE": "openshift-monitoring", "NAME": "prometheus-0", "STATUS": "Running"}, + ] + elif ctx == "prod-eu-west": + rows = [ + {"NAMESPACE": "security", "NAME": "compliance-scanner-failed", "STATUS": "Failed"}, + {"NAMESPACE": "default", "NAME": "api-eu", "STATUS": "Running"}, + ] + elif ctx == "staging-central": + rows = [ + {"NAMESPACE": "staging-apps", "NAME": "image-pull-broken-pod", "STATUS": "ImagePullBackOff"}, + {"NAMESPACE": "default", "NAME": "staging-api", "STATUS": "Running"}, + ] + elif ctx == "dev-k8s": + rows = [ + {"NAMESPACE": "default", "NAME": "dev-pod-1", "STATUS": "Running"}, + {"NAMESPACE": "kube-system", "NAME": "coredns-xyz", "STATUS": "Running"}, + ] + else: + rows = [] + + headers = ["NAMESPACE", "NAME", "STATUS"] + return _format_tabular(headers, rows) + + +@mcp.tool() +def projects_list(context: str | None = None) -> str: + """List OpenShift projects.""" + ctx = _check_context(context) + if ctx in NON_OPENSHIFT: + raise ValueError("projects_list is OpenShift-only; use namespaces_list for vanilla Kubernetes") + + counts = {"prod-us-east": 21, "prod-eu-west": 16, "staging-central": 12, "legacy-dc": 8} + n = counts.get(ctx, 5) + rows = 
[{"NAME": f"project-{i}"} for i in range(1, n + 1)] + headers = ["NAME"] + return _format_tabular(headers, rows) + + +@mcp.tool() +def namespaces_list(context: str | None = None) -> str: + """List all namespaces in a cluster.""" + ctx = _check_context(context) + + if ctx == "dev-k8s": + # 6 namespaces for vanilla Kubernetes + rows = [ + {"NAME": "default"}, + {"NAME": "kube-system"}, + {"NAME": "kube-public"}, + {"NAME": "kube-node-lease"}, + {"NAME": "app-dev"}, + {"NAME": "monitoring"}, + ] + else: + # OpenShift: projects map to namespaces + counts = {"prod-us-east": 21, "prod-eu-west": 16, "staging-central": 12} + n = counts.get(ctx, 5) + rows = [{"NAME": f"project-{i}"} for i in range(1, n + 1)] + + headers = ["NAME"] + return _format_tabular(headers, rows) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/ocp-admin__cluster-report/instruction.md b/evaluation/with_skills/ocp-admin__cluster-report/instruction.md new file mode 100644 index 00000000..b13ffc9a --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/instruction.md @@ -0,0 +1,17 @@ +# Cluster Report Task + +You are an OpenShift cluster administrator. Your operations lead has requested a comprehensive infrastructure health snapshot for the weekly review. Your environment has multiple cluster contexts configured. 
+ +## Requirements +- Discover all available cluster contexts in your environment +- For each accessible OpenShift cluster, report: + - Cluster version and API server URL + - All nodes with their status (Ready/NotReady), roles, and resource utilization (CPU and memory usage vs capacity) + - All projects/namespaces with their status + - Workload counts: total pods, running vs failing, and any pods in error states +- Explicitly note any contexts that are not OpenShift clusters or could not be reached, and explain why +- Highlight any issues that need attention (unhealthy nodes, resource pressure, failing workloads) + +Use MCP tools to examine the clusters. Write the complete cluster report in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/ocp-admin__cluster-report/solution/solve.sh b/evaluation/with_skills/ocp-admin__cluster-report/solution/solve.sh new file mode 100644 index 00000000..62bd7e47 --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/solution/solve.sh @@ -0,0 +1,30 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Multi-Cluster Health Report + +## Cluster Discovery +Use configuration_contexts_list for kubeconfig contexts. Verify each with resources_get(apiVersion="config.openshift.io/v1", kind="ClusterVersion", name="version"). + +## Cluster Contexts +| Context | Type | Server | +|---------|------|--------| +| ocp-prod | OpenShift (ClusterVersion detected) | https://api.ocp-prod.example.com:6443 | + +### OpenShift Detection +Check for ClusterVersion resource: config.openshift.io/v1. Non-OpenShift contexts excluded by default. 
+ +## Node Resources +| Node | CPU | Memory | GPUs | +|------|-----|--------|------| +| worker-01 | 16 cores (45% used) | 64Gi (60% used) | 2 | +| worker-02 | 16 cores (30% used) | 64Gi (40% used) | 0 | + +## Pod Status +| Namespace | Running | Pending | Failed | +|-----------|---------|---------|--------| +| default | 5 | 0 | 0 | +| openshift-operators | 12 | 0 | 1 | + +### Generated using assemble.py and aggregate.py helper scripts +Persist MCP output to /tmp/cluster-report/. Manifest with $file refs. Projects_list (fallback namespaces_list for non-OpenShift) +REPORT_EOF diff --git a/evaluation/with_skills/ocp-admin__cluster-report/task.toml b/evaluation/with_skills/ocp-admin__cluster-report/task.toml new file mode 100644 index 00000000..51a06299 --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "ocp-admin__cluster-report" +name = "ocp-admin Multi-Cluster Health Report Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["ocp-admin", "cluster-report", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/ocp-admin__cluster-report/tests/llm_judge.py b/evaluation/with_skills/ocp-admin__cluster-report/tests/llm_judge.py new file mode 100644 index 00000000..6c379f29 --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "clusterversion_verification", + "file": "/root/report.md", + 
"question": "Does the report explicitly mention probing or checking the ClusterVersion resource as the method used to determine whether each context is an OpenShift cluster?", + "reference": "A skilled report should mention 'ClusterVersion' as the API resource used to verify OpenShift status. It should explain that dev-k8s was classified as non-OpenShift because no ClusterVersion resource was found. Simply saying 'vanilla Kubernetes' or 'not OpenShift' without mentioning the ClusterVersion verification mechanism is insufficient." + }, + { + "id": "exclusion_methodology", + "file": "/root/report.md", + "question": "Does the report treat non-OpenShift clusters (like dev-k8s) as EXCLUDED from the detailed report — listing them briefly in an exclusion section — rather than including them as full sections with node/pod details?", + "reference": "A skilled report should have a separate 'Excluded Clusters' or 'Non-OpenShift' section where dev-k8s is listed briefly with the reason for exclusion. A report that includes dev-k8s as a full section with node details, namespaces, and pod data is NOT demonstrating the skill's exclusion methodology." + }, + { + "id": "aggregated_totals", + "file": "/root/report.md", + "question": "Does the report include aggregated totals across all OpenShift clusters — total nodes, total CPU, total memory, total GPUs — in a comparison or summary table?", + "reference": "A skilled report should have a comparison table with a 'Total' row showing aggregate counts (e.g., 13 nodes total, 148 CPU cores, 592 GiB memory, 8 GPUs). Reports that list each cluster's data without cross-cluster aggregation are insufficient." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/ocp-admin__cluster-report/tests/test.sh b/evaluation/with_skills/ocp-admin__cluster-report/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/ocp-admin__cluster-report/tests/test_outputs.py b/evaluation/with_skills/ocp-admin__cluster-report/tests/test_outputs.py new file mode 100644 index 00000000..5c65747c --- /dev/null +++ b/evaluation/with_skills/ocp-admin__cluster-report/tests/test_outputs.py @@ -0,0 +1,105 @@ +""" +Tests for ocp-admin__cluster-report per-skill evaluation. +Baseline tests: any competent agent should pass. +Skill-dependent tests: based on empirical gaps between skilled and unskilled agent outputs. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_cluster(self): + content = read_report().lower() + assert any(t in content for t in ["cluster", "openshift", "node"]), ( + "report should mention cluster" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_clusterversion_resource(self): + """Skill teaches to probe the ClusterVersion resource to verify OpenShift. + Without skill, agents say 'vanilla Kubernetes' without mentioning the mechanism.""" + c = read_report().lower() + assert "clusterversion" in c or "cluster version resource" in c, ( + "should mention ClusterVersion resource as the OpenShift verification method" + ) + + def test_aggregated_cross_cluster_totals(self): + """Skill teaches a comparison table with aggregated totals across clusters. 
+ Without skill, agents report each cluster separately without totals.""" + c = read_report().lower() + has_total_label = "total" in c or "aggregate" in c or "combined" in c + has_aggregate_context = any(t in c for t in [ + "total node", "total cpu", "total memory", "total gpu", + "across cluster", "combined resource", "aggregate", + ]) or (has_total_label and any(t in c for t in ["node", "cpu", "core", "memory", "gi"])) + assert has_total_label and has_aggregate_context, ( + "should include aggregated cross-cluster totals (total nodes, CPU, memory)" + ) + + def test_non_openshift_exclusion(self): + """Skill teaches to EXCLUDE non-OpenShift clusters from detailed reporting. + Without skill, agents include dev-k8s as a full section with nodes/pods/namespaces.""" + c = read_report().lower() + has_exclusion = any(t in c for t in [ + "excluded", "exclude", "excluded by default", "not included", + "omitted", "non-openshift", + ]) + assert has_exclusion and "dev-k8s" in c, ( + "should explicitly exclude non-OpenShift clusters from detailed data" + ) + + def test_unreachable_reporting(self): + """Both agents should mention unreachable clusters, but skill teaches categorization.""" + c = read_report().lower() + assert "legacy-dc" in c and any(t in c for t in [ + "unreachable", "connection refused", "offline", + ]), "should report legacy-dc as unreachable" + + def test_gpu_inventory(self): + """Skill template includes GPU column — moderate discriminator.""" + c = read_report().lower() + assert "gpu" in c, "should include GPU information" + + def test_version_numbers(self): + """Both agents get versions from MCP, but skill ensures all clusters are covered.""" + c = read_report() + versions = sum(1 for v in ["4.16.3", "4.15.12", "4.16.1"] if v in c) + assert versions >= 2, "should report exact version numbers for multiple clusters" + + def test_multi_cluster_tooling(self): + """Docs teach multi-cluster tooling/automation for consistent reporting. 
+ Without docs, agents rely on manual kubectl context switching.""" + c = read_report().lower() + assert any(t in c for t in [ + "build-kubeconfig", "kubeconfig.py", "cluster-reporter", + "multi-cluster", "multiple context", "all contexts", + "setup script", "automation", + ]), "should reference multi-cluster tooling or automation approach" + + def test_rbac_for_reporting(self): + """Docs teach read-only RBAC (ClusterRole/ServiceAccount) for cluster reporting + instead of admin credentials.""" + c = read_report().lower() + assert any(t in c for t in [ + "cluster-reporter-readonly", "cluster-reporter-system", + "readonly", "read-only", "clusterrole", + "service account", "serviceaccount", "rbac", + "least privilege", "non-admin", + ]), "should reference read-only RBAC for cluster reporting" diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/Dockerfile b/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/Dockerfile new file mode 100644 index 00000000..11301417 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/Dockerfile @@ -0,0 +1,78 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills 
/root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + }, \ + "observability": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-observability-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-observability-mcp.py b/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-observability-mcp.py new file mode 100644 index 00000000..f150dcff --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-observability-mcp.py @@ -0,0 +1,260 @@ +#!/usr/bin/env python3 +"""Mock Observability MCP server for SkillsBench rh-ai-engineer__ai-observability task. + +Simulates Prometheus/Grafana-style metrics for inference services: latency, +throughput, error rates, GPU utilization, resource usage, and alerts. 
+ +Scenario (aligned with rhoai/openshift mocks): +- ml-production namespace: + - text-gen-legacy (Mistral 7B on vLLM): OOMKilled; before crash: 22GB/24GB GPU, + p99=2800ms, throughput=3 req/s, error rate=15% + - nim-llama-prod (Llama 3.1 8B on NIM): not running, no metrics (empty/error) + - sentiment-classifier: running well, 4GB/24GB GPU, p99=45ms, throughput=150 req/s, + error rate=0.1% +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("observability") + +# ── Mock metrics data ────────────────────────────────────────────────────── + +# text-gen-legacy: OOMKilled, metrics from before crash +MODEL_METRICS = { + "ml-production": { + "text-gen-legacy": { + "status": "OOMKilled", + "latency_ms": {"p50": 1200, "p95": 2100, "p99": 2800}, + "throughput_req_per_sec": 3.0, + "error_rate_percent": 15.0, + "input_tokens_per_sec": 45, + "output_tokens_per_sec": 12, + "total_requests_24h": 259200, # 3 * 86400 + }, + "nim-llama-prod": None, # not running, no metrics + "sentiment-classifier": { + "status": "Running", + "latency_ms": {"p50": 18, "p95": 38, "p99": 45}, + "throughput_req_per_sec": 150.0, + "error_rate_percent": 0.1, + "input_tokens_per_sec": 1200, + "output_tokens_per_sec": 50, + "total_requests_24h": 12960000, + }, + }, +} + +GPU_UTILIZATION = { + "ml-production": [ + { + "pod": "text-gen-legacy-predictor-00001-abc12", + "model": "text-gen-legacy", + "gpu_memory_used_gb": 22.0, + "gpu_memory_total_gb": 24.0, + "gpu_memory_utilization_percent": 91.7, + "gpu_compute_utilization_percent": 35.0, + "status": "OOMKilled", + }, + { + "pod": "sentiment-classifier-predictor-00001-xyz99", + "model": "sentiment-classifier", + "gpu_memory_used_gb": 4.0, + "gpu_memory_total_gb": 24.0, + "gpu_memory_utilization_percent": 16.7, + "gpu_compute_utilization_percent": 42.0, + "status": "Running", + }, + # nim-llama-prod: no pod + ], +} + +RESOURCE_USAGE = { + "ml-production": [ + { + "pod": "text-gen-legacy-predictor-00001-abc12", + "model": "text-gen-legacy", + 
"cpu_request": "4", + "cpu_limit": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "cpu_actual_usage": "3.2", + "memory_actual_usage_mib": 16384, + "status": "CrashLoopBackOff", + }, + { + "pod": "sentiment-classifier-predictor-00001-xyz99", + "model": "sentiment-classifier", + "cpu_request": "2", + "cpu_limit": "4", + "memory_request": "8Gi", + "memory_limit": "16Gi", + "cpu_actual_usage": "1.1", + "memory_actual_usage_mib": 4096, + "status": "Running", + }, + ], +} + +PROMETHEUS_ALERTS = { + "ml-production": [ + { + "name": "InferenceServiceOOMKilled", + "severity": "critical", + "state": "firing", + "summary": "text-gen-legacy predictor pod OOMKilled", + "description": "Container kserve-container was OOMKilled (exit code 137). " + "GPU memory exhausted during KV cache allocation.", + "labels": { + "inference_service": "text-gen-legacy", + "namespace": "ml-production", + }, + }, + { + "name": "HighInferenceLatency", + "severity": "warning", + "state": "firing", + "summary": "text-gen-legacy p99 latency > 2000ms", + "description": "Inference latency p99 is 2800ms, exceeding threshold of 2000ms.", + "labels": { + "inference_service": "text-gen-legacy", + "namespace": "ml-production", + }, + }, + { + "name": "HighErrorRate", + "severity": "warning", + "state": "firing", + "summary": "text-gen-legacy error rate 15%", + "description": "Inference error rate is 15%, exceeding threshold of 5%.", + "labels": { + "inference_service": "text-gen-legacy", + "namespace": "ml-production", + }, + }, + ], +} + + +# ── Tools ────────────────────────────────────────────────────────────────── + + +@mcp.tool() +def query_model_metrics( + model_name: str, + namespace: str, + metric_type: str = "all", +) -> str: + """Query inference metrics for a model. Returns latency (p50/p95/p99), throughput + (requests/sec), error rates, and token counts. 
+ + metric_type: 'all', 'latency', 'throughput', 'errors', or 'tokens' + """ + ns_data = MODEL_METRICS.get(namespace) + if not ns_data: + return json.dumps({"error": f"Namespace '{namespace}' not found"}, indent=2) + + metrics = ns_data.get(model_name) + if metrics is None: + return json.dumps({ + "error": f"No metrics for model '{model_name}' in namespace '{namespace}'. " + "Model may not be running (e.g., nim-llama-prod has no pods).", + "model_name": model_name, + "namespace": namespace, + }, indent=2) + + result = { + "model_name": model_name, + "namespace": namespace, + "status": metrics["status"], + } + + if metric_type in ("all", "latency"): + result["latency_ms"] = metrics["latency_ms"] + if metric_type in ("all", "throughput"): + result["throughput_req_per_sec"] = metrics["throughput_req_per_sec"] + result["total_requests_24h"] = metrics.get("total_requests_24h") + if metric_type in ("all", "errors"): + result["error_rate_percent"] = metrics["error_rate_percent"] + if metric_type in ("all", "tokens"): + result["input_tokens_per_sec"] = metrics["input_tokens_per_sec"] + result["output_tokens_per_sec"] = metrics["output_tokens_per_sec"] + + return json.dumps(result, indent=2) + + +@mcp.tool() +def query_gpu_utilization(namespace: str) -> str: + """Query GPU memory used/total and compute utilization per inference pod.""" + pods = GPU_UTILIZATION.get(namespace, []) + if not pods: + return json.dumps({ + "namespace": namespace, + "pods": [], + "message": "No GPU-backed inference pods found in namespace.", + }, indent=2) + return json.dumps({ + "namespace": namespace, + "pods": pods, + }, indent=2) + + +@mcp.tool() +def query_resource_usage(namespace: str) -> str: + """Query actual CPU/memory usage vs requests/limits for inference pods.""" + pods = RESOURCE_USAGE.get(namespace, []) + if not pods: + return json.dumps({ + "namespace": namespace, + "pods": [], + "message": "No inference pods found in namespace.", + }, indent=2) + return json.dumps({ + "namespace": 
namespace, + "pods": pods, + }, indent=2) + + +@mcp.tool() +def list_prometheus_alerts(namespace: str) -> str: + """List firing Prometheus alerts related to inference services in the namespace.""" + alerts = PROMETHEUS_ALERTS.get(namespace, []) + return json.dumps({ + "namespace": namespace, + "alerts": alerts, + "firing_count": len(alerts), + }, indent=2) + + +@mcp.tool() +def get_model_performance_summary(namespace: str) -> str: + """Get aggregated performance data across all models in the namespace.""" + ns_data = MODEL_METRICS.get(namespace) + if not ns_data: + return json.dumps({"error": f"Namespace '{namespace}' not found"}, indent=2) + + models = [] + for name, metrics in ns_data.items(): + if metrics is None: + models.append({ + "model_name": name, + "status": "NotRunning", + "error": "No metrics available (pod not created or not running)", + }) + else: + models.append({ + "model_name": name, + "status": metrics["status"], + "latency_p99_ms": metrics["latency_ms"]["p99"], + "throughput_req_per_sec": metrics["throughput_req_per_sec"], + "error_rate_percent": metrics["error_rate_percent"], + }) + + return json.dumps({ + "namespace": namespace, + "models": models, + }, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. 
+ +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + 
"conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, "memory_limit": memory_limit, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc in isvcs.values(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") +
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list[str] | None = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace":
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/instruction.md b/evaluation/with_skills/rh-ai-engineer__ai-observability/instruction.md new file mode 100644 index 00000000..f76c1829 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/instruction.md @@ -0,0 +1,13 @@ +# AI Observability Task + +You are an AI engineer on Red Hat OpenShift AI. Your team has deployed several inference services, but has no visibility into how they are performing or whether resources are sized correctly. + +## Requirements +- Assess the current state of deployed inference services and their resource consumption +- Define a metrics strategy covering: inference latency, throughput, error rates, and GPU memory utilization +- Identify any models that appear over-provisioned or under-provisioned based on current usage +- Recommend specific resource adjustments (CPU, memory, GPU, replicas) with justification + +Document your observability strategy and resource recommendations in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/solution/solve.sh b/evaluation/with_skills/rh-ai-engineer__ai-observability/solution/solve.sh new file mode 100644 index 00000000..d319c204 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/solution/solve.sh @@ -0,0 +1,23 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# AI Observability Report + +## Model: fraud-detection + +### GPU Utilization (last 15m) +- GPU memory: 65% utilized +- GPU compute: 45% utilized +- Recommendation: GPU is underutilized, consider right-sizing + +### Model Latency +- P50: 120ms +- P99: 450ms + +### Right-Sizing +- Current: 1x A100 80GB +- Recommended: 1x A100 40GB (sufficient for workload) + +### Advanced Observability +- execute_promql for custom metrics (e.g., vllm:request_success:ratio) +- query_tempo_tool for trace latency on slow requests +REPORT_EOF diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/task.toml b/evaluation/with_skills/rh-ai-engineer__ai-observability/task.toml new file mode 100644 index 00000000..130dece1 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__ai-observability" +name = "rh-ai-engineer AI Observability Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "ai-observability", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/llm_judge.py b/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/llm_judge.py new file mode 100644 index 00000000..28b8d6a0 --- /dev/null +++ 
b/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "observability_stack_awareness", + "file": "/root/report.md", + "question": "Does the report recommend specific observability tools from the OpenShift AI ecosystem — specifically Tempo for distributed tracing AND Korrel8r for cross-domain correlation — as part of the observability strategy?", + "reference": "A skilled report should mention BOTH Tempo (for distributed tracing of inference requests) and Korrel8r (for correlating signals across logs, metrics, traces, and alerts). These are specific tools the skill teaches. A report that only recommends generic Prometheus/Grafana monitoring without Tempo or Korrel8r is insufficient." + }, + { + "id": "gpu_metric_specificity", + "file": "/root/report.md", + "question": "Does the report reference DCGM (Data Center GPU Manager) metric names (like DCGM_FI_DEV_FB_USED or DCGM_FI_DEV_GPU_UTIL) for GPU monitoring, rather than generic nvidia_gpu_memory metric names?", + "reference": "A skilled report uses DCGM-specific metric names (DCGM_FI_DEV_*) which are the actual metrics exposed by the GPU operator on OpenShift. Using generic names like nvidia_gpu_memory_used_bytes suggests the agent doesn't know the specific metric naming convention." 
+ }, + { + "id": "vllm_tuning_specificity", + "file": "/root/report.md", + "question": "Does the report recommend specific vLLM configuration parameters (like --max-model-len, --gpu-memory-utilization, or tensor parallelism) for resolving GPU memory issues, rather than only recommending generic resource increases?", + "reference": "A skilled report should mention vLLM-specific tuning args like --max-model-len to limit KV cache size, --gpu-memory-utilization to control memory allocation, or tensor parallelism for multi-GPU distribution. Only recommending 'increase memory to 32Gi' without vLLM-specific configuration is insufficient." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/test.sh b/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/test_outputs.py b/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/test_outputs.py new file mode 100644 index 00000000..eb3755b2 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ai-observability/tests/test_outputs.py @@ -0,0 +1,91 @@ +""" +Tests for rh-ai-engineer__ai-observability per-skill evaluation. +Baseline tests: any competent agent should pass. +Skill-dependent tests: based on empirical gaps between skilled and unskilled agent outputs. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["monitor", "metric", "observ", "inference"]), ( + "report should mention monitoring or observability" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_tempo_distributed_tracing(self): + """Skill teaches Tempo for distributed tracing of inference requests. + Without skill, agents don't mention Tempo at all.""" + c = read_report().lower() + assert any(t in c for t in ["tempo", "distributed trac"]), ( + "should recommend Tempo for distributed tracing" + ) + + def test_korrel8r_correlation(self): + """Skill teaches Korrel8r for cross-domain signal correlation. 
+ Without skill, agents don't know about Korrel8r.""" + c = read_report().lower() + assert any(t in c for t in ["korrel8r", "cross-domain correlation"]), ( + "should mention Korrel8r for cross-domain correlation" + ) + + def test_dcgm_gpu_metric_names(self): + """Skill teaches DCGM-specific GPU metric names (DCGM_FI_DEV_*). + Without skill, agents use generic nvidia_gpu_memory_* names.""" + c = read_report() + assert any(t in c for t in ["DCGM_FI_DEV", "dcgm_fi_dev", "DCGM"]), ( + "should reference DCGM GPU metric names (not generic nvidia_gpu_*)" + ) + + def test_opentelemetry_instrumentation(self): + """Skill teaches OpenTelemetry for trace instrumentation on inference endpoints. + Without skill, agents don't mention OpenTelemetry.""" + c = read_report().lower() + assert any(t in c for t in ["opentelemetry", "otel"]), ( + "should recommend OpenTelemetry instrumentation" + ) + + def test_vllm_tuning_args(self): + """Skill teaches vLLM CLI args for memory management. + Without skill, agents recommend generic resource increases but not vLLM-specific tuning.""" + c = read_report().lower() + assert any(t in c for t in [ + "max-model-len", "max_model_len", "gpu-memory-utilization", + "gpu_memory_utilization", "tensor parallel", "tensor_parallel", + ]), "should mention vLLM-specific configuration args for resource tuning" + + def test_latency_percentiles(self): + """Both agents should report latency percentiles (easy test).""" + c = read_report().lower() + assert any(t in c for t in ["p50", "p95", "p99"]), ( + "should report latency with percentiles" + ) + + def test_tensor_parallel_size_tuning(self): + """Docs teach reducing --tensor-parallel-size as GPU scheduling triage step, + and OOM mitigation via --max-model-len and quantized models (AWQ/GPTQ/FP8). 
+ Without docs, agents don't know these vLLM tuning parameters.""" + c = read_report().lower() + assert any(t in c for t in [ + "tensor-parallel-size", "tensor_parallel_size", "tensor parallel", + "awq", "gptq", "fp8", "quantiz", + ]), "should address tensor-parallel-size and quantization for GPU tuning" diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/Dockerfile b/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/Dockerfile new file mode 100644 index 00000000..d4978abe --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + 
"mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. + +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": 
[ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + in_namespace = any(p["name"] == name for p in PODS_BY_NAMESPACE.get(namespace, [])) + if not in_namespace or name not in POD_LOGS: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return POD_LOGS[name] + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/instruction.md b/evaluation/with_skills/rh-ai-engineer__debug-inference/instruction.md new file mode 100644 index 00000000..11b9268d --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/instruction.md @@ -0,0 +1,13 @@ +# Inference Debugging Task + +You are an AI engineer on Red Hat OpenShift AI. There are failing model inference deployments in the `ml-production` namespace that need debugging. + +## Requirements +- List all InferenceServices in the `ml-production` namespace and identify which ones are not ready +- For each failing InferenceService, diagnose the root cause: check status conditions, pod state, container logs, events, and related resources (ServingRuntime, Account CRs) +- Recommend a specific fix for each failing deployment +- Document your methodology and the diagnostic steps you followed + +Use MCP tools to interact with the platform. Write your complete findings and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/solution/solve.sh b/evaluation/with_skills/rh-ai-engineer__debug-inference/solution/solve.sh new file mode 100644 index 00000000..6b94e02f --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/solution/solve.sh @@ -0,0 +1,36 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Inference Debug Report + +## Diagnosis Categories (get_inference_service verbosity full) + +### 1. ServingRuntime ✓ +ServingRuntime CR exists and is valid + +### 2. Pod Scheduling ✗ +Pod cannot be scheduled — check kserve-container logs (pods_log container=kserve-container) +Label selector: serving.kserve.io/inferenceservice + +### 3. 
Container Start +KServe sidecar containers may conflict with LimitRange + +### 4. Model Loading +Check model download and initialization + +### 5. GPU Access +Verify GPU allocation and CUDA compatibility + +### 6. Endpoint Health +Check InferenceService URL and readiness (PredictorReady, IngressReady conditions) + +## Events +events_list filtered by namespace for pod/InferenceService events + +## NIM Deployments +For NIM: Check Account CR (nim.opendatahub.io) for NGC credential errors + +## Observability (optional) +- korrel8r_get_correlated for cross-domain signals +- query_tempo_tool for trace latency +- execute_promql for custom metrics +REPORT_EOF diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/task.toml b/evaluation/with_skills/rh-ai-engineer__debug-inference/task.toml new file mode 100644 index 00000000..44c5ea2a --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__debug-inference" +name = "rh-ai-engineer Inference Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "debug-inference", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/llm_judge.py b/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/llm_judge.py new file mode 100644 index 00000000..748256ca --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/llm_judge.py @@ -0,0 +1,114 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic 
package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "kserve_conditions_awareness", + "file": "/root/report.md", + "question": "Does the report explicitly name the KServe InferenceService status conditions — specifically PredictorReady and IngressReady — and present them in a structured conditions table with Status/Reason/Message columns?", + "reference": "A skilled report should present a conditions table showing PredictorReady and IngressReady as distinct conditions with their status (True/False), reason, and message. Simply reporting 'CrashLoopBackOff' or 'pod failing' without naming the specific KServe conditions is insufficient." + }, + { + "id": "kserve_container_specificity", + "file": "/root/report.md", + "question": "Does the report mention 'kserve-container' by name as the specific container to inspect for logs, and reference the serving.kserve.io/inferenceservice label selector as the method for discovering predictor pods?", + "reference": "A skilled report should mention 'kserve-container' as the container name for log inspection and reference the serving.kserve.io/inferenceservice label selector for pod discovery. Generically saying 'check pod logs' or 'look at the container' without these specific KServe identifiers is insufficient." + }, + { + "id": "nim_account_cr_pattern", + "file": "/root/report.md", + "question": "Does the report prescribe creating a NIM Account custom resource (kind: Account) as the credential management mechanism for NVIDIA NIM, rather than only manually creating docker-registry secrets and patching service accounts?", + "reference": "A skilled report creates a NIM Account CR (kind: Account, apiVersion: nvidia.com/v1alpha1) with ngcSecret reference and imagePullSecret auto-creation. An unskilled report manually creates docker-registry secrets and patches service accounts without using the Account CR pattern." 
+ }, + { + "id": "ngc_credential_expiry", + "file": "/root/report.md", + "question": "Does the report identify NGC API key or pull-secret expiry as a possible root cause for image pull failures in NIM deployments, and recommend checking the secret's expiration date?", + "reference": "A skilled report checks whether the NGC pull-secret has expired as a diagnosis step for ImagePullBackOff. An unskilled report treats image pull failures generically without considering credential expiry." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/test.sh b/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + 'anthropic>=0.75.0' + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/test_outputs.py b/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/test_outputs.py new file mode 100644 index 00000000..60f73901 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__debug-inference/tests/test_outputs.py @@ -0,0 +1,98 @@ +""" +Tests for rh-ai-engineer__debug-inference per-skill evaluation. +Baseline tests: any competent agent should pass. +Skill-dependent tests: based on empirical gaps between skilled and unskilled agent outputs. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["inference", "model", "serving", "deploy"]), ( + "report should mention inference" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_kserve_status_conditions(self): + """Skill teaches presenting PredictorReady and IngressReady as distinct KServe conditions. + Without skill, agents report generic pod status (CrashLoopBackOff) without naming these conditions.""" + c = read_report().lower() + assert any(t in c for t in [ + "predictorready", "predictor ready", "predictor_ready", + "ingressready", "ingress ready", "ingress_ready", + ]), "should name KServe status conditions (PredictorReady, IngressReady)" + + def test_kserve_container_name(self): + """Skill teaches 'kserve-container' as the specific container for log inspection. 
+ Without skill, agents check logs generically without naming this container.""" + c = read_report().lower() + assert "kserve-container" in c or "kserve container" in c, ( + "should mention kserve-container by name as the container to inspect" + ) + + def test_label_selector_methodology(self): + """Skill teaches using serving.kserve.io/inferenceservice label to find predictor pods. + Without skill, agents discover pods through generic namespace listing.""" + c = read_report().lower() + assert any(t in c for t in [ + "serving.kserve.io", "kserve.io/inferenceservice", + ]), "should reference the KServe label selector for predictor pod discovery" + + def test_account_cr_awareness(self): + """Skill teaches NIM Account CR as the credential management mechanism. + Without skill, agents manually create docker-registry secrets and + patch service accounts instead of using the Account custom resource.""" + c = read_report() + assert any(t in c for t in [ + "Account CR", "kind: Account", "Account resource", + "Account custom resource", + ]) or "account cr" in c.lower(), ( + "should reference NIM Account CR as credential management mechanism" + ) + + def test_nim_api_version(self): + """Skill teaches the nvidia.com API group for NIM Account and ngcSecret + field for NGC credential binding. 
Without skill, agents create + generic secrets without the Account CR pattern.""" + c = read_report().lower() + assert any(t in c for t in [ + "nvidia.com/v1alpha1", "ngcsecret", "ngc_api_key", + ]) or ("account" in c and "api" in c and "nvidia" in c), ( + "should reference NIM Account API version or NGC secret binding" + ) + + def test_root_cause_with_remediation(self): + """Both agents should link diagnosis to fix — easy test.""" + c = read_report().lower() + has_diagnosis = any(t in c for t in ["oom", "memory", "crash", "fail"]) + has_fix = any(t in c for t in ["fix", "recommend", "solution", "increase", "reduce"]) + assert has_diagnosis and has_fix, "should link diagnosis to recommended fix" + + def test_ngc_pull_secret_expiry(self): + """Docs teach NGC pull-secret expiry as a common issue, and + 'Insufficient nvidia.com/gpu' as GPU scheduling error signature. + Without docs, agents miss these specific failure patterns.""" + c = read_report().lower() + assert any(t in c for t in [ + "ngc", "pull-secret", "pull secret", "expir", + "insufficient nvidia.com/gpu", "nvidia.com/gpu", + ]), "should address NGC pull-secret expiry or GPU scheduling errors" diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/Dockerfile b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/Dockerfile new file mode 100644 index 00000000..d4978abe --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: 
https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. 
+ +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + 
"conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..9b072b37 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,796 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode") or "not configured", + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + # Enforce what the error message promises: ASCII lowercase alphanumerics and hyphens only. + if not (name.isascii() and name.replace("-", "").isalnum() and name == name.lower() and len(name) <= 63): + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: +
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + # New projects store None here, so fall back explicitly instead of via the .get() default. + "model_serving_mode": proj.get("model_serving_mode") or "not configured", + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + +
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, + # Persist the requested limit so get_inference_service reports it instead of its "8Gi" fallback. + "memory_limit": memory_limit, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") +
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +@mcp.tool() +def setup_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Alias for configure_pipeline_server. Configure a pipeline server for a Data Science Project.""" + return configure_pipeline_server(namespace, data_connection, database) + + +@mcp.tool() +def get_pipeline_status(namespace: str) -> str: + """Alias for get_pipeline_server_status. 
Get the status of the pipeline server.""" + return get_pipeline_server_status(namespace) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/instruction.md b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/instruction.md new file mode 100644 index 00000000..03edef95 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/instruction.md @@ -0,0 +1,20 @@ +# Data Science Project Setup Task + +You are an AI engineer on Red Hat OpenShift AI. Your team needs a new project to host a fraud detection model, with access to an S3 model store and model serving enabled. 
+ +## Requirements +- Create a new data science project named `fraud-detection` with a descriptive display name and appropriate labels +- Configure an S3-compatible data connection for model artifact storage using these details: + - Connection name: `model-store` + - Bucket: `fraud-models` + - Endpoint: `https://s3.us-east-1.amazonaws.com` + - Access key: `AKIA1234EXAMPLE` + - Secret key: `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` + - Region: `us-east-1` +- Enable model serving on the project and confirm which serving runtimes become available +- Configure the pipeline server using the data connection you created +- Verify the project is fully configured and ready for model deployment + +Document your setup steps, the final project configuration, and any issues encountered in `/root/report.md`. Redact any credentials in the report. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/solution/solve.sh b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/solution/solve.sh new file mode 100644 index 00000000..91e7d226 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/solution/solve.sh @@ -0,0 +1,32 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Data Science Project Setup + +## Project: fraud-detection + +### Namespace Configuration +- Label: `opendatahub.io/dashboard: "true"` (required for RHOAI dashboard visibility) +- Model serving mode: `single` (single-model serving: each model runs on its own model server) + +### S3 Data Connection +- Name: model-store +- Bucket: fraud-models +- Endpoint: https://s3.us-east-1.amazonaws.com +- Access key / Secret key configured (credentials REDACTED in display) +- Region: us-east-1 + +### Pipeline Server +- Data connection: model-store (required for pipeline artifact storage) +- Pipeline server uses data connection for artifacts + +### Project
Status (get_project_status) +| Component | Status | +|-----------|--------| +| Project | fraud-detection (created) | +| Data connections | 1 configured | +| Pipeline server | configured | +| Model serving | single mode enabled | + +### Validation +- list_data_science_projects: checked for duplicate project name before create +REPORT_EOF diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/task.toml b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/task.toml new file mode 100644 index 00000000..83c47b69 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__ds-project-setup" +name = "rh-ai-engineer Data Science Project Setup Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "ds-project-setup", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/llm_judge.py b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/llm_judge.py new file mode 100644 index 00000000..2d6a34c6 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/llm_judge.py @@ -0,0 +1,102 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "partial_credential_redaction", + "file": "/root/report.md", + "question": "Does the report display S3 credentials using partial redaction — showing the first few characters of the access key (like AKIA****) with the rest masked — rather than 
using placeholder values or full redaction?", + "reference": "A skilled report shows credentials partially redacted: the first few characters of the access key visible followed by asterisks (e.g., 'AKIA****'), and the secret key fully masked ('********'). Using PLACEHOLDER_ACCESS_KEY or completely hiding the access key is insufficient — partial redaction allows verification without exposing the full credential." + }, + { + "id": "secret_manifest_structure", + "file": "/root/report.md", + "question": "Does the report include a Kubernetes Secret manifest (with kind: Secret, apiVersion, metadata, and data fields) showing how the S3 data connection is stored as a K8s resource, rather than just describing the connection narratively?", + "reference": "A skilled report shows the actual K8s Secret YAML structure with kind: Secret, metadata (namespace, name, labels), and data fields containing base64-encoded values. An unskilled report describes the data connection configuration narratively without showing the underlying K8s resource structure." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/test.sh b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/test_outputs.py b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/test_outputs.py new file mode 100644 index 00000000..8978be1d --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__ds-project-setup/tests/test_outputs.py @@ -0,0 +1,113 @@ +""" +Tests for rh-ai-engineer__ds-project-setup per-skill evaluation. +Baseline tests: any competent agent should pass. +Skill-dependent tests: based on empirical gaps between skilled and unskilled agent outputs. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["project", "data science", "namespace"]), ( + "report should mention the project" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_data_connection_secret_keys(self): + """Skill teaches RHOAI data connections are stored as K8s Secrets with specific + key names: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_S3_BUCKET, + AWS_S3_ENDPOINT. Without skill, agents describe connections abstractly.""" + c = read_report() + aws_keys = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_S3_BUCKET", + "AWS_S3_ENDPOINT", "AWS_DEFAULT_REGION"] + mentioned = sum(1 for k in aws_keys if k in c) + assert mentioned >= 2, ( + "should reference specific RHOAI data connection secret key names (AWS_*)" + ) + + def test_credential_partial_redaction(self): + """Skill teaches showing first 4 chars + **** for credentials (e.g., AKIA****). 
+ Without skill, agents use PLACEHOLDER values or full redaction.""" + c = read_report() + has_partial = any(t in c for t in [ + "AKIA****", "AKIA*", "wJal****", "wJal*", + "1234****", "1234*", + ]) + has_stars_with_prefix = "****" in c and any(t in c for t in ["AKIA", "akia"]) + assert has_partial or has_stars_with_prefix, ( + "should use partial credential redaction (first chars visible + ****)" + ) + + def test_k8s_secret_yaml_manifest(self): + """Skill teaches showing the K8s Secret manifest structure for data connections. + Without skill, agents describe connections narratively without YAML.""" + c = read_report() + has_secret_kind = "kind: Secret" in c or "kind:Secret" in c + has_secret_ref = "Secret" in c and ("apiVersion" in c or "metadata" in c) + assert has_secret_kind or has_secret_ref, ( + "should include K8s Secret manifest structure for data connection" + ) + + def test_pipeline_server_with_data_connection(self): + """Skill teaches pipeline server requires a data connection (prerequisite chain). + Without skill, agents skip pipeline server or configure it generically.""" + c = read_report().lower() + has_pipeline = any(t in c for t in ["pipeline server", "pipeline"]) + has_linkage = any(t in c for t in [ + "data connection", "model-store", "artifact storage", + "s3 bucket", "data_connection", + ]) + pipeline_configured = "pipeline" in c and "configured" in c and "not configured" not in c + assert has_pipeline and (has_linkage or pipeline_configured), ( + "should configure pipeline server linked to a data connection" + ) + + def test_base64_secret_values(self): + """Skill teaches showing actual base64-encoded secret values in K8s + Secret YAML manifests. 
Without skill, agents show credentials in + plain text or fully redacted format.""" + c = read_report() + import re + has_base64 = bool(re.search(r'[A-Za-z0-9+/]{12,}={0,2}', c)) + has_opaque = "Opaque" in c + assert has_base64 or has_opaque, ( + "should include base64-encoded values or Opaque secret type in K8s manifest" + ) + + def test_model_serving_mode(self): + """Both agents should configure model serving — easy test.""" + c = read_report().lower() + assert any(t in c for t in [ + "single", "multi", "model serving", "serving mode", + ]), "should configure model serving mode" + + def test_runtime_selection_context(self): + """Docs teach decision context across runtimes: vLLM (PagedAttention), + NIM (TensorRT-LLM, no compilation), Caikit+TGIS (gRPC-only). + Without docs, agents don't provide runtime comparison context.""" + c = read_report().lower() + assert any(t in c for t in [ + "pagedattention", "paged attention", "tensorrt", "grpc", + "caikit", "vllm", "nim", + ]) and any(t in c for t in ["runtime", "serving", "comparison", "select"]), ( + "should compare runtimes with technical characteristics" + ) diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/Dockerfile b/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/Dockerfile new file mode 100644 index 00000000..d4978abe --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: 
ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. 
+ +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + 
"conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, "memory_limit": memory_limit, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc in isvcs.values(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/instruction.md b/evaluation/with_skills/rh-ai-engineer__model-deploy/instruction.md new file mode 100644 index 00000000..44f79a58 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/instruction.md @@ -0,0 +1,15 @@ +# Model Deployment Task + +You are an AI engineer on Red Hat OpenShift AI. Your team has trained models ready to serve and needs them deployed as inference endpoints in the `ml-production` project. + +## Requirements +- Examine the existing project, available serving runtimes, and any existing deployments +- Diagnose any failing deployments: check pod conditions, container status, logs, and events to determine root causes +- For GPU memory issues, provide a VRAM budget analysis showing model weight size, KV cache requirements, and available GPU memory — distinguish GPU VRAM constraints from pod system memory limits +- Before recommending fixes, check the namespace environment for resource policies and GPU node scheduling constraints that could block redeployment +- For each failing deployment, provide a complete KServe InferenceService YAML manifest with your recommended fix +- Produce a deployment plan that addresses all identified issues and gets the models serving successfully + +Document your deployment plan, diagnosed issues, environment validation, and recommended fixes in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/solution/solve.sh b/evaluation/with_skills/rh-ai-engineer__model-deploy/solution/solve.sh new file mode 100644 index 00000000..05b7171e --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/solution/solve.sh @@ -0,0 +1,63 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Model Deployment Plan + +## Diagnosed Issues + +### GPU VRAM Budget Analysis +The vLLM OOM is a **GPU VRAM constraint**, not a pod system memory issue: +- Model weights: ~13.5 GiB loaded into GPU +- KV cache allocation: ~28.5 GiB (at default max_model_len=32768) +- Available VRAM after model load: ~10.1 GiB on A10G (24576 MiB total) +- **Root cause**: Default max_model_len=32768 causes KV cache to exhaust GPU VRAM +- **Fix**: Set MAX_MODEL_LEN=4096 and cap GPU_MEMORY_UTILIZATION at 0.85 + +### LimitRange Conflict +- Namespace LimitRange min CPU: 100m +- KServe sidecar containers request: 10m CPU, 15Mi memory +- **CONFLICT**: Sidecar resources below LimitRange minimum +- Fix: Adjust LimitRange or use annotation to override + +### GPU Node Taints +- GPU nodes may have taint ai-app=true:NoSchedule +- Add matching tolerations to InferenceService predictor spec + +### NIMAccount Dependency +- NIM deployments require a NIMAccount CR to be ready before ServingRuntime can pull images +- Check for NIMAccountNotReady condition if ImagePullBackOff occurs + +## Recommended InferenceService YAML + +```yaml +apiVersion: serving.kserve.io/v1beta1 +kind: InferenceService +metadata: + name: llama-3-8b + namespace: ml-production + annotations: + serving.kserve.io/deploymentMode: RawDeployment +spec: + predictor: + model: + modelFormat: + name: vLLM + runtime: vllm-runtime + storageUri: "hf://meta-llama/Llama-3-8B" + resources: + requests: + cpu: "4" + memory: "32Gi" + nvidia.com/gpu: "1" + containers: + - name: kserve-container + env: + - name: MAX_MODEL_LEN + value: "4096" + - name: GPU_MEMORY_UTILIZATION + value: "0.85" +``` + 
+## Endpoint +- Call get_model_endpoint to retrieve the inference URL +- vLLM: /v1/completions, KServe v2: /v2/models/[model]/infer +REPORT_EOF diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/task.toml b/evaluation/with_skills/rh-ai-engineer__model-deploy/task.toml new file mode 100644 index 00000000..90674851 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__model-deploy" +name = "rh-ai-engineer Model Deployment Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "model-deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/llm_judge.py b/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/llm_judge.py new file mode 100644 index 00000000..5cd7c20e --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "vram_budget_methodology", + "file": "/root/report.md", + "question": "Does the report present a specific GPU VRAM budget calculation for the vLLM OOM issue — showing the model weight size (~13.5 GiB), the KV cache allocation requirement (~28.5 GiB), and the available VRAM after model load (~10.1 GiB) — and explicitly state that this is a GPU VRAM constraint, NOT a pod system memory issue?", + "reference": "A skilled report shows a VRAM budget breakdown: model weights (~13.5 GiB) loaded into 
GPU, KV cache requiring ~28.5 GiB, but only ~10.1 GiB available on the 24 GB A10G after model load. It explicitly distinguishes GPU VRAM from pod memory (system RAM). A report that says 'OOMKilled' and recommends increasing pod memory from 16Gi to 32Gi WITHOUT this GPU VRAM analysis is insufficient." + }, + { + "id": "rhoai_deployment_conventions", + "file": "/root/report.md", + "question": "Does the report use RHOAI-specific deployment conventions such as the RawDeployment annotation and GPU_MEMORY_UTILIZATION environment variable configuration, rather than generic Kubernetes deployment patterns?", + "reference": "A skilled report uses serving.kserve.io/deploymentMode: RawDeployment annotation and configures vLLM tuning parameters (GPU_MEMORY_UTILIZATION, MAX_MODEL_LEN) as environment variables in the InferenceService spec. It also identifies NIMAccount CR dependencies for NIM deployments. A report that uses generic Kubernetes deployments or command-line args without RHOAI-specific annotations is insufficient." + }, + { + "id": "kserve_yaml_manifest", + "file": "/root/report.md", + "question": "Does the report include a complete KServe InferenceService YAML manifest with the serving.kserve.io/v1beta1 apiVersion, including metadata (name, namespace) and spec.predictor with model format, storage URI, resource requests, and GPU count?", + "reference": "A skilled report provides a deployable InferenceService YAML with apiVersion: serving.kserve.io/v1beta1, kind: InferenceService, and a complete spec including predictor with model format, runtime reference, storage URI, resource requests (CPU, memory, GPU), and environment variables (VLLM_MAX_MODEL_LEN). A report that only describes fixes in narrative or MCP tool call format without a formal YAML manifest is insufficient." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/test.sh b/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +# Quote the version specifier: unquoted, the shell parses ">=0.75.0" as a redirect to a file named "=0.75.0" +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/test_outputs.py b/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/test_outputs.py new file mode 100644 index 00000000..0669d687 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__model-deploy/tests/test_outputs.py @@ -0,0 +1,94 @@ +""" +Tests for rh-ai-engineer__model-deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["model", "deploy", "inference", "serving"]), ( + "report should mention model deployment" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_vram_budget_analysis(self): + """Skill teaches GPU VRAM budget: model weights (13.5 GiB) + KV cache (28.5 GiB) + exceeds A10G capacity (24 GB). Without skill, agents report OOM with approximate + numbers (~14GB) without KV cache sizing or available VRAM calculation.""" + c = read_report() + assert any(t in c for t in [ + "28.5", "10.1 GiB", "10.1 GB", "24576", + ]), ( + "should include specific VRAM budget numbers " + "(KV cache size ~28.5 GiB, available VRAM ~10.1 GiB, or total GPU VRAM 24576 MiB)" + ) + + def test_default_context_window_32768(self): + """Skill teaches that vLLM default max_model_len=32768 causes KV cache to exhaust + GPU VRAM on A10G. 
Without skill, agents report OOM without identifying the specific + default value that triggers the oversized KV cache allocation.""" + c = read_report() + assert "32768" in c or "32,768" in c, ( + "should identify max_model_len=32768 as the specific vLLM default causing GPU OOM" + ) + + def test_kserve_yaml_apiversion(self): + """Skill teaches creating InferenceService YAML with serving.kserve.io/v1beta1. + Without skill, agents describe fixes via MCP tool calls or narrative without + providing a formal KServe YAML manifest with the correct apiVersion.""" + c = read_report() + assert "serving.kserve.io/v1beta1" in c, ( + "should include InferenceService YAML manifest with serving.kserve.io/v1beta1 apiVersion" + ) + + def test_raw_deployment_mode(self): + """Skill teaches using serving.kserve.io/deploymentMode: RawDeployment annotation + for RHOAI model deployments. Without skill, agents omit this RHOAI-specific + annotation, which controls how KServe deploys the predictor.""" + c = read_report() + assert "RawDeployment" in c or "deploymentMode" in c, ( + "should include RawDeployment annotation (RHOAI deployment mode)" + ) + + def test_known_model_profile(self): + """Docs teach known model profiles: e.g., Llama 3.1 8B needs 1 GPU with 16GB VRAM, + --max-model-len=4096; 70B needs 4xA100 80GB with --tensor-parallel-size=4. + Without docs, agents can't size GPU allocation per model.""" + c = read_report().lower() + assert any(t in c for t in [ + "max-model-len", "max_model_len", "tensor-parallel-size", + "tensor_parallel_size", "16gb", "a100", "a10g", + ]) or ("gpu" in c and ("vram" in c or "model" in c and "profile" in c)), ( + "should reference known model GPU profiles for deployment sizing" + ) + + def test_nim_account_cr(self): + """Skill teaches that NIM deployments require a NIMAccount CR to be ready + before the ServingRuntime can pull images. 
Without skill, agents diagnose + ImagePullBackOff generically without identifying the NIMAccount dependency.""" + c = read_report() + assert any(t in c for t in [ + "NIMAccount", "NimAccount", "nim-account", "NIM Account", + "NIMAccountNotReady", + ]), "should identify NIMAccount CR as prerequisite for NIM deployment" diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/Dockerfile b/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/Dockerfile new file mode 100644 index 00000000..d4978abe --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + 
"mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..d43c891d --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,540 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. + +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import base64 +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + 
{"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_create_or_update( + api_version: str, + kind: str, + namespace: str, + name: str, + body: str, +) -> str: + """Create or update a Kubernetes resource. Accepts apiVersion, kind, namespace, name, and body (JSON).""" + try: + resource = json.loads(body) + except json.JSONDecodeError as e: + raise ValueError(f"Invalid JSON body: {e}") from e + + resource.setdefault("metadata", {}) + resource["metadata"]["name"] = name + resource["metadata"]["namespace"] = namespace + resource["apiVersion"] = api_version + resource["kind"] = kind + + if kind == "Secret": + resource.setdefault("type", "Opaque") + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"Secret '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + if kind in ("NIMAccount", "Account") and "nim" in api_version.lower(): + resource.setdefault("status", {}) + resource["status"]["conditions"] = [ + { + "type": "Ready", + "status": "True", + "reason": "NGCCredentialsValid", + "message": "NGC API key validated successfully", + "lastTransitionTime": "2026-03-17T12:00:00Z", + }, + ] + resource["status"]["nimPullSecretStatus"] = "Ready" + 
resource["status"]["nimConfigStatus"] = "Ready" + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"NIM Account '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + if kind == "ConfigMap": + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"ConfigMap '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + raise ValueError(f"Unsupported kind for create/update: {kind}") + + +@mcp.tool() +def create_secret( + namespace: str, + name: str, + data: dict, + type: str = "Opaque", +) -> str: + """Create a Secret in a namespace. data is a dict of key-value pairs (values will be base64-encoded).""" + if isinstance(data, str): + data = json.loads(data) + encoded_data = {k: base64.b64encode(str(v).encode()).decode() for k, v in data.items()} + resource = { + "apiVersion": "v1", + "kind": "Secret", + "metadata": {"name": name, "namespace": namespace}, + "type": type, + "data": encoded_data, + } + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"Secret '{name}' created in namespace '{namespace}'", + }, indent=2) + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": 
[lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/instruction.md b/evaluation/with_skills/rh-ai-engineer__nim-setup/instruction.md new file mode 100644 index 00000000..f0b5fa2c --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__nim-setup/instruction.md @@ -0,0 +1,17 @@ +# NVIDIA NIM Setup Task + +You are an AI engineer on Red Hat OpenShift AI. Your team wants to deploy NVIDIA NIM for GPU-accelerated inference, but the cluster has not been set up for it yet. + +## Scenario +The ML team needs to serve models using NVIDIA's inference microservices. The cluster has GPUs available, but the necessary platform components and credentials have not been configured. You need to assess readiness and produce a complete setup plan. + +## Requirements +- Verify operator prerequisites (GPU Operator and NFD Operator) by checking their ClusterServiceVersion status +- Assess the current cluster state to determine what NIM infrastructure is already in place and what is missing +- Document the complete setup procedure including: the exact Kubernetes Secret manifests (with types, data key names, and structure) needed for NGC authentication, and the NIM Account custom resource with its correct API group and spec fields +- Provide the YAML manifests for each resource that needs to be created, using the correct RHOAI-specific API versions and resource naming conventions +- Flag any potential issues or blockers discovered during your assessment + +Document your assessment and setup plan in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/solution/solve.sh b/evaluation/with_skills/rh-ai-engineer__nim-setup/solution/solve.sh new file mode 100644 index 00000000..accbf7fe --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__nim-setup/solution/solve.sh @@ -0,0 +1,28 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# NIM Setup Plan + +## Prerequisites +- GPU Operator CSV in nvidia-gpu-operator namespace (gpu-operator-certified) +- NFD (Node Feature Discovery) in openshift-nfd + +## NGC Secrets +- API key secret: ngc-api-key (NGC_API_KEY) +- Image pull secret: ngc-image-pull-secret + - Registry: nvcr.io + - Username: $oauthtoken + - Password: NGC API key + +## NIM Account CR (nim.opendatahub.io/v1) +```yaml +apiVersion: nim.opendatahub.io/v1 +kind: Account +metadata: + name: nim-account +spec: + apiKeySecret: + name: ngc-api-key + imagePullSecret: + name: ngc-image-pull-secret +``` +REPORT_EOF diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/task.toml b/evaluation/with_skills/rh-ai-engineer__nim-setup/task.toml new file mode 100644 index 00000000..7b53288a --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__nim-setup/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__nim-setup" +name = "rh-ai-engineer NVIDIA NIM Setup Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "nim-setup", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/llm_judge.py b/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/llm_judge.py new file mode 100644 index 00000000..a3c29b06 --- /dev/null +++
b/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "opendatahub_api_group", + "file": "/root/report.md", + "question": "Does the report use nim.opendatahub.io as the API group for the NIM Account custom resource, rather than the upstream nim.nvidia.com?", + "reference": "A skilled report specifies apiVersion: nim.opendatahub.io/v1 for the Account CR, which is the RHOAI-specific API group. An unskilled report uses nim.nvidia.com/v1alpha1 (the upstream NVIDIA API group) which is incorrect for Red Hat OpenShift AI." + }, + { + "id": "secret_naming_and_types", + "file": "/root/report.md", + "question": "Does the report create an image pull secret named ngc-image-pull-secret with type kubernetes.io/dockerconfigjson, and an API key secret with stringData containing the NGC_API_KEY field?", + "reference": "A skilled report creates ngc-image-pull-secret (type: kubernetes.io/dockerconfigjson) for nvcr.io registry access, and ngc-api-key (type: Opaque, stringData: NGC_API_KEY) for runtime auth. An unskilled report uses generic names like nvcr-credentials, kubectl shorthands without explicit types, or data.api_key instead of stringData.NGC_API_KEY." + }, + { + "id": "operator_csv_verification", + "file": "/root/report.md", + "question": "Does the report verify gpu-operator-certified and NFD (Node Feature Discovery) Operator as prerequisites, checking their ClusterServiceVersion status?", + "reference": "A skilled report checks for gpu-operator-certified (the specific CSV name, not just 'gpu-operator') and the NFD Operator in openshift-nfd namespace. An unskilled report either skips NFD entirely or uses generic gpu-operator references without the certified CSV name." 
+ } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + 
print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/test.sh b/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" +
+pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + 
llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/test_outputs.py b/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/test_outputs.py new file mode 100644 index 00000000..ad1f22ef --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__nim-setup/tests/test_outputs.py @@ -0,0 +1,89 @@ +""" +Tests for rh-ai-engineer__nim-setup per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert "nim" in content, "report should mention NIM" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_opendatahub_nim_api(self): + """Skill teaches nim.opendatahub.io as the RHOAI API group for NIM Account CR. + Without skill, agents use upstream nim.nvidia.com API group.""" + c = read_report() + assert "nim.opendatahub.io" in c, ( + "should use nim.opendatahub.io as the NIM Account CR API group (not nim.nvidia.com)" + ) + + def test_ngc_image_pull_secret_name(self): + """Skill teaches ngc-image-pull-secret as the specific secret name for nvcr.io. + Without skill, agents use generic names like nvcr-credentials.""" + c = read_report() + assert "ngc-image-pull-secret" in c, ( + "should use ngc-image-pull-secret as the image pull secret name" + ) + + def test_dockerconfigjson_secret_type(self): + """Skill teaches kubernetes.io/dockerconfigjson as the secret type for image pull. 
+ Without skill, agents use kubectl docker-registry shorthand without explicit type.""" + c = read_report().lower() + assert "dockerconfigjson" in c, ( + "should specify dockerconfigjson as the image pull secret type" + ) + + def test_gpu_operator_certified_csv(self): + """Skill teaches checking gpu-operator-certified CSV by name. + Without skill, agents check generically for gpu-operator.""" + c = read_report().lower() + assert "gpu-operator-certified" in c, ( + "should verify gpu-operator-certified ClusterServiceVersion by name" + ) + + def test_nfd_operator_reference(self): + """Skill teaches verifying NFD (Node Feature Discovery) Operator as a prerequisite. + Without skill, agents skip NFD verification entirely.""" + c = read_report().lower() + assert "nfd" in c, ( + "should verify NFD (Node Feature Discovery) Operator as a prerequisite" + ) + + def test_stringdata_secret_field(self): + """Skill teaches using stringData in Secret YAML for NGC API key (no base64 needed). + Without skill, agents use kubectl --from-literal or data with base64.""" + c = read_report() + assert "stringData" in c or "stringdata" in c.lower(), ( + "should use stringData field in Secret YAML manifest for API key" + ) + + def test_nvidia_gpu_only(self): + """Docs emphasize NIM requires NVIDIA GPUs only; fallback to vLLM when + NVIDIA GPUs unavailable. 
Without docs, agents don't mention this constraint.""" + c = read_report().lower() + assert any(t in c for t in [ + "nvidia gpu", "nvidia only", "fallback", "vllm", + ]) and ("nim" in c or "gpu" in c), ( + "should note NIM requires NVIDIA GPUs with vLLM fallback" + ) diff --git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/Dockerfile b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/Dockerfile new file mode 100644 index 00000000..d4978abe --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + 
"openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..cad5f77b --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,529 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. + +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": 
[ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_create_or_update( + api_version: str, + kind: str, + namespace: str, + name: str, + body: str, +) -> str: + """Create or update a Kubernetes resource. Accepts apiVersion, kind, namespace, name, and body (JSON).""" + try: + resource = json.loads(body) + except json.JSONDecodeError as e: + raise ValueError(f"Invalid JSON body: {e}") from e + + resource.setdefault("metadata", {}) + resource["metadata"]["name"] = name + resource["metadata"]["namespace"] = namespace + resource["apiVersion"] = api_version + resource["kind"] = kind + + if kind == "ServingRuntime": + resource.setdefault("status", {}) + resource["status"]["conditions"] = [ + { + "type": "Ready", + "status": "True", + "reason": "ServingRuntimeReady", + "message": "ServingRuntime is ready", + "lastTransitionTime": "2026-03-17T12:00:00Z", + }, + ] + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"ServingRuntime '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + if kind == "Secret": + resource.setdefault("type", "Opaque") + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"Secret '{name}' 
created/updated in namespace '{namespace}'", + }, indent=2) + + if kind in ("NIMAccount", "Account") and "nim" in api_version.lower(): + resource.setdefault("status", {}) + resource["status"]["conditions"] = [ + { + "type": "Ready", + "status": "True", + "reason": "NGCCredentialsValid", + "message": "NGC API key validated successfully", + "lastTransitionTime": "2026-03-17T12:00:00Z", + }, + ] + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"NIM Account '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + if kind == "ConfigMap": + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"ConfigMap '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + raise ValueError(f"Unsupported kind for create/update: {kind}") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label 
selector.""" + pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/instruction.md b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/instruction.md new file mode 100644 index 00000000..d89e7c6a --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/instruction.md @@ -0,0 +1,19 @@ +# Serving Runtime Configuration Task + +You are an AI engineer on Red Hat OpenShift AI. Your team needs to serve a model using a custom inference engine that is not available as a default runtime on the platform. + +## Scenario +The existing platform-provided serving runtimes do not support the model format your team needs. You must create a custom runtime configuration that integrates properly with the platform and can be used to deploy models. + +## Requirements +- Examine the currently available serving runtimes and platform templates, distinguishing which are already instantiated versus which require instantiation before use +- Design a custom ServingRuntime CR that specifies the inference container, supported model formats, resource requirements, and API protocol +- Follow KServe container naming conventions so the runtime integrates correctly with the platform's model serving framework +- For runtimes supporting multiple model formats, explain how autoSelect should be configured to avoid format conflicts +- Explain where GPU resource allocation belongs (in the ServingRuntime vs in the InferenceService) and why +- Ensure the runtime will be visible and usable from the platform dashboard +- Document your design decisions and trade-offs + +Document your configuration plan and the complete runtime specification in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/solution/solve.sh b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/solution/solve.sh new file mode 100644 index 00000000..043771f9 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/solution/solve.sh @@ -0,0 +1,31 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# ServingRuntime Configuration + +## Custom Runtime: triton-onnx + +Platform templates: list_serving_runtimes with include_templates: true. Templates with requires_instantiation: true use create_serving_runtime. + +```yaml +apiVersion: serving.kserve.io/v1alpha1 +kind: ServingRuntime +metadata: + name: triton-onnx-runtime + labels: + opendatahub.io/dashboard: "true" +spec: + supportedModelFormats: + - name: onnx + version: "1" + autoSelect: true + multiModel: false + containers: + - name: kserve-container + image: nvcr.io/nvidia/tritonserver:latest + ports: + - containerPort: 8080 + protocol: TCP +``` + +### Key: supportedModelFormats.name must match InferenceService modelFormat.name +REPORT_EOF diff --git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/task.toml b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/task.toml new file mode 100644 index 00000000..8ee93afa --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__serving-runtime-config" +name = "rh-ai-engineer Serving Runtime Configuration Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "serving-runtime-config", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff 
--git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/llm_judge.py b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/llm_judge.py new file mode 100644 index 00000000..11fdec60 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "kserve_container_naming", + "file": "/root/report.md", + "question": "Does the ServingRuntime YAML in the report name the main container 'kserve-container' (the required KServe naming convention)?", + "reference": "A skilled report names the container kserve-container in the ServingRuntime spec, which is required by KServe for the model serving framework to function correctly. An unskilled report might use a framework-specific name like 'triton' or 'vllm', which would cause KServe integration issues." + }, + { + "id": "gpu_allocation_strategy", + "file": "/root/report.md", + "question": "Does the report explain that GPU resources should NOT be hardcoded in the ServingRuntime and instead should be allocated at the InferenceService level for flexibility?", + "reference": "A skilled report explains that GPU resources (nvidia.com/gpu) belong at the InferenceService deployment level because different models need 0, 1, or multiple GPUs. The ServingRuntime should remain GPU-agnostic. An unskilled report hardcodes nvidia.com/gpu: 1 directly in the ServingRuntime spec." 
+ }, + { + "id": "autoselect_and_api_conventions", + "file": "/root/report.md", + "question": "Does the report configure autoSelect: false for non-primary model formats and use the correct ServingRuntime API version (v1alpha1)?", + "reference": "A skilled report uses autoSelect: true only for the primary format and false for secondary formats to prevent conflicts, and uses the serving.kserve.io/v1alpha1 API version for ServingRuntime (distinct from v1beta1 used for InferenceService). An unskilled report sets autoSelect: true for all formats or uses the wrong API version." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/test.sh b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/test_outputs.py b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/test_outputs.py new file mode 100644 index 00000000..71257bf2 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__serving-runtime-config/tests/test_outputs.py @@ -0,0 +1,97 @@ +""" +Tests for rh-ai-engineer__serving-runtime-config per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["servingruntime", "serving runtime", "runtime"]), ( + "report should mention ServingRuntime" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_kserve_container_name(self): + """Skill teaches the main container MUST be named kserve-container for KServe + compatibility. Without skill, agents use framework-specific names like 'triton'.""" + c = read_report() + assert "kserve-container" in c, ( + "should name the main container 'kserve-container' (required by KServe)" + ) + + def test_serving_runtime_api_version(self): + """Skill teaches ServingRuntime uses serving.kserve.io/v1alpha1 API (alpha, + not beta like InferenceService). 
Without skill, agents use v1beta1 or omit + the apiVersion distinction between ServingRuntime and InferenceService.""" + c = read_report() + assert "v1alpha1" in c or ( + "alpha" in c.lower() and "serving" in c.lower() + ), "should use v1alpha1 API version for ServingRuntime" + + def test_autoselect_false_for_secondary(self): + """Skill teaches using autoSelect: true only for primary format and false for + secondary formats to avoid conflicts. Without skill, agents set true for all.""" + c = read_report().lower() + assert "autoselect: false" in c or "autoselect\":false" in c or "autoselect\": false" in c, ( + "should use autoSelect: false for non-primary model formats" + ) + + def test_gpu_at_inferenceservice_level(self): + """Skill teaches not hardcoding GPU in ServingRuntime; GPU allocation belongs + at the InferenceService level for flexibility. Without skill, agents hardcode + nvidia.com/gpu in the runtime spec.""" + c = read_report().lower() + assert any(t in c for t in [ + "inferenceservice level", "inferenceservice deployment", + "per inferenceservice", "not specified in the servingruntime", + "gpu allocation happens at", + ]), "should explain GPU allocation belongs at InferenceService level, not in the runtime" + + def test_model_format_matching(self): + """Skill teaches that supportedModelFormats must match InferenceService model + format for runtime selection.""" + c = read_report().lower() + assert any(t in c for t in [ + "model format", "supportedmodelformat", "supported model format", + "inferenceservice", "match", + ]), "should address model format matching for runtime selection" + + def test_dashboard_label(self): + """Skill teaches opendatahub.io/dashboard label for dashboard visibility.""" + c = read_report().lower() + assert any(t in c for t in [ + "opendatahub", "dashboard", "label", "visible", + "platform", "display", + ]), "should address dashboard/platform visibility via labels" + + def test_caikit_tgis_grpc(self): + """Docs teach Caikit+TGIS 
is gRPC-only (no REST API) and NIM uses + TensorRT-LLM with pre-compiled engines. Without docs, agents assume REST + for all runtimes.""" + c = read_report().lower() + assert any(t in c for t in [ + "grpc", "caikit", "tgis", "tensorrt", + ]) and ("runtime" in c or "serving" in c), ( + "should note Caikit+TGIS gRPC-only or NIM TensorRT-LLM characteristics" + ) diff --git a/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/Dockerfile b/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/Dockerfile new file mode 100644 index 00000000..d4978abe --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + 
+COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. + +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + 
"nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..12513127 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,866 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode") or "not configured", + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + # Reject uppercase, underscores, and non-ASCII so the check matches the error message. + if not (name.isascii() and name == name.lower() and name.replace("-", "").isalnum()) or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + # The key exists with value None on new projects, so a .get() default would never apply. + "model_serving_mode": proj.get("model_serving_mode") or "not configured", + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "cpu_limit": cpu_limit, + "memory_request": memory_request, + # Store the limit so get_inference_service reports the requested value, not its default. + "memory_limit": memory_limit, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def get_workbench_url(namespace: str, name: str) -> str: + """Get the URL for accessing a running workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") 
+ if wb["status"] != "Running": + return json.dumps({ + "namespace": namespace, + "name": name, + "url": "", + "error": f"Workbench is not running (status: {wb['status']}). Start it first.", + }) + url = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + return json.dumps({ + "namespace": namespace, + "name": name, + "url": url, + "status": wb["status"], + }) + + +@mcp.tool() +def list_workbench_storage(namespace: str, name: str) -> str: + """List PVC details for a workbench including size, usage, access mode.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + volumes = [ + { + "pvc_name": wb.get("pvc_name", f"{name}-pvc"), + "size": wb.get("pvc_size", "20Gi"), + "usage": "12Gi", # Mock usage + "access_mode": wb.get("pvc_access_mode", "ReadWriteOnce"), + "mount_path": "/opt/app-root/data", + }, + ] + # Include additional volumes if any + for extra in wb.get("extra_volumes", []): + volumes.append(extra) + return json.dumps({ + "namespace": namespace, + "workbench": name, + "volumes": volumes, + }, indent=2) + + +@mcp.tool() +def add_workbench_storage( + namespace: str, + workbench_name: str, + pvc_name: str, + mount_path: str, + size: str, +) -> str: + """Add additional storage volume to a workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == workbench_name), None) + if not wb: + raise ValueError(f"Workbench '{workbench_name}' not found in '{namespace}'") + extra = wb.setdefault("extra_volumes", []) + extra.append({ + "pvc_name": pvc_name, + "size": size, + "usage": "0", + "access_mode": "ReadWriteOnce", + "mount_path": mount_path, + }) + return json.dumps({ + "status": 
"added", + "namespace": namespace, + "workbench": workbench_name, + "pvc_name": pvc_name, + "mount_path": mount_path, + "size": size, + }) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-ai-engineer__workbench-manage/instruction.md b/evaluation/with_skills/rh-ai-engineer__workbench-manage/instruction.md new file mode 100644 index 00000000..39b97c27 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/instruction.md @@ -0,0 +1,13 @@ +# Workbench Management Task + +You are an AI engineer on Red Hat OpenShift AI. Your data science team needs workbenches set up for model development, and some existing workbenches need cleanup. + +## Requirements +- Review existing workbenches in the project: their status, resource usage, and notebook images +- Plan a new workbench for a data scientist who needs PyTorch with 4 CPUs, 16Gi memory, and 50Gi persistent storage +- Identify any stopped or unused workbenches that should be cleaned up to free resources +- Document the lifecycle procedures: how to stop a workbench to save resources, restart it, and safely delete one + +Document your workbench assessment, creation plan, and cleanup recommendations in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-ai-engineer__workbench-manage/solution/solve.sh b/evaluation/with_skills/rh-ai-engineer__workbench-manage/solution/solve.sh new file mode 100644 index 00000000..49e5cc92 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/solution/solve.sh @@ -0,0 +1,25 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Workbench Creation Plan + +## Workbench: fraud-analysis +Project/Namespace: fraud-detection + +### Storage (create_storage) +- PVC: 20Gi, access mode: ReadWriteOnce +- Namespace validated via list_data_science_projects + +### Configuration (create_workbench) +- Image: Jupyter Data Science Notebook (from list_notebook_images) +- CPU: 2 +- Memory: 8Gi +- Storage: 20Gi + +### Lifecycle +- start_workbench / stop_workbench for running/stopped state +- get_workbench_url: OAuth-protected notebook URL for access + +### Delete Warnings +- delete_workbench: Data loss warning — unsaved work lost, action cannot be undone +- delete_storage: Separate confirmation for PVC deletion — permanent data loss +REPORT_EOF diff --git a/evaluation/with_skills/rh-ai-engineer__workbench-manage/task.toml b/evaluation/with_skills/rh-ai-engineer__workbench-manage/task.toml new file mode 100644 index 00000000..6c538b09 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__workbench-manage" +name = "rh-ai-engineer Workbench Management Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "workbench-manage", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git 
a/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/llm_judge.py b/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/llm_judge.py new file mode 100644 index 00000000..b7792ec1 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "stop_vs_delete_data", "file": "/root/report.md", "question": "Does the report explain that stopping a workbench preserves PVC data while deleting it requires a separate decision about the storage?", "reference": "A skilled report distinguishes stop (preserves) from delete (data loss risk). An unskilled report treats stop and delete equivalently."}, + {"id": "notebook_image_discovery", "file": "/root/report.md", "question": "Does the report describe discovering or listing available notebook images before creating a workbench?", "reference": "A skilled report lists available notebook images (via list_notebook_images or equivalent) to guide workbench creation. An unskilled report skips image discovery and assumes a default."}, + {"id": "storage_access_mode_awareness", "file": "/root/report.md", "question": "Does the report mention the PVC access mode (ReadWriteOnce or RWO) when describing workbench storage configuration or provisioning?", "reference": "A skilled report specifies the storage access mode (ReadWriteOnce) for PVC provisioning, showing awareness of storage class constraints. An unskilled report describes storage size but omits access mode details."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/test.sh b/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/test_outputs.py b/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/test_outputs.py new file mode 100644 index 00000000..59f74eec --- /dev/null +++ b/evaluation/with_skills/rh-ai-engineer__workbench-manage/tests/test_outputs.py @@ -0,0 +1,73 @@ +""" +Tests for rh-ai-engineer__workbench-manage per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["workbench", "notebook"]), ( + "report should mention workbench or notebook" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_stop_preserves_data(self): + """Skill teaches: stopping a workbench preserves PVC data; only delete removes it.""" + c = read_report().lower() + assert any(t in c for t in [ + "stop", "preserve", "data", "pvc", "storage", + "stopped", "restart", "start again", + ]), "should explain that stop preserves data vs delete" + + def test_delete_pvc_warning(self): + """Skill teaches: deleting workbench requires separate confirmation for PVC; warn about permanent data loss.""" + c = read_report().lower() + assert any(t in c for t in [ + "pvc", "delete", "data loss", "permanent", "warning", + "volume", "storage", "backup", "cannot be undone", + ]), "should warn about PVC/data loss on deletion" + + def test_lifecycle_operations(self): + """Skill teaches: create, 
start, stop, delete with distinct implications.""" + c = read_report().lower() + ops = sum(1 for t in ["start", "stop", "delet", "creat"] if t in c) + assert ops >= 2, "should describe lifecycle operations (create, start, stop, delete)" + + def test_list_notebook_images_tool(self): + """Skill teaches: list_notebook_images MCP tool to discover available notebook images.""" + c = read_report().lower() + assert any(t in c for t in ["list_notebook_images", "notebook images", "available images"]), ( + "should reference list_notebook_images tool (skill)" + ) + + def test_gpu_tuning_awareness(self): + """Docs teach GPU scheduling triage and OOM mitigation using + model/context-size controls for workbenches with GPU resources. + Without docs, agents don't address GPU resource tuning.""" + c = read_report().lower() + assert any(t in c for t in [ + "gpu", "oom", "context size", "max-model-len", "memory", + ]) and any(t in c for t in ["workbench", "notebook", "resource", "gpu"]), ( + "should address GPU/OOM tuning for workbench resources" + ) diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/Dockerfile b/evaluation/with_skills/rh-developer__containerize-deploy/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: 
admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__containerize-deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. 
order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + 
"ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": 
"256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with
APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", 
+ "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application 
source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + 
{ + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/buildconfig.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/buildconfig.yaml.template new file mode 100644 index 00000000..b3294eb2 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/buildconfig.yaml.template @@ -0,0 +1,38 @@ +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: build + app.kubernetes.io/part-of: ${APP_NAME} +spec: + source: + type: Git + git: + uri: ${GIT_URL} + ref: ${GIT_BRANCH} + strategy: + type: Source + sourceStrategy: + from: + kind: DockerImage + name: ${BUILDER_IMAGE} + env: [] + output: + to: + kind: ImageStreamTag + name: ${APP_NAME}:latest + triggers: + - type: ConfigChange + - type: ImageChange + runPolicy: Serial + resources: + limits: + memory: "1Gi" + cpu: "1" + requests: + memory: "512Mi" + cpu: "500m" diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/deployment.yaml.template new file mode 100644 index 00000000..eb3b481a --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + 
app.kubernetes.io/component: application + app.kubernetes.io/part-of: ${APP_NAME} + annotations: + image.openshift.io/triggers: | + [{"from":{"kind":"ImageStreamTag","name":"${APP_NAME}:latest"},"fieldPath":"spec.template.spec.containers[0].image"}] +spec: + replicas: ${REPLICAS} + selector: + matchLabels: + app: ${APP_NAME} + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + template: + metadata: + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + spec: + containers: + - name: ${APP_NAME} + image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/${APP_NAME}:latest + ports: + - containerPort: ${CONTAINER_PORT} + protocol: TCP + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + livenessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + env: [] + restartPolicy: Always + terminationGracePeriodSeconds: 30 diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/Chart.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/Chart.yaml.template new file mode 100644 index 00000000..1aa22dd1 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/Chart.yaml.template @@ -0,0 +1,13 @@ +apiVersion: v2 +name: ${APP_NAME} +description: ${APP_DESCRIPTION} +type: application +version: 0.1.0 +appVersion: "${APP_VERSION}" +keywords: + - ${LANGUAGE} + - ${FRAMEWORK} + - openshift +maintainers: + - name: ${MAINTAINER_NAME} + email: ${MAINTAINER_EMAIL} diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/NOTES.txt.template 
b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/NOTES.txt.template new file mode 100644 index 00000000..154e628d --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/NOTES.txt.template @@ -0,0 +1,32 @@ +Congratulations! Your application {{ include "${APP_NAME}.fullname" . }} has been deployed. + +{{- if .Values.route.enabled }} + +Access your application at: +{{- if .Values.route.host }} + https://{{ .Values.route.host }} +{{- else }} + Run: oc get route {{ include "${APP_NAME}.fullname" . }} -o jsonpath='{.spec.host}' +{{- end }} + +{{- else }} + +Your application is available internally at: + {{ include "${APP_NAME}.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.service.port }} + +To expose it externally, create a Route or set route.enabled=true. + +{{- end }} + +Useful commands: + # View pods + oc get pods -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} + + # View logs + oc logs -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} -f + + # Upgrade release + helm upgrade {{ .Release.Name }} ./{{ .Chart.Name }} -f values.yaml + + # Uninstall release + helm uninstall {{ .Release.Name }} diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/_helpers.tpl.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/_helpers.tpl.template new file mode 100644 index 00000000..15873b10 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/_helpers.tpl.template @@ -0,0 +1,60 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "${APP_NAME}.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. 
+*/}} +{{- define "${APP_NAME}.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "${APP_NAME}.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "${APP_NAME}.labels" -}} +helm.sh/chart: {{ include "${APP_NAME}.chart" . }} +{{ include "${APP_NAME}.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "${APP_NAME}.selectorLabels" -}} +app.kubernetes.io/name: {{ include "${APP_NAME}.name" . }} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "${APP_NAME}.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "${APP_NAME}.fullname" .) 
.Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/deployment.yaml.template new file mode 100644 index 00000000..a6cbd868 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + selector: + matchLabels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "${APP_NAME}.serviceAccountName" . }} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 8 }} + containers: + - name: {{ .Chart.Name }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + ports: + - name: http + containerPort: {{ .Values.service.port }} + protocol: TCP + livenessProbe: + {{- toYaml .Values.livenessProbe | nindent 12 }} + readinessProbe: + {{- toYaml .Values.readinessProbe | nindent 12 }} + resources: + {{- toYaml .Values.resources | nindent 12 }} + {{- with .Values.env }} + env: + {{- toYaml . 
| nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/route.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/route.yaml.template new file mode 100644 index 00000000..e2bab29a --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/route.yaml.template @@ -0,0 +1,24 @@ +{{- if .Values.route.enabled }} +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if .Values.route.host }} + host: {{ .Values.route.host }} + {{- end }} + to: + kind: Service + name: {{ include "${APP_NAME}.fullname" . }} + weight: 100 + port: + targetPort: http + {{- with .Values.route.tls }} + tls: + termination: {{ .termination }} + insecureEdgeTerminationPolicy: {{ .insecureEdgeTerminationPolicy }} + {{- end }} + wildcardPolicy: None +{{- end }} diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/service.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/service.yaml.template new file mode 100644 index 00000000..837bc888 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/templates/service.yaml.template @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . 
| nindent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + - port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + name: http + selector: + {{- include "${APP_NAME}.selectorLabels" . | nindent 4 }} diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/values.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/values.yaml.template new file mode 100644 index 00000000..1cca6017 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/helm/values.yaml.template @@ -0,0 +1,67 @@ +# Default values for ${APP_NAME} +replicaCount: 1 + +image: + repository: ${IMAGE_REPOSITORY} + pullPolicy: IfNotPresent + tag: "${IMAGE_TAG}" + +imagePullSecrets: [] +nameOverride: "" +fullnameOverride: "" + +serviceAccount: + create: true + annotations: {} + name: "" + +podAnnotations: {} +podSecurityContext: {} +securityContext: {} + +service: + type: ClusterIP + port: ${CONTAINER_PORT} + +route: + enabled: true + host: "" + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 5 + periodSeconds: 5 + +autoscaling: + enabled: false + minReplicas: 1 + maxReplicas: 5 + targetCPUUtilizationPercentage: 80 + +nodeSelector: {} +tolerations: [] +affinity: {} + +env: [] +# - name: MY_VAR +# value: "my-value" diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/imagestream.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/imagestream.yaml.template new file mode 100644 index 00000000..46572193 --- /dev/null +++ 
b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/imagestream.yaml.template @@ -0,0 +1,13 @@ +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: image + app.kubernetes.io/part-of: ${APP_NAME} +spec: + lookupPolicy: + local: false diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/route.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/route.yaml.template new file mode 100644 index 00000000..7c53d2e7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/route.yaml.template @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: route + app.kubernetes.io/part-of: ${APP_NAME} +spec: + to: + kind: Service + name: ${APP_NAME} + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/service.yaml.template b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/service.yaml.template new file mode 100644 index 00000000..7e1cf371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/service.yaml.template @@ -0,0 +1,20 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: service + app.kubernetes.io/part-of: ${APP_NAME} +spec: + selector: + app: ${APP_NAME} + ports: + - name: http + port: ${CONTAINER_PORT} + targetPort: 
${CONTAINER_PORT} + protocol: TCP + type: ClusterIP + sessionAffinity: None diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-container-rootful.service b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-container-rootful.service new file mode 100644 index 00000000..c1e8fe8f --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-container-rootful.service @@ -0,0 +1,27 @@ +# Rootful Podman container managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-container-rootless.service b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-container-rootless.service new file mode 100644 index 00000000..ca9dc371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-container-rootless.service @@ -0,0 +1,27 @@ +# Rootless Podman container managed by systemd (user service) +# Location: ~/.config/systemd/user/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - 
Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=default.target diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-native.service b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-native.service new file mode 100644 index 00000000..c55cfc07 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/environment/templates/systemd/systemd-native.service @@ -0,0 +1,39 @@ +# Native application managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${SERVICE_USER} - User to run the service as +# ${APP_PATH} - Application install path (e.g., /opt/app-name) +# ${PORT} - Application listen port +# ${START_COMMAND} - Application start command +# +# Start command examples by language: +# Node.js: /usr/bin/node ${APP_PATH}/server.js +# Python: /usr/bin/python3 ${APP_PATH}/app.py +# Java: /usr/bin/java -jar ${APP_PATH}/app.jar +# Go: ${APP_PATH}/binary-name + +[Unit] +Description=${APP_NAME} Service +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=${SERVICE_USER} +WorkingDirectory=${APP_PATH} +Environment=PORT=${PORT} +ExecStart=${START_COMMAND} +Restart=always +RestartSec=5 + +# Security hardening +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +ReadWritePaths=${APP_PATH} + +[Install] +WantedBy=multi-user.target diff --git 
a/evaluation/with_skills/rh-developer__containerize-deploy/instruction.md b/evaluation/with_skills/rh-developer__containerize-deploy/instruction.md new file mode 100644 index 00000000..42797f3c --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/instruction.md @@ -0,0 +1,15 @@ +# Containerization and Deployment Task + +You are a Red Hat developer. Your team has a Python web application that needs to be containerized and deployed to OpenShift. You need to evaluate the available approaches and recommend the best one. + +## Requirements +- Examine the application source and determine its language, dependencies, and build requirements +- Compare containerization strategies (e.g., S2I, Dockerfile, Helm chart) and explain the trade-offs of each for this application +- If a multi-stage Dockerfile approach is viable, include a working example showing build and runtime stages +- Recommend the best approach with a clear justification +- Define the deployment configuration including: resource requests/limits, all three probe types (startup, liveness, readiness), autoscaling (HPA), and how external traffic will reach the application +- Address application-specific concerns like database connection pooling configuration + +Document your strategy evaluation, recommendation, and deployment plan in `/root/report.md`. + +Use MCP tools to examine the environment. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/solution/solve.sh b/evaluation/with_skills/rh-developer__containerize-deploy/solution/solve.sh new file mode 100644 index 00000000..713efa82 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/solution/solve.sh @@ -0,0 +1,23 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Containerize and Deploy Plan + +## Phase 1: Detect +- Language: Python +- Framework: Django +- Entry point: manage.py + +## Phase 2: Strategy +- Target: OpenShift +- Strategy: S2I (recommended for Python on OpenShift) +- Alternative: Dockerfile with multi-stage build + +## Phase 3: Build +- Builder image: ubi9/python-311 +- APP_MODULE: myproject.wsgi:application + +## Phase 4: Deploy +- Deployment + Service + Route +- Port: 8000 (Django default) +- On failure: /debug-pod, /debug-build, /debug-network +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/task.toml b/evaluation/with_skills/rh-developer__containerize-deploy/task.toml new file mode 100644 index 00000000..9022cd22 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__containerize-deploy" +name = "rh-developer End-to-End Containerize and Deploy Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "containerize-deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/tests/llm_judge.py b/evaluation/with_skills/rh-developer__containerize-deploy/tests/llm_judge.py new file mode 100644 index 00000000..0dc24c7f --- /dev/null 
+++ b/evaluation/with_skills/rh-developer__containerize-deploy/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "multistage_and_startup_probe", + "file": "/root/report.md", + "question": "Does the report include both a multi-stage Dockerfile example (with COPY --from=builder or AS builder) AND a startup probe configuration?", + "reference": "A skilled report includes a multi-stage Dockerfile showing build and runtime stages with COPY --from=builder, and configures a startupProbe in addition to liveness/readiness probes. An unskilled report provides only a single-stage Dockerfile and only liveness/readiness probes without startup probe." + }, + { + "id": "hpa_and_pool_config", + "file": "/root/report.md", + "question": "Does the report include a HorizontalPodAutoscaler manifest (with autoscaling/v2 API) AND database connection pool configuration (SQLALCHEMY_POOL or equivalent)?", + "reference": "A skilled report includes a complete HPA YAML with kind: HorizontalPodAutoscaler and autoscaling/v2 API, plus SQLAlchemy connection pool settings (pool_size, pool_recycle). An unskilled report mentions autoscaling conceptually without the manifest, and skips connection pool configuration." + }, + { + "id": "strategy_comparison_depth", + "file": "/root/report.md", + "question": "Does the report compare at least 3 containerization strategies (S2I, Dockerfile, Helm) with specific trade-offs and a justified recommendation?", + "reference": "A skilled report provides a detailed comparison table of S2I, Dockerfile, and Helm with pros/cons/trade-offs for each, leading to a justified recommendation. An unskilled report may compare strategies superficially without detailed trade-offs." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/tests/test.sh b/evaluation/with_skills/rh-developer__containerize-deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v
2>&1 + +pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-developer__containerize-deploy/tests/test_outputs.py b/evaluation/with_skills/rh-developer__containerize-deploy/tests/test_outputs.py new file mode 100644 index 00000000..5f7eec38 --- /dev/null +++ b/evaluation/with_skills/rh-developer__containerize-deploy/tests/test_outputs.py @@ -0,0 +1,110 @@ +""" +Tests for rh-developer__containerize-deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_containerization(self): + content = read_report().lower() + assert any(t in content for t in ["container", "deploy", "image"]), ( + "report should mention containerization or deployment" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_startup_probe(self): + """Skill docs teach startup probe in addition to liveness/readiness. + Without skill, agents typically only include liveness and readiness probes.""" + c = read_report() + assert "startupProbe" in c or "startup probe" in c.lower() or "startupprobe" in c.lower(), ( + "should include startup probe configuration (startupProbe YAML key)" + ) + + def test_multistage_dockerfile_example(self): + """Skill docs teach multi-stage Dockerfile with COPY --from=builder pattern. 
+ Without skill, agents mention multi-stage conceptually but don't provide the example.""" + c = read_report() + assert "COPY --from=" in c or "AS builder" in c or "copy --from=" in c.lower(), ( + "should include a multi-stage Dockerfile example with COPY --from= or AS builder syntax" + ) + + def test_hpa_autoscaling_config(self): + """Skill docs teach complete HPA configuration with autoscaling API. + Without skill, agents mention autoscaling conceptually but skip the manifest.""" + c = read_report() + assert "HorizontalPodAutoscaler" in c or "autoscaling/v2" in c, ( + "should include HorizontalPodAutoscaler manifest or autoscaling/v2 API reference" + ) + + def test_connection_pool_config(self): + """Skill docs teach application-specific database connection pooling with + SQLAlchemy settings. Without skill, agents skip pool configuration details.""" + c = read_report() + assert any(t in c for t in [ + "SQLALCHEMY_POOL", "pool_size", "POOL_SIZE", + "pool_recycle", "POOL_RECYCLE", + ]), "should include SQLAlchemy connection pool settings (pool_size, pool_recycle)" + + def test_strategy_comparison(self): + """Skill teaches comparing at least 2 containerization strategies with trade-offs.""" + c = read_report().lower() + strategies = ["s2i", "dockerfile", "helm", "podman", "source-to-image"] + mentioned = sum(1 for s in strategies if s in c) + assert mentioned >= 2, "should compare at least 2 containerization strategies" + + def test_session_affinity_config(self): + """Skill docs teach explicit sessionAffinity configuration in Service spec. + Without skill, agents skip this detail in the Service definition.""" + c = read_report().lower() + assert "sessionaffinity" in c or "session affinity" in c, ( + "should specify sessionAffinity in Service configuration" + ) + + def test_app_module_s2i_entrypoint(self): + """Skill teaches APP_MODULE environment variable for S2I Python startup + (e.g., app:app). 
Without skill, agents don't know this S2I-specific + configuration for WSGI entry point discovery.""" + c = read_report() + assert "APP_MODULE" in c or "app:app" in c or "APP_SCRIPT" in c, ( + "should reference APP_MODULE or app:app S2I entrypoint configuration" + ) + + def test_gunicorn_worker_formula(self): + """Skill teaches Gunicorn worker count formula: (2 × CPU cores) + 1. + Without skill, agents hardcode worker count without the sizing formula.""" + c = read_report() + assert any(t in c for t in [ + "2 * cores", "2 × CPU", "(2 * cores) + 1", "2 × cores", + "2*cores", "2 * cpu", "2x CPU", "2 x cores", + ]) or ("worker" in c.lower() and ("formula" in c.lower() or "cores" in c.lower())), ( + "should include Gunicorn worker count formula based on CPU cores" + ) + + def test_sqlalchemy_engine_options(self): + """Skill teaches SQLALCHEMY_ENGINE_OPTIONS configuration for advanced + pool tuning. Without skill, agents configure individual pool parameters + but miss the unified engine options dict.""" + c = read_report() + assert "SQLALCHEMY_ENGINE_OPTIONS" in c or "engine_options" in c, ( + "should include SQLALCHEMY_ENGINE_OPTIONS for advanced pool configuration" + ) diff --git a/evaluation/with_skills/rh-developer__debug-build/environment/Dockerfile b/evaluation/with_skills/rh-developer__debug-build/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-build/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: 
https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__debug-build/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__debug-build/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..5f7e49b1 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-build/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,755 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. 
web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + 
"strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": 
"CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py 
...\n" + 
"Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + { + "name": "api-service-2", + "namespace": "api-platform", + "status": "Failed", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + 
"output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "1m48s", + "reason": "AssembleFailed", + "message": "Assemble script failed with exit code 1", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "api-service-2": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into 
'/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.110.0\n" + "Collecting uvicorn==0.27.1\n" + "Collecting pydantic==2.6.0\n" + "Collecting psycopg2==2.9.9\n" + " ERROR: Could not build wheels for psycopg2, which is required to install pyproject.toml-based projects\n" + " error: subprocess-exited-with-error\n" + " × Running setup.py install for psycopg2 did not run successfully.\n" + " │ exit code: 1\n" + " ╰─> [25 lines of output]\n" + " Error: pg_config executable not found.\n" + " pg_config is required to build psycopg2 from source.\n" + " Please add the directory containing pg_config to the $PATH\n" + " or specify the full executable path with the option:\n" + " python setup.py build_ext --pg-config /path/to/pg_config\n" + " note: This error originates from a subprocess, and is likely not a problem with pip.\n" + "error: legacy-install-failure\n" + "---> Assemble script FAILED with exit code 1\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source 
...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + 
"name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__debug-build/instruction.md b/evaluation/with_skills/rh-developer__debug-build/instruction.md new file mode 100644 index 00000000..2cfea7f9 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-build/instruction.md @@ -0,0 +1,14 @@ +# Build Debugging Task + +You are a Red Hat developer. An OpenShift Source-to-Image (S2I) build is failing. Investigate the build process to identify and fix the issue. + +## Requirements +- Examine the build configuration and logs +- Identify which S2I build phase is failing (fetch, pull, assemble, commit, push) +- If the fix involves S2I customization, explain how S2I assemble scripts can be extended or overridden +- Provide multiple fix options with concrete commands or file changes, using the appropriate package manager for UBI-based builder images +- Recommend a fix + +Use MCP tools to examine the cluster. Document your methodology, findings, and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__debug-build/solution/solve.sh b/evaluation/with_skills/rh-developer__debug-build/solution/solve.sh new file mode 100644 index 00000000..1e0579ec --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-build/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Build Debug Report + +## Build Failure Analysis + +### S2I Build Phases +1. Fetching source ✓ +2. Pulling builder image ✓ +3. **Assemble** ✗ (FAILED) +4. Commit (not reached) +5. 
Push (not reached) + +### Root Cause +Assemble phase failed: likely dependency installation error in pip install. + +### Fix +- Check requirements.txt for version conflicts (gunicorn, APP_MODULE) +- Verify builder image compatibility (python:3.11-ubi9) +- Retry: `oc start-build api-service -n api-platform --follow` +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__debug-build/task.toml b/evaluation/with_skills/rh-developer__debug-build/task.toml new file mode 100644 index 00000000..af5ff817 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-build/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-build" +name = "rh-developer Build Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-build", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__debug-build/tests/llm_judge.py b/evaluation/with_skills/rh-developer__debug-build/tests/llm_judge.py new file mode 100644 index 00000000..7bfd7911 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-build/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "s2i_custom_assemble", + "file": "/root/report.md", + "question": "Does the report mention .s2i/bin/assemble as a way to customize the S2I build process, and reference the default assemble script path at /usr/libexec/s2i/assemble?", + "reference": "A skilled report shows creating a .s2i/bin/assemble script that installs 
missing packages and then calls /usr/libexec/s2i/assemble (the default assemble script). An unskilled report recommends a custom Dockerfile or builder image instead of using S2I customization hooks." + }, + { + "id": "phase_diagnosis_and_remediation", + "file": "/root/report.md", + "question": "Does the report identify which S2I phase (fetch, assemble, commit, push) failed and provide concrete oc commands for remediation?", + "reference": "A skilled report breaks down the build into phases, identifies the failing phase, and provides actionable commands like 'oc start-build' to retry. An unskilled report gives a generic build failure description." + }, + { + "id": "systematic_build_analysis", + "file": "/root/report.md", + "question": "Does the report follow a systematic approach: inspecting the BuildConfig, analyzing build logs by phase, checking related resources (secrets, imagestreams), and providing structured findings with concrete remediation?", + "reference": "A skilled report follows a structured debugging workflow: BuildConfig analysis, phase-by-phase log analysis, related resource checks, and categorized findings with concrete remediation commands. An unskilled report gives ad-hoc observations without systematic investigation." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__debug-build/tests/test.sh b/evaluation/with_skills/rh-developer__debug-build/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-build/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__debug-build/tests/test_outputs.py b/evaluation/with_skills/rh-developer__debug-build/tests/test_outputs.py new file mode 100644 index 00000000..c3ac3895 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-build/tests/test_outputs.py @@ -0,0 +1,77 @@ +""" +Tests for rh-developer__debug-build per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_build(self): + content = read_report().lower() + assert "build" in content, "report should mention builds" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_s2i_custom_assemble_script(self): + """Skill teaches creating .s2i/bin/assemble to extend the S2I build process. + Without skill, agents recommend Dockerfile or custom builder image instead.""" + c = read_report() + assert ".s2i/bin/assemble" in c or ".s2i/bin" in c, ( + "should mention .s2i/bin/assemble as a way to customize the S2I build" + ) + + def test_default_assemble_path(self): + """Skill teaches invoking the default S2I assemble script at /usr/libexec/s2i/assemble. 
+ Without skill, agents don't know the default script path.""" + c = read_report() + assert "/usr/libexec/s2i/" in c or "libexec/s2i" in c, ( + "should reference the default S2I assemble script at /usr/libexec/s2i/" + ) + + def test_package_manager_awareness(self): + """Report should mention package installation approach for the builder image.""" + c = read_report().lower() + assert any(t in c for t in ["microdnf", "dnf", "yum", "package manager", "install package"]), ( + "should mention package installation approach for the builder image" + ) + + def test_s2i_phase_breakdown(self): + """Skill teaches S2I phases (fetch, pull, assemble, commit, push).""" + c = read_report().lower() + phases = ["assemble", "fetch", "pull", "push", "commit"] + mentioned = sum(1 for p in phases if p in c) + assert mentioned >= 2, ( + "should identify S2I build phases (skill teaches phase-by-phase diagnosis)" + ) + + def test_concrete_remediation_command(self): + """Skill teaches providing concrete oc/command remediation.""" + c = read_report().lower() + assert any(t in c for t in ["oc ", "oc start-build", "oc create", "oc import", "retry"]) or ( + "```" in read_report() and ("oc" in c or "bash" in c) + ), "should include concrete remediation commands" + + def test_dependency_fix_suggestion(self): + """Report should suggest concrete dependency fixes for the failing build.""" + c = read_report().lower() + assert any(t in c for t in [ + "psycopg", "pip install", "requirements", "dependency", "package" + ]), "should suggest concrete dependency fixes for the failing build" diff --git a/evaluation/with_skills/rh-developer__debug-container/environment/Dockerfile b/evaluation/with_skills/rh-developer__debug-container/environment/Dockerfile new file mode 100644 index 00000000..a4c2cd43 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install 
-y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "podman": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-podman-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__debug-container/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__debug-container/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for 
rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + 
"image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# 
--------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + 
"api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": 
"image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running 
assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route 
data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", 
"reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + 
], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ + {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus 
application. Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__debug-container/environment/mcp-servers/mock-podman-mcp.py b/evaluation/with_skills/rh-developer__debug-container/environment/mcp-servers/mock-podman-mcp.py new file mode 100644 index 00000000..3d86ba08 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/environment/mcp-servers/mock-podman-mcp.py @@ -0,0 +1,396 @@ +#!/usr/bin/env python3 +"""Mock Podman MCP Server for container debugging evaluation. + +Simulates a local Podman environment with several containers, including +one that was OOMKilled and one that exits at startup on a missing dependency. + +Scenario: + - myapp-web: Exited (137) - OOMKilled, memory limit 256m too low + - myapp-worker: Exited (1) - missing Python dependency 'celery' + - nginx-proxy: Running, healthy + - postgres-db: Running, healthy +""" + +import json +from typing import Optional + +from fastmcp import FastMCP + +mcp = FastMCP("podman") + +NOW = "2026-03-02T12:00:00Z" + +CONTAINERS = { + "a1b2c3d4e5f6": { + "Id": "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890", + "Names": ["myapp-web"], + "Image": "myapp:latest", + "ImageID": "sha256:abc123def456789012345678901234567890abcdef1234567890abcdef123456", + "Created": "2026-03-01T10:00:00Z", + "State": { + "Status": "exited", + "Running": False, + "Paused": False, + "Restarting": False, + "OOMKilled": True, + "Dead": False, + "Pid": 0, + "ExitCode": 137, + "Error": "", + "StartedAt": "2026-03-01T10:00:05Z", + "FinishedAt": "2026-03-02T08:45:12Z", + }, + "Config": { + "Entrypoint": ["python3"], + "Cmd": ["-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"], + "WorkingDir": "/app", + 
"User": "1001", + "Env": [ + "APP_ENV=production", + "DATABASE_URL=postgresql://db:5432/myapp", + "WORKERS=4", + "MAX_REQUESTS=1000", + ], + "ExposedPorts": {"8080/tcp": {}}, + }, + "HostConfig": { + "Memory": 268435456, + "MemorySwap": 268435456, + "CpuQuota": 100000, + "CpuPeriod": 100000, + "PortBindings": {"8080/tcp": [{"HostIp": "0.0.0.0", "HostPort": "8080"}]}, + "Binds": ["/data/myapp:/app/data:rw"], + }, + "Mounts": [ + {"Type": "bind", "Source": "/data/myapp", "Destination": "/app/data", "Mode": "rw"}, + ], + }, + "b2c3d4e5f6a7": { + "Id": "b2c3d4e5f6a7890123456789abcdef1234567890abcdef1234567890abcdef12", + "Names": ["myapp-worker"], + "Image": "myapp:latest", + "ImageID": "sha256:abc123def456789012345678901234567890abcdef1234567890abcdef123456", + "Created": "2026-03-01T10:00:00Z", + "State": { + "Status": "exited", + "Running": False, + "Paused": False, + "Restarting": False, + "OOMKilled": False, + "Dead": False, + "Pid": 0, + "ExitCode": 1, + "Error": "", + "StartedAt": "2026-03-01T10:00:08Z", + "FinishedAt": "2026-03-01T10:00:12Z", + }, + "Config": { + "Entrypoint": ["python3"], + "Cmd": ["-m", "celery", "-A", "tasks", "worker", "--loglevel=info"], + "WorkingDir": "/app", + "User": "1001", + "Env": [ + "APP_ENV=production", + "DATABASE_URL=postgresql://db:5432/myapp", + "CELERY_BROKER_URL=redis://redis:6379/0", + ], + }, + "HostConfig": { + "Memory": 536870912, + "MemorySwap": 1073741824, + "CpuQuota": 0, + "CpuPeriod": 0, + }, + "Mounts": [], + }, + "c3d4e5f6a7b8": { + "Id": "c3d4e5f6a7b8901234567890abcdef1234567890abcdef1234567890abcdef12", + "Names": ["nginx-proxy"], + "Image": "nginx:1.25", + "ImageID": "sha256:def456789012345678901234567890abcdef1234567890abcdef1234567890ab", + "Created": "2026-02-28T08:00:00Z", + "State": { + "Status": "running", + "Running": True, + "Paused": False, + "Restarting": False, + "OOMKilled": False, + "Dead": False, + "Pid": 12345, + "ExitCode": 0, + "Error": "", + "StartedAt": "2026-02-28T08:00:05Z", + 
"FinishedAt": "0001-01-01T00:00:00Z", + }, + "Config": { + "Entrypoint": ["/docker-entrypoint.sh"], + "Cmd": ["nginx", "-g", "daemon off;"], + "WorkingDir": "", + "User": "", + "Env": ["NGINX_PORT=80"], + "ExposedPorts": {"80/tcp": {}, "443/tcp": {}}, + }, + "HostConfig": { + "Memory": 0, + "MemorySwap": 0, + "CpuQuota": 0, + "CpuPeriod": 0, + "PortBindings": { + "80/tcp": [{"HostIp": "0.0.0.0", "HostPort": "80"}], + "443/tcp": [{"HostIp": "0.0.0.0", "HostPort": "443"}], + }, + }, + "Mounts": [ + {"Type": "bind", "Source": "/etc/nginx/conf.d", "Destination": "/etc/nginx/conf.d", "Mode": "ro"}, + ], + }, + "d4e5f6a7b8c9": { + "Id": "d4e5f6a7b8c9012345678901abcdef1234567890abcdef1234567890abcdef12", + "Names": ["postgres-db"], + "Image": "postgres:15", + "ImageID": "sha256:789012345678901234567890abcdef1234567890abcdef1234567890abcdef12", + "Created": "2026-02-25T12:00:00Z", + "State": { + "Status": "running", + "Running": True, + "Paused": False, + "Restarting": False, + "OOMKilled": False, + "Dead": False, + "Pid": 23456, + "ExitCode": 0, + "Error": "", + "StartedAt": "2026-02-25T12:00:10Z", + "FinishedAt": "0001-01-01T00:00:00Z", + }, + "Config": { + "Entrypoint": ["docker-entrypoint.sh"], + "Cmd": ["postgres"], + "WorkingDir": "", + "User": "postgres", + "Env": [ + "POSTGRES_DB=myapp", + "POSTGRES_USER=app", + "PGDATA=/var/lib/postgresql/data", + ], + "ExposedPorts": {"5432/tcp": {}}, + }, + "HostConfig": { + "Memory": 1073741824, + "MemorySwap": 2147483648, + "CpuQuota": 0, + "CpuPeriod": 0, + "PortBindings": {"5432/tcp": [{"HostIp": "127.0.0.1", "HostPort": "5432"}]}, + }, + "Mounts": [ + {"Type": "volume", "Source": "pgdata", "Destination": "/var/lib/postgresql/data", "Mode": "rw"}, + ], + }, +} + +LOGS = { + "myapp-web": ( + "INFO: Started server process [1]\n" + "INFO: Waiting for application startup.\n" + "INFO: Application startup complete.\n" + "INFO: Uvicorn running on http://0.0.0.0:8080\n" + "INFO: Loading ML model into memory...\n" + "INFO: Model 
size: 1.2GB\n" + "WARNING: Memory usage at 89% of limit (237MB/256MB)\n" + "INFO: Processing request batch (32 items)\n" + "WARNING: Memory usage at 95% of limit (248MB/256MB)\n" + "WARNING: Memory pressure detected, attempting GC\n" + "INFO: GC freed 12MB, usage now at 92%\n" + "INFO: Processing request batch (64 items)\n" + "CRITICAL: Memory usage exceeded limit\n" + "Killed\n" + ), + "myapp-worker": ( + "Traceback (most recent call last):\n" + ' File "/usr/lib/python3.11/runpy.py", line 198, in _run_module_as_main\n' + ' return _run_code(code, main_globals, None,\n' + ' File "/usr/lib/python3.11/runpy.py", line 88, in _run_code\n' + ' exec(code, run_globals)\n' + "ModuleNotFoundError: No module named 'celery'\n" + ), + "nginx-proxy": ( + "2026/02/28 08:00:05 [notice] 1#1: nginx/1.25.4\n" + "2026/02/28 08:00:05 [notice] 1#1: built by gcc 12.2.0\n" + "2026/02/28 08:00:05 [notice] 1#1: OS: Linux 5.14.0-362.el9.x86_64\n" + "2026/02/28 08:00:05 [notice] 1#1: start worker processes\n" + "2026/02/28 08:00:05 [notice] 1#1: start worker process 29\n" + "2026/02/28 08:00:05 [notice] 1#1: start worker process 30\n" + ), + "postgres-db": ( + "PostgreSQL init process complete; ready for start up.\n" + '2026-02-25 12:00:10.123 UTC [1] LOG: starting PostgreSQL 15.5\n' + '2026-02-25 12:00:10.456 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432\n' + '2026-02-25 12:00:10.789 UTC [1] LOG: database system is ready to accept connections\n' + ), +} + +IMAGES = [ + { + "Id": "sha256:abc123def456789012345678901234567890abcdef1234567890abcdef123456", + "RepoTags": ["myapp:latest"], + "Created": "2026-02-28T15:30:00Z", + "Size": 1345678901, + "VirtualSize": 1345678901, + "Labels": {"maintainer": "dev@myapp.io", "version": "2.1.0"}, + "Config": { + "Entrypoint": ["python3"], + "Cmd": ["-m", "uvicorn", "main:app"], + "WorkingDir": "/app", + "ExposedPorts": {"8080/tcp": {}}, + "Env": ["PYTHONDONTWRITEBYTECODE=1", "PYTHONUNBUFFERED=1"], + }, + }, + { + "Id": 
"sha256:def456789012345678901234567890abcdef1234567890abcdef1234567890ab", + "RepoTags": ["nginx:1.25"], + "Created": "2026-01-15T10:00:00Z", + "Size": 187654321, + "VirtualSize": 187654321, + "Labels": {"maintainer": "NGINX Docker Maintainers"}, + "Config": { + "Entrypoint": ["/docker-entrypoint.sh"], + "Cmd": ["nginx", "-g", "daemon off;"], + "ExposedPorts": {"80/tcp": {}}, + }, + }, + { + "Id": "sha256:789012345678901234567890abcdef1234567890abcdef1234567890abcdef12", + "RepoTags": ["postgres:15"], + "Created": "2026-01-20T12:00:00Z", + "Size": 412345678, + "VirtualSize": 412345678, + "Labels": {"maintainer": "PostgreSQL Docker Maintainers"}, + "Config": { + "Entrypoint": ["docker-entrypoint.sh"], + "Cmd": ["postgres"], + "ExposedPorts": {"5432/tcp": {}}, + }, + }, +] + + +def _find_container(name_or_id: str): + for cid, c in CONTAINERS.items(): + if name_or_id in (cid, c["Id"]): + return c + if name_or_id in c["Names"]: + return c + return None + + +@mcp.tool() +def container_list(all: bool = True) -> str: + """List containers. Set all=True to include stopped containers.""" + results = [] + for cid, c in CONTAINERS.items(): + if not all and not c["State"]["Running"]: + continue + status = c["State"]["Status"] + if c["State"]["OOMKilled"]: + status = f"Exited (137) OOMKilled" + elif c["State"]["ExitCode"] != 0 and not c["State"]["Running"]: + status = f"Exited ({c['State']['ExitCode']})" + elif c["State"]["Running"]: + status = "Up 2 days" + results.append({ + "Id": cid, + "Names": c["Names"], + "Image": c["Image"], + "Status": status, + "Created": c["Created"], + "Ports": list(c["Config"].get("ExposedPorts", {}).keys()), + }) + return json.dumps(results, indent=2) + + +@mcp.tool() +def container_inspect(name: str) -> str: + """Inspect a container by name or ID. 
Returns detailed configuration and state.""" + c = _find_container(name) + if not c: + raise ValueError(f"no container with name or ID \"{name}\": no such container") + return json.dumps(c, indent=2) + + +@mcp.tool() +def container_logs(name: str, tail: int = 100) -> str: + """Get logs from a container by name or ID.""" + c = _find_container(name) + if not c: + raise ValueError(f"no container with name or ID \"{name}\": no such container") + cname = c["Names"][0] + log = LOGS.get(cname, f"No logs available for {cname}") + return log + + +@mcp.tool() +def container_stats(name: Optional[str] = None) -> str: + """Get resource usage statistics for running containers.""" + results = [] + for cid, c in CONTAINERS.items(): + if name and name not in c["Names"] and name not in (cid, c["Id"]): + continue + if not c["State"]["Running"]: + continue + mem_limit = c["HostConfig"]["Memory"] or 8589934592 + results.append({ + "Id": cid, + "Name": c["Names"][0], + "CPUPerc": "12.5%", + "MemUsage": f"{mem_limit // 4} / {mem_limit}", + "MemPerc": "25.0%", + "NetIO": "1.2MB / 500KB", + "BlockIO": "50MB / 10MB", + "PIDs": 15, + }) + if not results: + return "No running containers found" + (f" matching '{name}'" if name else "") + return json.dumps(results, indent=2) + + +@mcp.tool() +def container_top(name: str) -> str: + """Display the running processes of a container.""" + c = _find_container(name) + if not c: + raise ValueError(f"no container with name or ID \"{name}\": no such container") + if not c["State"]["Running"]: + raise ValueError(f"container {c['Names'][0]} is not running") + return ( + "UID PID PPID C STIME TTY TIME CMD\n" + f"1001 12345 1 0 08:00 ? 
00:05:00 {' '.join(c['Config'].get('Cmd', ['']))}\n" + ) + + +@mcp.tool() +def image_list() -> str: + """List all container images.""" + results = [] + for img in IMAGES: + size_mb = img["Size"] // (1024 * 1024) + results.append({ + "Id": img["Id"][:19], + "RepoTags": img["RepoTags"], + "Created": img["Created"], + "Size": f"{size_mb}MB", + "Labels": img.get("Labels", {}), + }) + return json.dumps(results, indent=2) + + +@mcp.tool() +def image_inspect(name: str) -> str: + """Inspect a container image by name or ID.""" + for img in IMAGES: + if name in img["RepoTags"] or name == img["Id"] or img["Id"].startswith(f"sha256:{name}"): + return json.dumps(img, indent=2) + raise ValueError(f"image \"{name}\" not found") + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-developer__debug-container/instruction.md b/evaluation/with_skills/rh-developer__debug-container/instruction.md new file mode 100644 index 00000000..52862c6a --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/instruction.md @@ -0,0 +1,16 @@ +# Container Debugging Task + +You are a Red Hat developer. Two containers in your local environment have stopped working -- one exited with code 137 and another exited with code 1. Investigate why each container failed and recommend fixes. 
+ +## Requirements +- List all containers (including stopped ones) and identify which are failing +- For each failing container: inspect its configuration, review logs, and check resource limits +- Determine the root cause of each failure (e.g., memory exhaustion, missing dependency, misconfigured entrypoint) +- Recommend a specific fix for each container, including the corrected run command with proper cleanup of the failed container first +- Follow container security best practices (e.g., non-root user) in your fix commands +- Include verification commands to confirm the fix resolved the issue (e.g., checking container state for OOM status) +- If separate image variants would be a better long-term solution, explain that approach + +Document your investigation and fixes in `/root/report.md`. + +Use available tools to examine the environment. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__debug-container/solution/solve.sh b/evaluation/with_skills/rh-developer__debug-container/solution/solve.sh new file mode 100644 index 00000000..421b9a1a --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/solution/solve.sh @@ -0,0 +1,18 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Container Debug Report + +## Issue: Container exits immediately + +### Diagnosis +1. `podman inspect` → State.ExitCode: 1, State.OOMKilled: false +2. `podman logs` → Error: entrypoint not found +3. Check image entrypoint/CMD + +### Root Cause +Image entrypoint points to a binary that doesn't exist in the container. 
+ +### Fix +- Override entrypoint: `podman run --entrypoint /bin/sh myimage` +- Or fix Dockerfile CMD/ENTRYPOINT +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__debug-container/task.toml b/evaluation/with_skills/rh-developer__debug-container/task.toml new file mode 100644 index 00000000..cd098d3a --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-container" +name = "rh-developer Container Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-container", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__debug-container/tests/llm_judge.py b/evaluation/with_skills/rh-developer__debug-container/tests/llm_judge.py new file mode 100644 index 00000000..c11e081d --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "nonroot_user_and_cleanup", + "file": "/root/report.md", + "question": "Does the report include --user 1001 (non-root) in the corrected podman run command AND proper container cleanup (podman stop/rm) before rerunning?", + "reference": "A skilled report includes --user 1001 for container security and shows 'podman stop/rm' cleanup (often with 2>/dev/null || true error suppression) before the corrected run command. An unskilled report omits the --user flag and skips cleanup steps." 
+ }, + { + "id": "image_variant_strategy", + "file": "/root/report.md", + "question": "Does the report recommend separate image variants/tags (e.g., using --build-arg VARIANT=web/worker) for different container roles as a long-term solution?", + "reference": "A skilled report explains that web and worker containers should use separate image tags built with --build-arg VARIANT, rather than sharing a single image. An unskilled report only suggests adding the missing dependency to the shared image." + }, + { + "id": "oomkilled_verification", + "file": "/root/report.md", + "question": "Does the report include verification commands using jq to inspect container state (e.g., podman inspect | jq '.State.OOMKilled')?", + "reference": "A skilled report includes 'podman inspect | jq .State.OOMKilled' to programmatically verify OOM status after fixing. An unskilled report checks logs or status manually without jq-based state inspection." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__debug-container/tests/test.sh b/evaluation/with_skills/rh-developer__debug-container/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__debug-container/tests/test_outputs.py b/evaluation/with_skills/rh-developer__debug-container/tests/test_outputs.py new file mode 100644 index 00000000..34782966 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-container/tests/test_outputs.py @@ -0,0 +1,93 @@ +""" +Tests for rh-developer__debug-container per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_container(self): + content = read_report().lower() + assert "container" in content, "report should mention container" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_nonroot_user(self): + """Skill teaches running containers as non-root user (--user 1001). + Without skill, agents omit the --user flag in fix commands.""" + c = read_report() + assert "--user" in c or "user 1001" in c.lower(), ( + "should include --user flag for non-root container execution" + ) + + def test_image_variant_strategy(self): + """Skill teaches separate image tags/variants (--build-arg VARIANT=) for + different container roles. Without skill, agents use same image for all roles.""" + c = read_report() + assert "--build-arg" in c or "VARIANT=" in c or "separate image" in c.lower(), ( + "should recommend separate image variants for different roles (web vs worker)" + ) + + def test_oomkilled_state_inspection(self): + """Skill teaches verifying OOMKilled state via container inspect. 
+ Without skill, agents infer OOM from exit code only without inspecting state.""" + c = read_report() + assert any(t in c for t in [ + ".State.OOMKilled", "OOMKilled", "oomkilled", + "State.OOMKilled", "OOMKilled=true", "oomkilled=true", + ]) and any(t in c for t in [ + "inspect", "Inspect", "state", "State", + ]), "should inspect container state to verify OOMKilled" + + def test_cleanup_before_rerun(self): + """Skill teaches proper cleanup (stop + rm with error suppression) before + rerunning a failed container. Without skill, agents skip cleanup.""" + c = read_report() + assert "2>/dev/null" in c or ("podman stop" in c and "podman rm" in c) or ( + "podman rm" in c.lower() and "podman run" in c.lower() + ), "should include container cleanup before rerunning (stop/rm pattern)" + + def test_exit_code_137_oom_mapping(self): + """Skill teaches exit code 137 = OOMKilled, recommend memory increase.""" + c = read_report().lower() + assert ("137" in c or "oom" in c) and "memory" in c, ( + "should map exit 137 to OOM and address memory" + ) + + def test_memory_swap_configuration(self): + """Skill teaches --memory-swap flag for Podman to control total memory + (RAM + swap). Without skill, agents only adjust --memory without swap.""" + c = read_report().lower() + assert "memory-swap" in c or "swap" in c or "memory+swap" in c, ( + "should address memory-swap configuration for container memory limits" + ) + + def test_separate_worker_image(self): + """Skill teaches creating separate container images for different roles + (web vs worker) rather than running all roles from a single image. 
+ Without skill, agents patch the existing single image.""" + c = read_report().lower() + assert any(t in c for t in [ + "separate image", "worker image", "dockerfile.worker", + "dedicated image", "purpose-built", "role-specific", + ]) or ("web" in c and "worker" in c and "image" in c), ( + "should recommend separate images for different container roles" + ) diff --git a/evaluation/with_skills/rh-developer__debug-network/environment/Dockerfile b/evaluation/with_skills/rh-developer__debug-network/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-network/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers 
/root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__debug-network/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__debug-network/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-network/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__debug-network/instruction.md b/evaluation/with_skills/rh-developer__debug-network/instruction.md new file mode 100644 index 00000000..c74e95ff --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-network/instruction.md @@ -0,0 +1,12 @@ +# Network Debugging Task + +You are a Red Hat developer. An application is returning HTTP 503 errors when accessed via its Route. Investigate the networking configuration to find the issue. + +## Requirements +- Trace the request path (Route → Service → Pod) +- Identify the network misconfiguration +- Recommend a fix + +Use MCP tools to examine the cluster. Document your methodology, findings, and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__debug-network/solution/solve.sh b/evaluation/with_skills/rh-developer__debug-network/solution/solve.sh new file mode 100644 index 00000000..ef071a06 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-network/solution/solve.sh @@ -0,0 +1,19 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Network Debug Report + +## Issue: Route 503 for order-service + +### Root Cause +**Service selector mismatch**: Service selector `app: order-svc` does not match pod label `app: order-service`. + +### Diagnosis +1. Route status: Admitted ✓ +2. Service selector: `app: order-svc` +3. Pod labels: `app: order-service` +4. Endpoints: 0 (no matching pods) +5. 
Test: `oc run test-curl --rm -i --tty --image=curlimages/curl -- curl -v http://order-service.order-system.svc.cluster.local:8080` + +### Fix +Update Service selector to match pod labels: `app: order-service` +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__debug-network/task.toml b/evaluation/with_skills/rh-developer__debug-network/task.toml new file mode 100644 index 00000000..d8399696 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-network/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-network" +name = "rh-developer Network Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-network", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__debug-network/tests/llm_judge.py b/evaluation/with_skills/rh-developer__debug-network/tests/llm_judge.py new file mode 100644 index 00000000..3eaeb7d0 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-network/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "route_admitted_vs_exists", + "file": "/root/report.md", + "question": "Does the report check the Route Admitted condition (from the router) rather than just verifying the Route resource exists?", + "reference": "A skilled report checks the Route's Admitted condition which indicates the router has accepted and configured the route. 
An unskilled report only verifies the Route exists without checking its admission status." + }, + { + "id": "tls_termination_nuances", + "file": "/root/report.md", + "question": "Does the report address TLS termination nuances such as reencrypt requiring destinationCA or passthrough with HTTP backend mismatch?", + "reference": "A skilled report explains that reencrypt TLS termination requires a destinationCA certificate, and that passthrough routes with HTTP-only backends will fail. An unskilled report treats all TLS types as equivalent." + }, + { + "id": "in_cluster_debug_pattern", + "file": "/root/report.md", + "question": "Does the report use a disposable in-cluster curl pod to test internal Service connectivity?", + "reference": "A skilled report creates a temporary curl pod inside the cluster to test Service connectivity from within. An unskilled report only tests external Route access." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__debug-network/tests/test.sh b/evaluation/with_skills/rh-developer__debug-network/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-network/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__debug-network/tests/test_outputs.py b/evaluation/with_skills/rh-developer__debug-network/tests/test_outputs.py new file mode 100644 index 00000000..60293420 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-network/tests/test_outputs.py @@ -0,0 +1,95 @@ +""" +Tests for rh-developer__debug-network per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_network_issue(self): + content = read_report().lower() + assert "503" in content or "network" in content or "route" in content, ( + "report should mention the network issue" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_route_admitted_condition(self): + """Skill teaches Route Admitted condition (from the router) is distinct from + Route just existing. Without skill, agents only check if Route exists.""" + c = read_report().lower() + assert "admitted" in c or "route admitted" in c or ("condition" in c and "route" in c), ( + "should check Route Admitted condition (not just Route existence)" + ) + + def test_empty_endpoints_diagnosis(self): + """Skill teaches checking Endpoints object for empty subsets as the root + cause of 503 errors. 
Without skill, agents check pod status but not the
+        Endpoints object directly."""
+        c = read_report().lower()
+        assert ("endpoint" in c and any(t in c for t in [
+            "empty", "no endpoint", "none", "no backend", "no subsets",
+            "0 endpoint", "missing",
+        ])) or "oc get endpoints" in c or "get ep " in c, (
+            "should diagnose empty Endpoints as root cause of 503"
+        )
+
+    def test_curl_pod_in_cluster_debug(self):
+        """Skill teaches using a disposable in-cluster curl pod for debugging
+        internal connectivity. Without skill, agents test externally only."""
+        c = read_report().lower()
+        assert ("curl" in c and "pod" in c) or "debug pod" in c or ("run" in c and "curl" in c) or (
+            "cluster" in c and "curl" in c
+        ), "should use in-cluster curl pod for connectivity debugging"
+
+    def test_connectivity_path_tracing(self):
+        """Skill teaches tracing Route → Service → Endpoints → Pod path."""
+        c = read_report().lower()
+        path_terms = ["route", "service", "endpoint", "pod"]
+        mentioned = sum(1 for t in path_terms if t in c)
+        assert mentioned >= 3, "should trace connectivity path (Route→Service→Endpoints→Pod)"
+
+    def test_selector_label_mismatch(self):
+        """Skill teaches 503 often means selector doesn't match pod labels."""
+        c = read_report().lower()
+        assert any(t in c for t in ["selector", "label", "match", "mismatch"]) and any(t in c for t in [
+            "endpoint", "503"
+        ]), "should identify selector/label mismatch causing no endpoints"
+
+    def test_oc_patch_fix_command(self):
+        """Skill teaches using oc patch or oc edit for Service selector fixes.
+ Without skill, agents describe the fix narratively without the actual + command to apply it.""" + c = read_report().lower() + assert any(t in c for t in [ + "oc patch", "oc edit", "kubectl patch", "oc label", + ]) or ("patch" in c and "service" in c), ( + "should include oc patch/edit command for Service selector fix" + ) + + def test_network_policy_awareness(self): + """Skill teaches checking NetworkPolicy as a potential cause of network + issues. Without skill, agents focus only on Service/Route without + considering NetworkPolicy restrictions.""" + c = read_report() + assert "NetworkPolicy" in c or "network policy" in c.lower() or ( + "networkpolic" in c.lower() + ), "should check NetworkPolicy as potential network restriction" diff --git a/evaluation/with_skills/rh-developer__debug-pipeline/environment/Dockerfile b/evaluation/with_skills/rh-developer__debug-pipeline/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pipeline/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills 
/root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__debug-pipeline/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__debug-pipeline/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pipeline/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. 
order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + 
"ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": 
"256Mi"},
+                        "limits": {"cpu": "500m", "memory": "512Mi"},
+                    },
+                }
+            ],
+        },
+    ],
+    "web-frontend": [
+        {
+            "name": "web-frontend-6c5d8b7a9-p4n2j",
+            "namespace": "web-frontend",
+            "status": "CrashLoopBackOff",
+            "restart_count": 8,
+            "labels": {"app": "web-frontend", "deployment": "web-frontend"},
+            "containers": [
+                {
+                    "name": "web-frontend",
+                    "state": "Waiting",
+                    "reason": "CrashLoopBackOff",
+                    "last_state": {
+                        "terminated": {
+                            "exit_code": 137,
+                            "reason": "OOMKilled",
+                            "message": "Container exceeded memory limit",
+                        }
+                    },
+                    "ready": False,
+                    "resources": {
+                        "requests": {"cpu": "50m", "memory": "32Mi"},
+                        "limits": {"cpu": "200m", "memory": "64Mi"},
+                    },
+                }
+            ],
+        },
+    ],
+    "order-system": [
+        {
+            "name": "order-service-5a4b3c2d1-h7j6k",
+            "namespace": "order-system",
+            "status": "Running",
+            "restart_count": 0,
+            "labels": {"app": "order-service", "deployment": "order-service"},
+            "containers": [
+                {
+                    "name": "order-service",
+                    "state": "Running",
+                    "ready": True,
+                    "ports": [{"containerPort": 8080}],
+                    "resources": {
+                        "requests": {"cpu": "200m", "memory": "512Mi"},
+                        "limits": {"cpu": "1", "memory": "1Gi"},
+                    },
+                }
+            ],
+        },
+    ],
+}
+
+
+# ---------------------------------------------------------------------------
+# Pod logs
+# ---------------------------------------------------------------------------
+
+POD_LOGS = {
+    "api-service-7b8f9d4c5-x2k9m": (
+        "---> Running application from script (app.sh) ...\n"
+        "sh: app.sh: No such file or directory\n"
+        "---> Trying to run with gunicorn ...\n"
+        "Traceback (most recent call last):\n"
+        " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n"
+        " from gunicorn.app.wsgiapp import run\n"
+        "ModuleNotFoundError: No module named 'gunicorn'\n"
+        "---> Trying to run app.py ...\n"
+        "Error: Could not find '/opt/app-root/src/app.py'\n"
+        "---> Failed to find any valid entry point.\n"
+        " Set the APP_MODULE environment variable to specify your application callable.\n"
+        " Expected one of: app.sh, gunicorn with 
APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", 
+ "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application 
source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + 
{ + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__debug-pipeline/instruction.md b/evaluation/with_skills/rh-developer__debug-pipeline/instruction.md new file mode 100644 index 00000000..e65370d4 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pipeline/instruction.md @@ -0,0 +1,12 @@ +# Pipeline Debugging Task + +You are a Red Hat developer. A Tekton PipelineRun has failed. Investigate the pipeline to identify which task failed and why. + +## Requirements +- Examine the PipelineRun status and task results +- Identify the failing task and step +- Recommend a fix or retry strategy + +Use MCP tools to examine the cluster. Document your methodology, findings, and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__debug-pipeline/solution/solve.sh b/evaluation/with_skills/rh-developer__debug-pipeline/solution/solve.sh new file mode 100644 index 00000000..f879ab73 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pipeline/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Pipeline Debug Report + +## Failed PipelineRun Analysis + +### Failure Location +- PipelineRun: build-and-deploy-run +- Failed Task: integration-test +- Failed Step: `step-test` (Tekton names step containers as `step-`) + +### Step Logs +Extract from TaskRun pod, container `step-test`. + +### Root Cause +Integration test failed because the service endpoint returned 503. 
+
+### Fix
+- Fix the underlying service issue first
+- Retry: `tkn pipeline start build-and-deploy --use-pipelinerun build-and-deploy-run`
+REPORT_EOF
diff --git a/evaluation/with_skills/rh-developer__debug-pipeline/task.toml b/evaluation/with_skills/rh-developer__debug-pipeline/task.toml
new file mode 100644
index 00000000..d6025adc
--- /dev/null
+++ b/evaluation/with_skills/rh-developer__debug-pipeline/task.toml
@@ -0,0 +1,26 @@
+version = "1.0"
+
+[metadata]
+id = "rh-developer__debug-pipeline"
+name = "rh-developer Pipeline Debugging Skill Evaluation"
+difficulty = "medium"
+category = "per-skill-eval"
+tags = ["rh-developer", "debug-pipeline", "per-skill-eval"]
+
+[verifier]
+timeout_sec = 900.0
+
+[verifier.env]
+ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}"
+ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
+LLM_JUDGE_MODEL = "claude-sonnet-4-5"
+
+[agent]
+timeout_sec = 900.0
+
+[environment]
+build_timeout_sec = 600.0
+cpus = 2
+gpus = 0
+memory = "5.5G"
+storage = "10G"
diff --git a/evaluation/with_skills/rh-developer__debug-pipeline/tests/llm_judge.py b/evaluation/with_skills/rh-developer__debug-pipeline/tests/llm_judge.py
new file mode 100644
index 00000000..ed51f96a
--- /dev/null
+++ b/evaluation/with_skills/rh-developer__debug-pipeline/tests/llm_judge.py
@@ -0,0 +1,108 @@
+import json
+import os
+import sys
+import time
+from pathlib import Path
+
+try:
+    from anthropic import Anthropic
+except ImportError:
+    print("ERROR: anthropic package not installed")
+    sys.exit(1)
+
+CRITERIA = [
+    {
+        "id": "step_container_naming",
+        "file": "/root/report.md",
+        "question": "Does the report reference the step-<step-name> container naming convention used in TaskRun pods for targeting logs?",
+        "reference": "A skilled report knows that Tekton step containers are named step-<step-name> and uses this to target specific step logs. An unskilled report retrieves pod logs generically without step-level targeting."
+    },
+    {
+        "id": "taskrun_label_filtering",
+        "file": "/root/report.md",
+        "question": "Does the report describe filtering or selecting TaskRuns by their parent PipelineRun (e.g., using tekton.dev/pipelineRun label or equivalent selector), rather than listing all TaskRuns in the namespace?",
+        "reference": "A skilled report filters TaskRuns by the parent PipelineRun label (tekton.dev/pipelineRun=<pipelinerun-name>) to isolate the relevant failure. An unskilled report lists all TaskRuns or checks them one by one without label-based filtering."
+    },
+    {
+        "id": "hierarchy_diagnosis",
+        "file": "/root/report.md",
+        "question": "Does the report systematically drill from PipelineRun → failed TaskRun → step container logs to isolate the failure?",
+        "reference": "A skilled report follows the PipelineRun→TaskRun→Step hierarchy. An unskilled report checks PipelineRun status without drilling into TaskRun step-level details."
+    }
+]
+
+SYSTEM_PROMPT = (
+    "You are an evaluator for a cloud operations benchmark. You will be given a "
+    "file produced by an AI agent, a yes/no question about its contents, and a "
+    "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n"
+    "Rules:\n"
+    "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n"
+    "- Base your answer strictly on what is written in the file content\n"
+    "- Do not infer or assume knowledge the agent did not demonstrate\n"
+    "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n"
+    "- Accept different phrasings that convey the SAME concept\n"
+    "- Do NOT use your own general knowledge to fill gaps"
+)
+
+
+def judge_criterion(client, model, criterion):
+    filepath = criterion["file"]
+    if not Path(filepath).exists():
+        return {"id": criterion["id"], "pass": False, "reason": "file not found"}
+    content = Path(filepath).read_text()
+    if len(content) > 50000:
+        content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__debug-pipeline/tests/test.sh b/evaluation/with_skills/rh-developer__debug-pipeline/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pipeline/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__debug-pipeline/tests/test_outputs.py b/evaluation/with_skills/rh-developer__debug-pipeline/tests/test_outputs.py new file mode 100644 index 00000000..8112bbd2 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pipeline/tests/test_outputs.py @@ -0,0 +1,53 @@ +""" +Tests for rh-developer__debug-pipeline per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_pipeline(self): + content = read_report().lower() + assert "pipeline" in content, "report should mention pipeline" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_pipelinerun_taskrun_hierarchy(self): + """Skill teaches PipelineRun → TaskRun → Step hierarchy to find failure.""" + c = read_report().lower() + assert any(t in c for t in ["pipelinerun", "pipeline run"]) and any(t in c for t in [ + "taskrun", "task run", "task" + ]), "should drill PipelineRun→TaskRun hierarchy" + + def test_concrete_remediation(self): + """Skill teaches distinguishing transient vs config fix needed.""" + c = read_report().lower() + assert any(t in c for t in ["retry", "rerun", "fix", "remediat", "resolv"]), ( + "should provide remediation guidance" + ) + + def test_taskrun_label_filter(self): + """Docs teach filtering TaskRuns by parent pipeline using + tekton.dev/pipelineRun= label. 
Without docs, agents list all TaskRuns.""" + c = read_report().lower() + assert "tekton.dev/pipelinerun" in c or ("label" in c and "pipelinerun" in c) or ( + "filter" in c and "taskrun" in c + ), "should filter TaskRuns by pipelineRun label" diff --git a/evaluation/with_skills/rh-developer__debug-pod/environment/Dockerfile b/evaluation/with_skills/rh-developer__debug-pod/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pod/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": 
["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__debug-pod/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__debug-pod/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pod/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed at the integration-test task, failure log in step-test container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / 
/ / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__debug-pod/instruction.md b/evaluation/with_skills/rh-developer__debug-pod/instruction.md new file mode 100644 index 00000000..9a983f81 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pod/instruction.md @@ -0,0 +1,14 @@ +# Pod Debugging Task + +You are a Red Hat developer. A pod in the `web-frontend` namespace keeps crashing and restarting. Your team needs you to investigate, identify the root cause, and recommend a fix. + +## Requirements +- Check the pod status and identify the failure pattern (exit code, restart count, state) +- Examine container logs, including logs from previous crashed containers +- Analyze resource limits and requests to determine if the crash is resource-related +- Review namespace events for warnings or errors related to the pod +- Identify the root cause and recommend a specific fix + +Use MCP tools to examine the cluster. Document your methodology, findings, and recommended remediation in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__debug-pod/solution/solve.sh b/evaluation/with_skills/rh-developer__debug-pod/solution/solve.sh new file mode 100644 index 00000000..dca1ff71 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pod/solution/solve.sh @@ -0,0 +1,39 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Pod Debug Report + +## Investigation Summary +A pod in the web-frontend namespace is crashing repeatedly. 
+ +## Pod Status +- Namespace: web-frontend +- Pod: web-frontend-6c5d8b7a9-p4n2j (CrashLoopBackOff) +- Exit code: 137 (OOMKilled: SIGKILL, memory limit exceeded) +- Restart count: 8 + +## Diagnosis Methodology +1. Listed pods in web-frontend namespace: found pod in CrashLoopBackOff +2. Examined container status: exit code 137, reason: OOMKilled +3. Checked logs from the previous crashed container (`--previous`): server starts but gets Killed +4. Reviewed events: OOMKilled warning with memory limit 64Mi +5. Analyzed resource limits: memory limit 64Mi is too low for Node.js + +## Root Cause +Exit 137 = 128 + 9 (SIGKILL). The container was OOMKilled because the memory limit of 64Mi is insufficient for a Node.js application. The application starts normally but is killed when memory usage exceeds the limit during initialization of middleware. + +## Events Analysis +- Warning (OOMKilled): Container exceeded memory limit of 64Mi +- Warning (BackOff): Back-off restarting failed container + +## Recommended Fix +Increase the memory limit for the web-frontend deployment: +- Current: requests=32Mi, limits=64Mi +- Recommended: requests=128Mi, limits=256Mi (or higher depending on app needs) + +This can be applied with `oc set resources deployment/web-frontend -n web-frontend --requests=memory=128Mi --limits=memory=256Mi`. 
+ +## Additional Notes +- The application logs show it starts successfully but is killed during middleware initialization +- No memory leak — the base memory requirement simply exceeds the configured limit +- Consider monitoring memory usage after the fix to right-size the limits +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__debug-pod/task.toml b/evaluation/with_skills/rh-developer__debug-pod/task.toml new file mode 100644 index 00000000..89bac572 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pod/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-pod" +name = "rh-developer Pod Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-pod", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__debug-pod/tests/llm_judge.py b/evaluation/with_skills/rh-developer__debug-pod/tests/llm_judge.py new file mode 100644 index 00000000..3bad1517 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pod/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "previous_logs_methodology", + "file": "/root/report.md", + "question": "Does the report use --previous flag to retrieve logs from crashed containers when restarts are detected?", + "reference": "A skilled report uses --previous to get logs from the terminated container instance when restart count > 0. 
An unskilled report only checks current container logs, missing crash context." + }, + { + "id": "readiness_endpoint_link", + "file": "/root/report.md", + "question": "Does the report explain that readiness probe failures remove the pod from Service endpoints, causing traffic loss?", + "reference": "A skilled report explains the readiness→endpoints relationship: failed readiness probes remove the pod from Service endpoints. An unskilled report treats readiness as only affecting pod status." + }, + { + "id": "oom_diagnosis_and_fix", + "file": "/root/report.md", + "question": "Does the report map exit code 137 to OOMKilled and provide concrete oc set resources or oc patch commands to increase memory limits?", + "reference": "A skilled report maps 137→OOM and provides actionable oc commands to fix resource limits. An unskilled report may identify OOM but gives vague recommendations." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__debug-pod/tests/test.sh b/evaluation/with_skills/rh-developer__debug-pod/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pod/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__debug-pod/tests/test_outputs.py b/evaluation/with_skills/rh-developer__debug-pod/tests/test_outputs.py new file mode 100644 index 00000000..fda1b3ed --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-pod/tests/test_outputs.py @@ -0,0 +1,75 @@ +""" +Tests for rh-developer__debug-pod per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_pod_or_container(self): + content = read_report().lower() + assert "pod" in content or "container" in content, "report should mention pod or container" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_previous_logs_flag(self): + """Skill teaches using --previous to get logs from crashed container + when restarts > 0. Without skill, agents only check current logs.""" + c = read_report() + assert "--previous" in c or "previous" in c.lower(), ( + "should use --previous flag to get logs from crashed container" + ) + + def test_readiness_removes_endpoints(self): + """Skill teaches that readiness probe failures remove pod from Service + endpoints, causing traffic loss. 
Without skill, agents miss this link.""" + c = read_report().lower() + assert ("readiness" in c and "endpoint" in c) or ("readiness" in c and "service" in c) or ( + "readiness" in c and "traffic" in c + ), "should explain readiness failures remove Service endpoints" + + def test_exit_137_oomkilled_mapping(self): + """Skill teaches exit code 137 = OOMKilled, map to memory limit.""" + c = read_report().lower() + assert ("137" in c or "oom" in c or "oomkill" in c) and any(t in c for t in [ + "memory", "limit", "increase" + ]), "should map exit 137 to OOMKilled and memory limit" + + def test_concrete_remediation_command(self): + """Skill teaches oc set resources deployment/... --limits=memory=.""" + c = read_report().lower() + assert any(t in c for t in ["oc set resources", "oc patch", "memory=", "limits"]) or ( + "```" in read_report() and "oc" in c + ), "should include concrete oc remediation command" + + def test_resource_analysis(self): + """Skill teaches analyzing memory request/limit for OOM remediation.""" + c = read_report().lower() + assert any(t in c for t in ["limit", "request"]) and any(t in c for t in [ + "memory", "resource", "increase" + ]), "should analyze resource limits for OOM" + + def test_events_correlation(self): + """Skill teaches checking events for scheduling, OOM, and image pull failures.""" + c = read_report().lower() + assert "event" in c and any(t in c for t in [ + "oom", "schedule", "pull", "fail", "kill", "backoff" + ]), "should correlate pod events with failure cause" diff --git a/evaluation/with_skills/rh-developer__debug-rhel/environment/Dockerfile b/evaluation/with_skills/rh-developer__debug-rhel/environment/Dockerfile new file mode 100644 index 00000000..d70159c5 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + 
&& rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhel-system": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhel-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. 
+ +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": 
"image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# 
--------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + 
"api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": 
"image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running 
assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route 
data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", 
"reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + 
], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ + {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus 
application. Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-rhel-mcp.py b/evaluation/with_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-rhel-mcp.py new file mode 100644 index 00000000..314f0e3b --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-rhel-mcp.py @@ -0,0 +1,335 @@ +#!/usr/bin/env python3 +"""Mock RHEL System MCP Server for RHEL debugging evaluation. + +Simulates a RHEL 9 host with a failing service. Exposes system-level +diagnostic tools (systemctl, journalctl, getenforce, firewall-cmd, ausearch) +as MCP tools so the agent can diagnose the issue. + +Scenario: + Host: app-server-01.example.com (RHEL 9.3) + Failing service: myapp.service + Root causes: + 1. SELinux denial: httpd_t cannot bind to port 9090 + 2. Firewall: port 9090/tcp is not open + 3. 
Service configuration references correct binary but SELinux blocks it +""" + +import json +from typing import Optional + +from fastmcp import FastMCP + +mcp = FastMCP("rhel-system") + +HOST = "app-server-01.example.com" +RHEL_VER = "9.3" + +SERVICES = { + "myapp.service": { + "loaded": True, + "enabled": True, + "active": "failed", + "sub": "failed", + "description": "My Application Service", + "main_pid": 0, + "exit_code": "exited", + "exit_status": 1, + "exec_start": "/opt/myapp/bin/myapp-server --port 9090 --config /etc/myapp/config.yaml", + "user": "myapp", + "group": "myapp", + "working_directory": "/opt/myapp", + "environment": "APP_ENV=production DB_HOST=localhost DB_PORT=5432", + "restart": "on-failure", + "restart_sec": 5, + "status_output": ( + "● myapp.service - My Application Service\n" + " Loaded: loaded (/etc/systemd/system/myapp.service; enabled; preset: disabled)\n" + " Active: failed (Result: exit-code) since Sun 2026-03-01 18:30:45 UTC; 17h ago\n" + " Process: 45678 ExecStart=/opt/myapp/bin/myapp-server --port 9090 --config /etc/myapp/config.yaml (code=exited, status=1/FAILURE)\n" + " Main PID: 45678 (code=exited, status=1/FAILURE)\n" + " CPU: 125ms\n" + "\n" + "Mar 01 18:30:44 app-server-01 systemd[1]: Starting My Application Service...\n" + "Mar 01 18:30:44 app-server-01 myapp-server[45678]: Starting myapp-server v2.1.0\n" + "Mar 01 18:30:44 app-server-01 myapp-server[45678]: Loading configuration from /etc/myapp/config.yaml\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Configuration loaded successfully\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Attempting to bind to 0.0.0.0:9090\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Error: Permission denied: bind to 0.0.0.0:9090\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Fatal: Cannot start server, exiting\n" + "Mar 01 18:30:45 app-server-01 systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE\n" + "Mar 01 18:30:45 app-server-01 
systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + ), + }, + "sshd.service": { + "loaded": True, + "enabled": True, + "active": "active", + "sub": "running", + "description": "OpenSSH server daemon", + "main_pid": 1234, + "exit_code": "", + "exit_status": 0, + }, + "firewalld.service": { + "loaded": True, + "enabled": True, + "active": "active", + "sub": "running", + "description": "firewalld - dynamic firewall daemon", + "main_pid": 2345, + "exit_code": "", + "exit_status": 0, + }, + "postgresql.service": { + "loaded": True, + "enabled": True, + "active": "active", + "sub": "running", + "description": "PostgreSQL database server", + "main_pid": 3456, + "exit_code": "", + "exit_status": 0, + }, +} + +JOURNAL_LOGS = { + "myapp.service": ( + "-- Journal begins at Sat 2026-02-28 00:00:00 UTC, ends at Sun 2026-03-02 12:00:00 UTC. --\n" + "Mar 01 18:30:44 app-server-01 systemd[1]: Starting My Application Service...\n" + "Mar 01 18:30:44 app-server-01 myapp-server[45678]: Starting myapp-server v2.1.0\n" + "Mar 01 18:30:44 app-server-01 myapp-server[45678]: Loading configuration from /etc/myapp/config.yaml\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Configuration loaded successfully\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Connecting to database at localhost:5432... 
OK\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Attempting to bind to 0.0.0.0:9090\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Error: Permission denied: bind to 0.0.0.0:9090\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Fatal: Cannot start server, exiting\n" + "Mar 01 18:30:45 app-server-01 systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE\n" + "Mar 01 18:30:45 app-server-01 systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + "Mar 01 18:30:50 app-server-01 systemd[1]: myapp.service: Scheduled restart job, restart counter is at 1.\n" + "Mar 01 18:30:50 app-server-01 systemd[1]: Starting My Application Service...\n" + "Mar 01 18:30:50 app-server-01 myapp-server[45690]: Starting myapp-server v2.1.0\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Loading configuration from /etc/myapp/config.yaml\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Configuration loaded successfully\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Connecting to database at localhost:5432... 
OK\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Attempting to bind to 0.0.0.0:9090\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Error: Permission denied: bind to 0.0.0.0:9090\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Fatal: Cannot start server, exiting\n" + "Mar 01 18:30:51 app-server-01 systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE\n" + "Mar 01 18:30:51 app-server-01 systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + "Mar 01 18:30:56 app-server-01 systemd[1]: myapp.service: Scheduled restart job, restart counter is at 2.\n" + "Mar 01 18:30:56 app-server-01 systemd[1]: Starting My Application Service...\n" + "Mar 01 18:30:56 app-server-01 myapp-server[45705]: Starting myapp-server v2.1.0\n" + "Mar 01 18:30:57 app-server-01 myapp-server[45705]: Error: Permission denied: bind to 0.0.0.0:9090\n" + "Mar 01 18:30:57 app-server-01 myapp-server[45705]: Fatal: Cannot start server, exiting\n" + "Mar 01 18:30:57 app-server-01 systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE\n" + "Mar 01 18:30:57 app-server-01 systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + "Mar 01 18:30:57 app-server-01 systemd[1]: myapp.service: Start request repeated too quickly.\n" + "Mar 01 18:30:57 app-server-01 systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + ), +} + + +@mcp.tool() +def systemctl_status(service: str) -> str: + """Get the status of a systemd service (equivalent to 'systemctl status ').""" + svc = SERVICES.get(service) + if not svc: + return f"Unit {service} could not be found." 
+ + if svc.get("status_output"): + return svc["status_output"] + + state = "active (running)" if svc["active"] == "active" else "failed" + return ( + f"● {service} - {svc['description']}\n" + f" Loaded: loaded (/usr/lib/systemd/system/{service}; " + f"{'enabled' if svc['enabled'] else 'disabled'}; preset: disabled)\n" + f" Active: {state}\n" + f" Main PID: {svc['main_pid']}\n" + ) + + +@mcp.tool() +def systemctl_list_failed() -> str: + """List all failed systemd services (equivalent to 'systemctl --failed').""" + failed = [(name, svc) for name, svc in SERVICES.items() if svc["active"] == "failed"] + if not failed: + return "0 loaded units listed." + + lines = [" UNIT LOAD ACTIVE SUB DESCRIPTION"] + for name, svc in failed: + lines.append( + f" {name:<24s} loaded failed failed {svc['description']}" + ) + lines.append(f"\n{len(failed)} loaded units listed.") + return "\n".join(lines) + + +@mcp.tool() +def journalctl(unit: Optional[str] = None, lines: int = 100, priority: Optional[str] = None) -> str: + """Get journal logs, optionally filtered by unit or priority.""" + if unit and unit in JOURNAL_LOGS: + log = JOURNAL_LOGS[unit] + if priority and priority in ("err", "3"): + return "\n".join( + line for line in log.split("\n") + if "Error" in line or "Fatal" in line or "FAILURE" in line or "failed" in line.lower() + ) + return log + + if unit: + return f"-- No entries for unit {unit} --" + + return ( + "-- Journal begins at Sat 2026-02-28 00:00:00 UTC --\n" + "Mar 02 12:00:00 app-server-01 kernel: Linux version 5.14.0-362.el9.x86_64\n" + "Mar 02 12:00:00 app-server-01 systemd[1]: Started system.\n" + ) + + +@mcp.tool() +def getenforce() -> str: + """Get SELinux enforcement mode (equivalent to 'getenforce').""" + return "Enforcing" + + +@mcp.tool() +def ausearch_avc(recent: bool = True, comm: Optional[str] = None) -> str: + """Search for SELinux AVC denial messages (equivalent to 'ausearch -m AVC').""" + denials = [ + { + "timestamp": "Mar 01 18:30:45", + "type": "AVC", 
+ "result": "denied", + "permission": "name_bind", + "scontext": "system_u:system_r:httpd_t:s0", + "tcontext": "system_u:object_r:unreserved_port_t:s0", + "tclass": "tcp_socket", + "comm": "myapp-server", + "port": 9090, + }, + { + "timestamp": "Mar 01 18:30:50", + "type": "AVC", + "result": "denied", + "permission": "name_bind", + "scontext": "system_u:system_r:httpd_t:s0", + "tcontext": "system_u:object_r:unreserved_port_t:s0", + "tclass": "tcp_socket", + "comm": "myapp-server", + "port": 9090, + }, + { + "timestamp": "Mar 01 18:30:56", + "type": "AVC", + "result": "denied", + "permission": "name_bind", + "scontext": "system_u:system_r:httpd_t:s0", + "tcontext": "system_u:object_r:unreserved_port_t:s0", + "tclass": "tcp_socket", + "comm": "myapp-server", + "port": 9090, + }, + ] + + if comm: + denials = [d for d in denials if d["comm"] == comm] + + if not denials: + return "No AVC denials found." + + lines = [] + for d in denials: + lines.append( + f"----\n" + f"time->{d['timestamp']}\n" + f"type=AVC msg=audit: avc: denied {{ {d['permission']} }} for " + f"comm=\"{d['comm']}\" " + f"src={d['port']} " + f"scontext={d['scontext']} " + f"tcontext={d['tcontext']} " + f"tclass={d['tclass']} permissive=0" + ) + return "\n".join(lines) + + +@mcp.tool() +def firewall_cmd_state() -> str: + """Check if firewalld is running (equivalent to 'firewall-cmd --state').""" + return "running" + + +@mcp.tool() +def firewall_cmd_list_all() -> str: + """List all firewall rules for the default zone (equivalent to 'firewall-cmd --list-all').""" + return ( + "public (active)\n" + " target: default\n" + " icmp-block-inversion: no\n" + " interfaces: eth0\n" + " sources: \n" + " services: cockpit dhcpv6-client ssh\n" + " ports: 5432/tcp\n" + " protocols: \n" + " forward: yes\n" + " masquerade: no\n" + " forward-ports: \n" + " source-ports: \n" + " icmp-blocks: \n" + " rich rules: \n" + ) + + +@mcp.tool() +def firewall_cmd_query_port(port: str) -> str: + """Check if a specific port is open 
in the firewall (e.g. '9090/tcp').""" + open_ports = {"5432/tcp", "22/tcp"} + if port in open_ports: + return "yes" + return "no" + + +@mcp.tool() +def semanage_port_list(port_type: Optional[str] = None) -> str: + """List SELinux port type assignments (equivalent to 'semanage port -l').""" + entries = [ + ("http_port_t", "tcp", "80, 81, 443, 488, 8008, 8009, 8443, 9000"), + ("ssh_port_t", "tcp", "22"), + ("postgresql_port_t", "tcp", "5432"), + ("unreserved_port_t", "tcp", "1024-32767"), + ("unreserved_port_t", "udp", "1024-32767"), + ] + if port_type: + entries = [(t, p, ports) for t, p, ports in entries if t == port_type] + + lines = ["SELinux Port Type Proto Port Number"] + for t, p, ports in entries: + lines.append(f"{t:<26s} {p:<8s} {ports}") + return "\n".join(lines) + + +@mcp.tool() +def system_info() -> str: + """Get basic system information (hostname, OS, kernel, uptime).""" + return json.dumps({ + "hostname": HOST, + "os": f"Red Hat Enterprise Linux {RHEL_VER}", + "kernel": "5.14.0-362.el9.x86_64", + "arch": "x86_64", + "uptime": "15 days, 3:42", + "load_average": "0.45, 0.38, 0.32", + "memory": { + "total": "16384 MB", + "used": "5120 MB", + "free": "8192 MB", + "available": "11264 MB", + }, + "disk": { + "/": {"total": "50G", "used": "18G", "available": "32G", "use_percent": "36%"}, + "/var": {"total": "100G", "used": "45G", "available": "55G", "use_percent": "45%"}, + }, + }, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/with_skills/rh-developer__debug-rhel/instruction.md b/evaluation/with_skills/rh-developer__debug-rhel/instruction.md new file mode 100644 index 00000000..ca2ade3a --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/instruction.md @@ -0,0 +1,12 @@ +# RHEL System Debugging Task + +You are a Red Hat developer. A RHEL-based service is failing to start or accept connections. Investigate the system configuration to identify the issue. 
+ +## Requirements +- Check service status, SELinux, and firewall configuration +- Identify the system-level root cause +- Recommend a fix + +Use available tools to examine the environment. Document your methodology, findings, and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__debug-rhel/solution/solve.sh b/evaluation/with_skills/rh-developer__debug-rhel/solution/solve.sh new file mode 100644 index 00000000..350dd5d5 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/solution/solve.sh @@ -0,0 +1,36 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# RHEL Debug Report + +## Issue: Flask app can't bind to port 8080 + +### Systemd Check +```bash +systemctl status flask-app +``` + +### Journal Logs +```bash +journalctl -u flask-app -n 100 +``` + +### SELinux Check +```bash +getenforce +ausearch -m AVC -ts recent +# Found: denied bind to port 8080 +``` + +### Fix: Add port to SELinux +```bash +sudo semanage port -a -t http_port_t -p tcp 8080 +sudo restorecon -Rv /opt/flask-app +``` + +### Firewall Check +```bash +sudo firewall-cmd --list-all +sudo firewall-cmd --permanent --add-port=8080/tcp +sudo firewall-cmd --reload +``` +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__debug-rhel/task.toml b/evaluation/with_skills/rh-developer__debug-rhel/task.toml new file mode 100644 index 00000000..32fb504b --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-rhel" +name = "rh-developer RHEL Deployment Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-rhel", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = 
"claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__debug-rhel/tests/llm_judge.py b/evaluation/with_skills/rh-developer__debug-rhel/tests/llm_judge.py new file mode 100644 index 00000000..e170f4bb --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "ausearch_avc_workflow", + "file": "/root/report.md", + "question": "Does the report use ausearch -m AVC for investigating SELinux denials, rather than generic SELinux commands?", + "reference": "A skilled report uses 'ausearch -m AVC -ts recent' to find recent SELinux AVC denials. An unskilled report checks getenforce or sestatus without examining specific denials." + }, + { + "id": "semanage_port_labeling", + "file": "/root/report.md", + "question": "Does the report use semanage port for labeling nonstandard bind ports in SELinux?", + "reference": "A skilled report uses 'semanage port -a -t http_port_t -p tcp ' for nonstandard ports. An unskilled report suggests disabling SELinux or only uses setsebool." + }, + { + "id": "concrete_rhel_remediation", + "file": "/root/report.md", + "question": "Does the report provide concrete systemctl, firewall-cmd, and semanage/restorecon commands for RHEL troubleshooting?", + "reference": "A skilled report provides specific commands for each layer: systemctl restart for services, firewall-cmd --add-port for networking, semanage+restorecon for SELinux. An unskilled report gives high-level suggestions." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__debug-rhel/tests/test.sh b/evaluation/with_skills/rh-developer__debug-rhel/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__debug-rhel/tests/test_outputs.py b/evaluation/with_skills/rh-developer__debug-rhel/tests/test_outputs.py new file mode 100644 index 00000000..6ba9216b --- /dev/null +++ b/evaluation/with_skills/rh-developer__debug-rhel/tests/test_outputs.py @@ -0,0 +1,97 @@ +""" +Tests for rh-developer__debug-rhel per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_rhel_or_system(self): + content = read_report().lower() + assert "rhel" in content or "system" in content or "service" in content, ( + "report should mention RHEL or system" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_ausearch_avc_command(self): + """Skill teaches ausearch -m AVC -ts recent for recent SELinux denials. + Without skill, agents use generic SELinux checks without ausearch.""" + c = read_report().lower() + assert "ausearch" in c, ( + "should use ausearch for SELinux AVC denial investigation" + ) + + def test_semanage_port_labeling(self): + """Skill teaches semanage port for nonstandard bind port SELinux labeling. 
+ Without skill, agents skip port-level SELinux context management.""" + c = read_report().lower() + assert "semanage port" in c or ("semanage" in c and "port" in c), ( + "should use semanage port for nonstandard port SELinux labeling" + ) + + def test_systemd_journal_workflow(self): + """Skill teaches systemctl status + journalctl -u for service logs.""" + c = read_report().lower() + assert any(t in c for t in ["systemctl", "journalctl"]) and any(t in c for t in [ + "status", "-u", "service", "log" + ]), "should use systemd/journal workflow" + + def test_firewall_cmd(self): + """Skill teaches firewall-cmd for port management.""" + c = read_report().lower() + assert "firewall-cmd" in c or "firewall" in c, ( + "should check firewall configuration" + ) + + def test_concrete_remediation(self): + """Skill teaches concrete remediation commands for RHEL issues.""" + c = read_report().lower() + assert any(t in c for t in ["systemctl restart", "firewall-cmd", "semanage", "restorecon"]) or ( + "```" in read_report() and any(t in c for t in ["sudo", "systemctl"]) + ), "should include concrete RHEL remediation commands" + + def test_permanent_firewall_flag(self): + """Skill teaches using --permanent flag with firewall-cmd to persist rules + across reboots. Without skill, agents use firewall-cmd without --permanent, + creating rules that are lost on reboot.""" + c = read_report() + assert "--permanent" in c, ( + "should use --permanent flag with firewall-cmd for persistent rules" + ) + + def test_http_port_t_selinux_type(self): + """Skill teaches the specific SELinux type http_port_t for web service ports. 
+ Without skill, agents use generic semanage commands without specifying the + correct SELinux type for HTTP ports.""" + c = read_report() + assert "http_port_t" in c, ( + "should reference http_port_t SELinux type for port labeling" + ) + + def test_getenforce_check(self): + """Skill teaches using getenforce to verify SELinux mode (Enforcing/Permissive) + as a first diagnostic step. Without skill, agents jump to specific SELinux + fixes without verifying the enforcement mode.""" + c = read_report().lower() + assert "getenforce" in c, ( + "should use getenforce to check SELinux enforcement mode" + ) diff --git a/evaluation/with_skills/rh-developer__deploy/environment/Dockerfile b/evaluation/with_skills/rh-developer__deploy/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs 
/root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed at the integration-test task, logs in step-test container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/buildconfig.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/buildconfig.yaml.template new file mode 100644 index 00000000..b3294eb2 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/buildconfig.yaml.template @@ -0,0 +1,38 @@ +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: build + app.kubernetes.io/part-of: ${APP_NAME} +spec: + source: + type: Git + git: + uri: ${GIT_URL} + ref: ${GIT_BRANCH} + strategy: + type: Source + sourceStrategy: + from: + kind: DockerImage + name: ${BUILDER_IMAGE} + env: [] + output: + to: + kind: ImageStreamTag + name: ${APP_NAME}:latest + triggers: + - type: ConfigChange + - type: ImageChange + runPolicy: Serial + resources: + limits: + memory: "1Gi" + cpu: "1" + requests: + memory: "512Mi" + cpu: "500m" diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/deployment.yaml.template new file mode 100644 index 00000000..eb3b481a --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: application + app.kubernetes.io/part-of: ${APP_NAME} + annotations: 
+ image.openshift.io/triggers: | + [{"from":{"kind":"ImageStreamTag","name":"${APP_NAME}:latest"},"fieldPath":"spec.template.spec.containers[0].image"}] +spec: + replicas: ${REPLICAS} + selector: + matchLabels: + app: ${APP_NAME} + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + template: + metadata: + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + spec: + containers: + - name: ${APP_NAME} + image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/${APP_NAME}:latest + ports: + - containerPort: ${CONTAINER_PORT} + protocol: TCP + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + livenessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + env: [] + restartPolicy: Always + terminationGracePeriodSeconds: 30 diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/Chart.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/Chart.yaml.template new file mode 100644 index 00000000..1aa22dd1 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/Chart.yaml.template @@ -0,0 +1,13 @@ +apiVersion: v2 +name: ${APP_NAME} +description: ${APP_DESCRIPTION} +type: application +version: 0.1.0 +appVersion: "${APP_VERSION}" +keywords: + - ${LANGUAGE} + - ${FRAMEWORK} + - openshift +maintainers: + - name: ${MAINTAINER_NAME} + email: ${MAINTAINER_EMAIL} diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/NOTES.txt.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/NOTES.txt.template new file mode 100644 index 00000000..154e628d --- /dev/null +++ 
b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/NOTES.txt.template @@ -0,0 +1,32 @@ +Congratulations! Your application {{ include "${APP_NAME}.fullname" . }} has been deployed. + +{{- if .Values.route.enabled }} + +Access your application at: +{{- if .Values.route.host }} + https://{{ .Values.route.host }} +{{- else }} + Run: oc get route {{ include "${APP_NAME}.fullname" . }} -o jsonpath='{.spec.host}' +{{- end }} + +{{- else }} + +Your application is available internally at: + {{ include "${APP_NAME}.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.service.port }} + +To expose it externally, create a Route or set route.enabled=true. + +{{- end }} + +Useful commands: + # View pods + oc get pods -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} + + # View logs + oc logs -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} -f + + # Upgrade release + helm upgrade {{ .Release.Name }} ./{{ .Chart.Name }} -f values.yaml + + # Uninstall release + helm uninstall {{ .Release.Name }} diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/_helpers.tpl.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/_helpers.tpl.template new file mode 100644 index 00000000..15873b10 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/_helpers.tpl.template @@ -0,0 +1,60 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "${APP_NAME}.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. 
+*/}} +{{- define "${APP_NAME}.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "${APP_NAME}.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "${APP_NAME}.labels" -}} +helm.sh/chart: {{ include "${APP_NAME}.chart" . }} +{{ include "${APP_NAME}.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "${APP_NAME}.selectorLabels" -}} +app.kubernetes.io/name: {{ include "${APP_NAME}.name" . }} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "${APP_NAME}.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "${APP_NAME}.fullname" .) .Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/deployment.yaml.template new file mode 100644 index 00000000..a6cbd868 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "${APP_NAME}.fullname" . 
}} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + selector: + matchLabels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "${APP_NAME}.serviceAccountName" . }} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 8 }} + containers: + - name: {{ .Chart.Name }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + ports: + - name: http + containerPort: {{ .Values.service.port }} + protocol: TCP + livenessProbe: + {{- toYaml .Values.livenessProbe | nindent 12 }} + readinessProbe: + {{- toYaml .Values.readinessProbe | nindent 12 }} + resources: + {{- toYaml .Values.resources | nindent 12 }} + {{- with .Values.env }} + env: + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . 
| nindent 8 }} + {{- end }} diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/route.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/route.yaml.template new file mode 100644 index 00000000..e2bab29a --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/route.yaml.template @@ -0,0 +1,24 @@ +{{- if .Values.route.enabled }} +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if .Values.route.host }} + host: {{ .Values.route.host }} + {{- end }} + to: + kind: Service + name: {{ include "${APP_NAME}.fullname" . }} + weight: 100 + port: + targetPort: http + {{- with .Values.route.tls }} + tls: + termination: {{ .termination }} + insecureEdgeTerminationPolicy: {{ .insecureEdgeTerminationPolicy }} + {{- end }} + wildcardPolicy: None +{{- end }} diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/service.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/service.yaml.template new file mode 100644 index 00000000..837bc888 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/templates/service.yaml.template @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + - port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + name: http + selector: + {{- include "${APP_NAME}.selectorLabels" . 
| nindent 4 }} diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/values.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/values.yaml.template new file mode 100644 index 00000000..1cca6017 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/helm/values.yaml.template @@ -0,0 +1,67 @@ +# Default values for ${APP_NAME} +replicaCount: 1 + +image: + repository: ${IMAGE_REPOSITORY} + pullPolicy: IfNotPresent + tag: "${IMAGE_TAG}" + +imagePullSecrets: [] +nameOverride: "" +fullnameOverride: "" + +serviceAccount: + create: true + annotations: {} + name: "" + +podAnnotations: {} +podSecurityContext: {} +securityContext: {} + +service: + type: ClusterIP + port: ${CONTAINER_PORT} + +route: + enabled: true + host: "" + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 5 + periodSeconds: 5 + +autoscaling: + enabled: false + minReplicas: 1 + maxReplicas: 5 + targetCPUUtilizationPercentage: 80 + +nodeSelector: {} +tolerations: [] +affinity: {} + +env: [] +# - name: MY_VAR +# value: "my-value" diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/imagestream.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/imagestream.yaml.template new file mode 100644 index 00000000..46572193 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/imagestream.yaml.template @@ -0,0 +1,13 @@ +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: image + 
app.kubernetes.io/part-of: ${APP_NAME} +spec: + lookupPolicy: + local: false diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/route.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/route.yaml.template new file mode 100644 index 00000000..7c53d2e7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/route.yaml.template @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: route + app.kubernetes.io/part-of: ${APP_NAME} +spec: + to: + kind: Service + name: ${APP_NAME} + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/service.yaml.template b/evaluation/with_skills/rh-developer__deploy/environment/templates/service.yaml.template new file mode 100644 index 00000000..7e1cf371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/service.yaml.template @@ -0,0 +1,20 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: service + app.kubernetes.io/part-of: ${APP_NAME} +spec: + selector: + app: ${APP_NAME} + ports: + - name: http + port: ${CONTAINER_PORT} + targetPort: ${CONTAINER_PORT} + protocol: TCP + type: ClusterIP + sessionAffinity: None diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-container-rootful.service b/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-container-rootful.service new file mode 100644 index 00000000..c1e8fe8f --- /dev/null +++ 
b/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-container-rootful.service @@ -0,0 +1,27 @@ +# Rootful Podman container managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-container-rootless.service b/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-container-rootless.service new file mode 100644 index 00000000..ca9dc371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-container-rootless.service @@ -0,0 +1,27 @@ +# Rootless Podman container managed by systemd (user service) +# Location: ~/.config/systemd/user/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] 
+WantedBy=default.target diff --git a/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-native.service b/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-native.service new file mode 100644 index 00000000..c55cfc07 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/environment/templates/systemd/systemd-native.service @@ -0,0 +1,39 @@ +# Native application managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${SERVICE_USER} - User to run the service as +# ${APP_PATH} - Application install path (e.g., /opt/app-name) +# ${PORT} - Application listen port +# ${START_COMMAND} - Application start command +# +# Start command examples by language: +# Node.js: /usr/bin/node ${APP_PATH}/server.js +# Python: /usr/bin/python3 ${APP_PATH}/app.py +# Java: /usr/bin/java -jar ${APP_PATH}/app.jar +# Go: ${APP_PATH}/binary-name + +[Unit] +Description=${APP_NAME} Service +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=${SERVICE_USER} +WorkingDirectory=${APP_PATH} +Environment=PORT=${PORT} +ExecStart=${START_COMMAND} +Restart=always +RestartSec=5 + +# Security hardening +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +ReadWritePaths=${APP_PATH} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__deploy/instruction.md b/evaluation/with_skills/rh-developer__deploy/instruction.md new file mode 100644 index 00000000..f84c6177 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/instruction.md @@ -0,0 +1,14 @@ +# Application Deployment Task + +You are a Red Hat developer. Your team needs to deploy a web application to OpenShift that will be accessible to external users via HTTPS. 
+ +## Requirements +- Examine the target namespace and available resources on the cluster +- Define the deployment: container image, replica count, resource requests and limits, and health checks +- Configure a Service to expose the application pods internally +- Configure a Route for external HTTPS access with appropriate TLS settings +- Verify the deployment plan addresses image pull access and correct container port mapping + +Document your deployment plan and the complete resource definitions in `/root/report.md`. + +Use MCP tools to examine the cluster. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__deploy/solution/solve.sh b/evaluation/with_skills/rh-developer__deploy/solution/solve.sh new file mode 100644 index 00000000..b8f9ec1b --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/solution/solve.sh @@ -0,0 +1,61 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Deployment Plan: customer-portal + +## Deployment +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: customer-portal +spec: + replicas: 1 + selector: + matchLabels: + app: customer-portal + template: + metadata: + labels: + app: customer-portal + spec: + containers: + - name: customer-portal + image: image-registry.openshift-image-registry.svc:5000/myproject/customer-portal:latest + ports: + - containerPort: 3000 +``` + +## Service +```yaml +apiVersion: v1 +kind: Service +metadata: + name: customer-portal +spec: + selector: + app: customer-portal + ports: + - port: 3000 + targetPort: 3000 +``` + +## Route +```yaml +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: customer-portal +spec: + to: + kind: Service + name: customer-portal + port: + targetPort: 3000 + tls: + termination: edge +``` + +### Internal DNS: `http://customer-portal.myproject.svc.cluster.local:3000` + +### On failure: Debug Pod (/debug-pod) or Debug Network (/debug-network) 
+REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__deploy/task.toml b/evaluation/with_skills/rh-developer__deploy/task.toml new file mode 100644 index 00000000..86e6c127 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__deploy" +name = "rh-developer Deployment Planning Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__deploy/tests/llm_judge.py b/evaluation/with_skills/rh-developer__deploy/tests/llm_judge.py new file mode 100644 index 00000000..5ce75615 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "security_hardening", + "file": "/root/report.md", + "question": "Does the report include deployment security hardening such as runAsNonRoot, allowPrivilegeEscalation: false, seccompProfile, or insecureEdgeTerminationPolicy: Redirect on the Route?", + "reference": "A skilled report includes security context on the Deployment (runAsNonRoot: true, allowPrivilegeEscalation: false) and configures Route with insecureEdgeTerminationPolicy: Redirect. An unskilled report creates basic Deployment+Service+Route without security hardening." 
+ }, + { + "id": "deployment_service_route", + "file": "/root/report.md", + "question": "Does the report create all three resources (Deployment, Service, Route) with correct selector/port alignment?", + "reference": "A skilled report defines Deployment + Service + Route with matching selectors, targetPort, and containerPort. An unskilled report may miss selector alignment or skip the Route." + }, + { + "id": "tls_and_port_detection", + "file": "/root/report.md", + "question": "Does the report address TLS termination for the Route and port detection based on framework defaults?", + "reference": "A skilled report configures TLS (edge/passthrough) on the Route and detects the application port from framework conventions. An unskilled report hardcodes port 8080 and skips TLS." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__deploy/tests/test.sh b/evaluation/with_skills/rh-developer__deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__deploy/tests/test_outputs.py b/evaluation/with_skills/rh-developer__deploy/tests/test_outputs.py new file mode 100644 index 00000000..01ea8257 --- /dev/null +++ b/evaluation/with_skills/rh-developer__deploy/tests/test_outputs.py @@ -0,0 +1,87 @@ +""" +Tests for rh-developer__deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_deploy(self): + content = read_report().lower() + assert "deploy" in content, "report should mention deployment" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_insecure_redirect_policy(self): + """Skill teaches insecureEdgeTerminationPolicy: Redirect on Route to force + HTTP→HTTPS. Without skill, agents create Routes without redirect policy, + leaving HTTP access open.""" + c = read_report() + assert "insecureEdgeTerminationPolicy" in c or ( + "Redirect" in c and ("http" in c.lower() and "https" in c.lower()) + ), "should configure insecureEdgeTerminationPolicy: Redirect on Route" + + def test_framework_port_detection(self): + """Skill teaches port inference by framework defaults (Node 3000/8080, + Python 5000/8000, Java 8080). 
Without skill, agents hardcode 8080.""" + c = read_report().lower() + assert any(t in c for t in ["port", "8080", "3000", "5000"]) and any(t in c for t in [ + "detect", "expose", "listen", "framework", "default", "infer" + ]), "should address port detection from framework defaults" + + def test_deployment_service_route_triad(self): + """Skill teaches creating Deployment, Service, Route in sequence.""" + c = read_report().lower() + assert any(t in c for t in ["deployment"]) and "service" in c and any(t in c for t in [ + "route", "external", "https" + ]), "should define Deployment + Service + Route" + + def test_selector_alignment(self): + """Skill teaches Service selector must match Deployment pod labels.""" + c = read_report().lower() + assert any(t in c for t in ["selector", "label", "targetport", "target port"]) or ( + "service" in c and "port" in c and "match" in c + ), "should address selector/port alignment" + + def test_tls_route_config(self): + """Skill teaches Route with TLS termination (edge/passthrough).""" + c = read_report().lower() + assert any(t in c for t in ["tls", "https", "edge", "termination"]), ( + "should address Route TLS for external access" + ) + + def test_hpa_autoscaling(self): + """Skill teaches including HorizontalPodAutoscaler configuration for + production deployments. Without skill, agents set static replica count + without autoscaling.""" + c = read_report() + assert "HorizontalPodAutoscaler" in c or "autoscaling/v2" in c or ( + "hpa" in c.lower() and "autoscal" in c.lower() + ), "should include HorizontalPodAutoscaler for production scaling" + + def test_hsts_security_headers(self): + """Skill teaches HSTS headers or Strict-Transport-Security configuration + on OpenShift Routes. 
Without skill, agents skip transport security headers.""" + c = read_report() + assert any(t in c for t in [ + "HSTS", "Strict-Transport-Security", "hsts", + "haproxy.router.openshift.io", + ]), "should configure HSTS or transport security headers on Route" diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/Dockerfile b/evaluation/with_skills/rh-developer__detect-project/environment/Dockerfile new file mode 100644 index 00000000..608ae0df --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/Dockerfile @@ -0,0 +1,71 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs +COPY sample-project /root/project + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + 
"openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__detect-project/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/.s2i/environment b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/.s2i/environment new file mode 100644 index 00000000..a16a265c --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/.s2i/environment @@ -0,0 +1 @@ +APP_FILE=app.py diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/Dockerfile b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/Dockerfile new file mode 100644 index 00000000..a7fb87b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/Dockerfile @@ -0,0 +1,9 @@ +FROM python:3.11-slim + +WORKDIR /app +COPY requirements.txt . +RUN pip install -r requirements.txt +COPY . . + +EXPOSE 8080 +CMD ["gunicorn", "-b", "0.0.0.0:8080", "app:app"] diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/app.py b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/app.py new file mode 100644 index 00000000..4761fe8a --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/app.py @@ -0,0 +1,12 @@ +from flask import Flask + +app = Flask(__name__) + + +@app.route("/") +def hello(): + return "Hello, World!" 
+ + +if __name__ == "__main__": + app.run(host="0.0.0.0", port=8080) diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/requirements.txt b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/requirements.txt new file mode 100644 index 00000000..cb04ebda --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/requirements.txt @@ -0,0 +1,3 @@ +flask +gunicorn +psycopg2-binary diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/tests/test_app.py b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/tests/test_app.py new file mode 100644 index 00000000..5e8fbc93 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/sample-project/tests/test_app.py @@ -0,0 +1,9 @@ +import pytest +from app import app + + +def test_hello(): + with app.test_client() as client: + r = client.get("/") + assert r.status_code == 200 + assert b"Hello" in r.data diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/buildconfig.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/buildconfig.yaml.template new file mode 100644 index 00000000..b3294eb2 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/buildconfig.yaml.template @@ -0,0 +1,38 @@ +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: build + app.kubernetes.io/part-of: ${APP_NAME} +spec: + source: + type: Git + git: + uri: ${GIT_URL} + ref: ${GIT_BRANCH} + strategy: + type: Source + sourceStrategy: + from: + kind: DockerImage + name: ${BUILDER_IMAGE} + env: [] + output: + to: + kind: ImageStreamTag + name: ${APP_NAME}:latest + triggers: + - type: ConfigChange + - type: 
ImageChange + runPolicy: Serial + resources: + limits: + memory: "1Gi" + cpu: "1" + requests: + memory: "512Mi" + cpu: "500m" diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/deployment.yaml.template new file mode 100644 index 00000000..eb3b481a --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: application + app.kubernetes.io/part-of: ${APP_NAME} + annotations: + image.openshift.io/triggers: | + [{"from":{"kind":"ImageStreamTag","name":"${APP_NAME}:latest"},"fieldPath":"spec.template.spec.containers[0].image"}] +spec: + replicas: ${REPLICAS} + selector: + matchLabels: + app: ${APP_NAME} + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + template: + metadata: + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + spec: + containers: + - name: ${APP_NAME} + image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/${APP_NAME}:latest + ports: + - containerPort: ${CONTAINER_PORT} + protocol: TCP + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + livenessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + env: [] + restartPolicy: Always + terminationGracePeriodSeconds: 30 diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/Chart.yaml.template 
b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/Chart.yaml.template new file mode 100644 index 00000000..1aa22dd1 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/Chart.yaml.template @@ -0,0 +1,13 @@ +apiVersion: v2 +name: ${APP_NAME} +description: ${APP_DESCRIPTION} +type: application +version: 0.1.0 +appVersion: "${APP_VERSION}" +keywords: + - ${LANGUAGE} + - ${FRAMEWORK} + - openshift +maintainers: + - name: ${MAINTAINER_NAME} + email: ${MAINTAINER_EMAIL} diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/NOTES.txt.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/NOTES.txt.template new file mode 100644 index 00000000..154e628d --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/NOTES.txt.template @@ -0,0 +1,32 @@ +Congratulations! Your application {{ include "${APP_NAME}.fullname" . }} has been deployed. + +{{- if .Values.route.enabled }} + +Access your application at: +{{- if .Values.route.host }} + https://{{ .Values.route.host }} +{{- else }} + Run: oc get route {{ include "${APP_NAME}.fullname" . }} -o jsonpath='{.spec.host}' +{{- end }} + +{{- else }} + +Your application is available internally at: + {{ include "${APP_NAME}.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.service.port }} + +To expose it externally, create a Route or set route.enabled=true. + +{{- end }} + +Useful commands: + # View pods + oc get pods -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} + + # View logs + oc logs -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . 
}} -f + + # Upgrade release + helm upgrade {{ .Release.Name }} ./{{ .Chart.Name }} -f values.yaml + + # Uninstall release + helm uninstall {{ .Release.Name }} diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/_helpers.tpl.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/_helpers.tpl.template new file mode 100644 index 00000000..15873b10 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/_helpers.tpl.template @@ -0,0 +1,60 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "${APP_NAME}.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. +*/}} +{{- define "${APP_NAME}.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "${APP_NAME}.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "${APP_NAME}.labels" -}} +helm.sh/chart: {{ include "${APP_NAME}.chart" . }} +{{ include "${APP_NAME}.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "${APP_NAME}.selectorLabels" -}} +app.kubernetes.io/name: {{ include "${APP_NAME}.name" . 
}} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "${APP_NAME}.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "${APP_NAME}.fullname" .) .Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/deployment.yaml.template new file mode 100644 index 00000000..a6cbd868 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + selector: + matchLabels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "${APP_NAME}.serviceAccountName" . 
}} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 8 }} + containers: + - name: {{ .Chart.Name }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + ports: + - name: http + containerPort: {{ .Values.service.port }} + protocol: TCP + livenessProbe: + {{- toYaml .Values.livenessProbe | nindent 12 }} + readinessProbe: + {{- toYaml .Values.readinessProbe | nindent 12 }} + resources: + {{- toYaml .Values.resources | nindent 12 }} + {{- with .Values.env }} + env: + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/route.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/route.yaml.template new file mode 100644 index 00000000..e2bab29a --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/route.yaml.template @@ -0,0 +1,24 @@ +{{- if .Values.route.enabled }} +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if .Values.route.host }} + host: {{ .Values.route.host }} + {{- end }} + to: + kind: Service + name: {{ include "${APP_NAME}.fullname" . 
}} + weight: 100 + port: + targetPort: http + {{- with .Values.route.tls }} + tls: + termination: {{ .termination }} + insecureEdgeTerminationPolicy: {{ .insecureEdgeTerminationPolicy }} + {{- end }} + wildcardPolicy: None +{{- end }} diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/service.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/service.yaml.template new file mode 100644 index 00000000..837bc888 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/templates/service.yaml.template @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + - port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + name: http + selector: + {{- include "${APP_NAME}.selectorLabels" . 
| nindent 4 }} diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/values.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/values.yaml.template new file mode 100644 index 00000000..1cca6017 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/helm/values.yaml.template @@ -0,0 +1,67 @@ +# Default values for ${APP_NAME} +replicaCount: 1 + +image: + repository: ${IMAGE_REPOSITORY} + pullPolicy: IfNotPresent + tag: "${IMAGE_TAG}" + +imagePullSecrets: [] +nameOverride: "" +fullnameOverride: "" + +serviceAccount: + create: true + annotations: {} + name: "" + +podAnnotations: {} +podSecurityContext: {} +securityContext: {} + +service: + type: ClusterIP + port: ${CONTAINER_PORT} + +route: + enabled: true + host: "" + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 5 + periodSeconds: 5 + +autoscaling: + enabled: false + minReplicas: 1 + maxReplicas: 5 + targetCPUUtilizationPercentage: 80 + +nodeSelector: {} +tolerations: [] +affinity: {} + +env: [] +# - name: MY_VAR +# value: "my-value" diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/imagestream.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/imagestream.yaml.template new file mode 100644 index 00000000..46572193 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/imagestream.yaml.template @@ -0,0 +1,13 @@ +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + 
app.kubernetes.io/component: image + app.kubernetes.io/part-of: ${APP_NAME} +spec: + lookupPolicy: + local: false diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/route.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/route.yaml.template new file mode 100644 index 00000000..7c53d2e7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/route.yaml.template @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: route + app.kubernetes.io/part-of: ${APP_NAME} +spec: + to: + kind: Service + name: ${APP_NAME} + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/service.yaml.template b/evaluation/with_skills/rh-developer__detect-project/environment/templates/service.yaml.template new file mode 100644 index 00000000..7e1cf371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/service.yaml.template @@ -0,0 +1,20 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: service + app.kubernetes.io/part-of: ${APP_NAME} +spec: + selector: + app: ${APP_NAME} + ports: + - name: http + port: ${CONTAINER_PORT} + targetPort: ${CONTAINER_PORT} + protocol: TCP + type: ClusterIP + sessionAffinity: None diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootful.service b/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootful.service new file mode 100644 index 
00000000..c1e8fe8f --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootful.service @@ -0,0 +1,27 @@ +# Rootful Podman container managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootless.service b/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootless.service new file mode 100644 index 00000000..ca9dc371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootless.service @@ -0,0 +1,27 @@ +# Rootless Podman container managed by systemd (user service) +# Location: ~/.config/systemd/user/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} 
+ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=default.target diff --git a/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-native.service b/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-native.service new file mode 100644 index 00000000..c55cfc07 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/environment/templates/systemd/systemd-native.service @@ -0,0 +1,39 @@ +# Native application managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${SERVICE_USER} - User to run the service as +# ${APP_PATH} - Application install path (e.g., /opt/app-name) +# ${PORT} - Application listen port +# ${START_COMMAND} - Application start command +# +# Start command examples by language: +# Node.js: /usr/bin/node ${APP_PATH}/server.js +# Python: /usr/bin/python3 ${APP_PATH}/app.py +# Java: /usr/bin/java -jar ${APP_PATH}/app.jar +# Go: ${APP_PATH}/binary-name + +[Unit] +Description=${APP_NAME} Service +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=${SERVICE_USER} +WorkingDirectory=${APP_PATH} +Environment=PORT=${PORT} +ExecStart=${START_COMMAND} +Restart=always +RestartSec=5 + +# Security hardening +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +ReadWritePaths=${APP_PATH} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__detect-project/instruction.md b/evaluation/with_skills/rh-developer__detect-project/instruction.md new file mode 100644 index 00000000..04695ff5 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/instruction.md @@ -0,0 +1,13 @@ +# Project Detection Task + +You are a Red Hat developer. A colleague has handed you a source repository and asked you to figure out what it is and how to deploy it to OpenShift. 
+ +## Requirements +- Examine the project files to identify the programming language, version, and package manager +- Detect the application framework (e.g., Flask, Express, Spring) and build system +- Based on what you find, recommend a deployment strategy: which builder image or base image to use, what build process to follow, and how the application should be started +- Explain your reasoning for the recommended approach + +Document your analysis and deployment recommendation in `/root/report.md`. + +Use available tools to examine the environment. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__detect-project/solution/solve.sh b/evaluation/with_skills/rh-developer__detect-project/solution/solve.sh new file mode 100644 index 00000000..700e7ad4 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/solution/solve.sh @@ -0,0 +1,37 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Project Detection Report + +## Repository: /root/project + +### Detection Methodology +Scanned for indicator files: requirements.txt, package.json, pom.xml, go.mod, Gemfile. +Found: `requirements.txt` → Python project. + +### Detected Type +- **Language**: Python +- **Indicator**: `requirements.txt` found +- **Framework**: Flask (detected from `from flask import Flask` in app.py) +- **Entry Point**: `app.py` with `app = Flask(__name__)` + +### Helm Chart Search +Searched locations: ./Chart.yaml, ./chart/Chart.yaml, ./charts/*/Chart.yaml, ./helm/Chart.yaml, ./deploy/helm/Chart.yaml +Result: No Helm chart found — S2I or Dockerfile strategy recommended. 
+ +### S2I Python Configuration +- **APP_MODULE**: `app:app` (module `app` from `app.py`, WSGI callable `app`) +- **gunicorn** is present in `requirements.txt` — required for the S2I Python builder to serve via APP_MODULE +- S2I Python builder uses gunicorn as the WSGI server when APP_MODULE is set + +### Recommended Builder Image +`registry.access.redhat.com/ubi9/python-39` (UBI base image) + +### Health Checks +- Add `/health` and `/ready` endpoints for OpenShift liveness/readiness probes + +### Recommended Deployment Strategy +1. **Primary**: S2I with `ubi9/python-39` builder image + - Set `APP_MODULE=app:app` in BuildConfig sourceStrategy.env + - Ensure gunicorn is in requirements.txt +2. **Alternative**: Containerize with Dockerfile using UBI base image +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__detect-project/task.toml b/evaluation/with_skills/rh-developer__detect-project/task.toml new file mode 100644 index 00000000..78be6504 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__detect-project" +name = "rh-developer Project Detection Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "detect-project", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__detect-project/tests/llm_judge.py b/evaluation/with_skills/rh-developer__detect-project/tests/llm_judge.py new file mode 100644 index 00000000..67b69834 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/tests/llm_judge.py @@ -0,0 +1,102 @@ +import json +import os +import sys +import time +from pathlib import 
Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "s2i_entry_point_sequence", + "file": "/root/report.md", + "question": "Does the report describe the S2I Python builder's entry point detection order — specifically mentioning that the builder checks for files like app.sh before falling back to app.py, and how app.py being the default entry point affects startup?", + "reference": "A skilled report describes the S2I Python startup sequence (check app.sh first, then application.py, then app.py) and explains that since app.py is found, gunicorn will serve it automatically. An unskilled report mentions app.py as the entry point without describing the detection sequence the builder follows." + }, + { + "id": "app_module_gunicorn_link", + "file": "/root/report.md", + "question": "Does the report explain the connection between gunicorn in requirements.txt and APP_MODULE configuration for the S2I Python builder — specifically that gunicorn is required for APP_MODULE to work?", + "reference": "A skilled report connects gunicorn to APP_MODULE, explaining that the S2I Python builder needs gunicorn in requirements.txt to serve the app specified by APP_MODULE (e.g., app:app). An unskilled report mentions gunicorn as a generic web server without connecting it to S2I builder mechanics." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__detect-project/tests/test.sh b/evaluation/with_skills/rh-developer__detect-project/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + 
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-developer__detect-project/tests/test_outputs.py b/evaluation/with_skills/rh-developer__detect-project/tests/test_outputs.py new file mode 100644 index 00000000..3da3a2dc --- /dev/null +++ b/evaluation/with_skills/rh-developer__detect-project/tests/test_outputs.py @@ -0,0 +1,79 @@ +""" +Tests for rh-developer__detect-project per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_project_or_language(self): + content = read_report().lower() + assert any(t in content for t in ["project", "language", "framework", "detect"]), ( + "report should mention project detection" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 100, "report should have substantial content" + + +class TestSkillDependent: + def test_s2i_deployment_recommendation(self): + """Skill teaches S2I as preferred deployment for OpenShift.""" + c = read_report().lower() + assert "s2i" in c or "source-to-image" in c or "source to image" in c, ( + "should recommend S2I as deployment strategy for OpenShift" + ) + + def test_app_module_format(self): + """Skill teaches APP_MODULE format 'module:callable' (e.g., app:app) for + S2I Python. 
Without skill, agents don't know this configuration.""" + c = read_report().lower() + assert "app_module" in c and any(t in c for t in [ + "app:app", "module:", ":app", "module:callable", "wsgi", + ]), "should specify APP_MODULE format (e.g., app:app) for S2I Python" + + def test_gunicorn_s2i_link(self): + """Skill teaches gunicorn is required IN requirements.txt for the S2I + Python builder to use APP_MODULE. Without skill, agents mention gunicorn + generically without connecting it to S2I builder requirements.""" + c = read_report().lower() + assert "gunicorn" in c and ("s2i" in c or "app_module" in c or "builder" in c), ( + "should connect gunicorn to S2I/APP_MODULE (not just as a generic server)" + ) + + def test_ubi_base_image_recommendation(self): + """Skill teaches UBI as the base image for OpenShift.""" + c = read_report().lower() + assert "ubi" in c or "universal base image" in c, ( + "should recommend UBI base image for OpenShift deployment" + ) + + def test_s2i_entry_point_detection(self): + """Skill teaches the S2I Python entry point detection order + (app.sh → application.py → app.py). 
Without skill, agents don't + describe the builder's startup sequence.""" + c = read_report().lower() + has_sequence = "app.sh" in c + has_default_entry = ("default" in c or "entry point" in c) and "app.py" in c + has_startup = any(t in c for t in [ + "startup logic", "startup sequence", "s2i startup", + "entry point detection", "entry point order", + ]) + assert has_sequence or has_default_entry or has_startup, ( + "should describe S2I Python entry point detection (app.sh/app.py sequence)" + ) diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/Dockerfile b/evaluation/with_skills/rh-developer__helm-deploy/environment/Dockerfile new file mode 100644 index 00000000..f0cfbbda --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills 
/root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "helm": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-helm-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-helm-mcp.py b/evaluation/with_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-helm-mcp.py new file mode 100644 index 00000000..8909ad01 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-helm-mcp.py @@ -0,0 +1,231 @@ +#!/usr/bin/env python3 +""" +Mock Helm MCP Server for rh-developer helm-deploy benchmark task. + +Simulates Helm CLI operations for OpenShift deployment planning. +""" + +from typing import Optional + +from fastmcp import FastMCP + +mcp = FastMCP("helm") + +# Mock data for existing releases +MOCK_RELEASES = [ + { + "name": "api-service", + "namespace": "api-platform", + "revision": 3, + "updated": "2026-02-15T10:30:00Z", + "status": "deployed", + "chart": "api-service-1.2.0", + "app_version": "1.0.0", + }, + { + "name": "web-frontend", + "namespace": "web-frontend", + "revision": 1, + "updated": "2026-02-14T14:20:00Z", + "status": "deployed", + "chart": "web-frontend-0.1.0", + "app_version": "1.0.0", + }, +] + +MOCK_CHART_METADATA = { + "name": "my-app", + "version": "0.1.0", + "appVersion": "1.0.0", + "description": "OpenShift deployment chart for my-app", + "keywords": ["openshift", "deployment"], + "maintainers": [{"name": "Red Hat", "email": "openshift@redhat.com"}], +} + +MOCK_DEFAULT_VALUES = """replicaCount: 1 + +image: + repository: quay.io/example/my-app + tag: latest + pullPolicy: IfNotPresent + +service: + type: ClusterIP + port: 8080 + +route: + enabled: 
true + host: "" + +resources: + limits: + cpu: 500m + memory: 512Mi + requests: + cpu: 100m + memory: 256Mi +""" + +MOCK_RENDERED_YAML = """--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-app + labels: + app: my-app +spec: + replicas: 1 + selector: + matchLabels: + app: my-app + template: + metadata: + labels: + app: my-app + spec: + containers: + - name: my-app + image: quay.io/example/my-app:latest + ports: + - containerPort: 8080 +--- +apiVersion: v1 +kind: Service +metadata: + name: my-app +spec: + ports: + - port: 8080 + targetPort: 8080 + selector: + app: my-app +--- +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: my-app +spec: + to: + kind: Service + name: my-app + port: + targetPort: 8080 +""" + + +@mcp.tool +def helm_list(namespace: str) -> dict: + """List installed Helm releases in a namespace. + + Args: + namespace: The Kubernetes/OpenShift namespace to list releases from. + """ + releases = [r for r in MOCK_RELEASES if r["namespace"] == namespace] + return { + "releases": releases, + "count": len(releases), + "namespace": namespace, + } + + +@mcp.tool +def helm_show_chart(chart: str) -> dict: + """Show chart metadata (name, version, description). + + Args: + chart: Path to chart directory or chart name (e.g. ./chart or my-chart). + """ + return { + "chart": chart, + "metadata": MOCK_CHART_METADATA, + } + + +@mcp.tool +def helm_show_values(chart: str) -> dict: + """Show default values for a chart. + + Args: + chart: Path to chart directory or chart name. + """ + return { + "chart": chart, + "values": MOCK_DEFAULT_VALUES, + } + + +@mcp.tool +def helm_template( + release_name: str, + chart: str, + namespace: str, + values: Optional[str] = None, +) -> dict: + """Render chart templates to YAML with given values. + + Args: + release_name: Name for the release. + chart: Path to chart directory. + namespace: Target namespace. + values: Optional YAML string of values to override defaults. 
+ """ + return { + "release_name": release_name, + "chart": chart, + "namespace": namespace, + "rendered": MOCK_RENDERED_YAML, + } + + +@mcp.tool +def helm_install_dry_run( + release_name: str, + chart: str, + namespace: str, + values: Optional[str] = None, +) -> dict: + """Simulate helm install (dry-run) to validate before deploying. + + Args: + release_name: Name for the release. + chart: Path to chart directory. + namespace: Target namespace. + values: Optional YAML string of values to override defaults. + """ + return { + "release_name": release_name, + "chart": chart, + "namespace": namespace, + "dry_run": True, + "status": "would_create", + "resources": ["Deployment/my-app", "Service/my-app", "Route/my-app"], + } + + +@mcp.tool +def helm_status(release_name: str, namespace: str) -> dict: + """Get status of an installed Helm release. + + Args: + release_name: Name of the release. + namespace: The namespace where the release is installed. + """ + release = next( + (r for r in MOCK_RELEASES if r["name"] == release_name and r["namespace"] == namespace), + None, + ) + if release: + return { + "release": release_name, + "namespace": namespace, + "status": release, + } + return { + "release": release_name, + "namespace": namespace, + "error": f"Release '{release_name}' not found in namespace '{namespace}'", + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. 
api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": 
"500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + 
"status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call 
last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": 
"Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: 
Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": 
"api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/buildconfig.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/buildconfig.yaml.template new file mode 100644 index 00000000..b3294eb2 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/buildconfig.yaml.template @@ -0,0 +1,38 @@ +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: build + app.kubernetes.io/part-of: ${APP_NAME} +spec: + source: + type: Git + git: + uri: ${GIT_URL} + ref: ${GIT_BRANCH} + strategy: + type: Source + sourceStrategy: + from: + kind: DockerImage + name: ${BUILDER_IMAGE} + env: [] + output: + to: + kind: ImageStreamTag + name: ${APP_NAME}:latest + triggers: + - type: ConfigChange + - type: ImageChange + runPolicy: Serial + resources: + limits: + memory: "1Gi" + cpu: "1" + requests: + memory: "512Mi" + cpu: "500m" diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/deployment.yaml.template new file mode 100644 index 00000000..eb3b481a --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: application + 
app.kubernetes.io/part-of: ${APP_NAME} + annotations: + image.openshift.io/triggers: | + [{"from":{"kind":"ImageStreamTag","name":"${APP_NAME}:latest"},"fieldPath":"spec.template.spec.containers[0].image"}] +spec: + replicas: ${REPLICAS} + selector: + matchLabels: + app: ${APP_NAME} + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + template: + metadata: + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + spec: + containers: + - name: ${APP_NAME} + image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/${APP_NAME}:latest + ports: + - containerPort: ${CONTAINER_PORT} + protocol: TCP + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + livenessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + env: [] + restartPolicy: Always + terminationGracePeriodSeconds: 30 diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/Chart.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/Chart.yaml.template new file mode 100644 index 00000000..1aa22dd1 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/Chart.yaml.template @@ -0,0 +1,13 @@ +apiVersion: v2 +name: ${APP_NAME} +description: ${APP_DESCRIPTION} +type: application +version: 0.1.0 +appVersion: "${APP_VERSION}" +keywords: + - ${LANGUAGE} + - ${FRAMEWORK} + - openshift +maintainers: + - name: ${MAINTAINER_NAME} + email: ${MAINTAINER_EMAIL} diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/NOTES.txt.template 
b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/NOTES.txt.template new file mode 100644 index 00000000..154e628d --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/NOTES.txt.template @@ -0,0 +1,32 @@ +Congratulations! Your application {{ include "${APP_NAME}.fullname" . }} has been deployed. + +{{- if .Values.route.enabled }} + +Access your application at: +{{- if .Values.route.host }} + https://{{ .Values.route.host }} +{{- else }} + Run: oc get route {{ include "${APP_NAME}.fullname" . }} -o jsonpath='{.spec.host}' +{{- end }} + +{{- else }} + +Your application is available internally at: + {{ include "${APP_NAME}.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.service.port }} + +To expose it externally, create a Route or set route.enabled=true. + +{{- end }} + +Useful commands: + # View pods + oc get pods -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} + + # View logs + oc logs -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} -f + + # Upgrade release + helm upgrade {{ .Release.Name }} ./{{ .Chart.Name }} -f values.yaml + + # Uninstall release + helm uninstall {{ .Release.Name }} diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/_helpers.tpl.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/_helpers.tpl.template new file mode 100644 index 00000000..15873b10 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/_helpers.tpl.template @@ -0,0 +1,60 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "${APP_NAME}.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. 
+*/}} +{{- define "${APP_NAME}.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "${APP_NAME}.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "${APP_NAME}.labels" -}} +helm.sh/chart: {{ include "${APP_NAME}.chart" . }} +{{ include "${APP_NAME}.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "${APP_NAME}.selectorLabels" -}} +app.kubernetes.io/name: {{ include "${APP_NAME}.name" . }} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "${APP_NAME}.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "${APP_NAME}.fullname" .) 
.Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/deployment.yaml.template new file mode 100644 index 00000000..a6cbd868 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + selector: + matchLabels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "${APP_NAME}.serviceAccountName" . }} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 8 }} + containers: + - name: {{ .Chart.Name }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + ports: + - name: http + containerPort: {{ .Values.service.port }} + protocol: TCP + livenessProbe: + {{- toYaml .Values.livenessProbe | nindent 12 }} + readinessProbe: + {{- toYaml .Values.readinessProbe | nindent 12 }} + resources: + {{- toYaml .Values.resources | nindent 12 }} + {{- with .Values.env }} + env: + {{- toYaml . 
| nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/route.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/route.yaml.template new file mode 100644 index 00000000..e2bab29a --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/route.yaml.template @@ -0,0 +1,24 @@ +{{- if .Values.route.enabled }} +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if .Values.route.host }} + host: {{ .Values.route.host }} + {{- end }} + to: + kind: Service + name: {{ include "${APP_NAME}.fullname" . }} + weight: 100 + port: + targetPort: http + {{- with .Values.route.tls }} + tls: + termination: {{ .termination }} + insecureEdgeTerminationPolicy: {{ .insecureEdgeTerminationPolicy }} + {{- end }} + wildcardPolicy: None +{{- end }} diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/service.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/service.yaml.template new file mode 100644 index 00000000..837bc888 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/templates/service.yaml.template @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . 
| nindent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + - port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + name: http + selector: + {{- include "${APP_NAME}.selectorLabels" . | nindent 4 }} diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/values.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/values.yaml.template new file mode 100644 index 00000000..1cca6017 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/helm/values.yaml.template @@ -0,0 +1,67 @@ +# Default values for ${APP_NAME} +replicaCount: 1 + +image: + repository: ${IMAGE_REPOSITORY} + pullPolicy: IfNotPresent + tag: "${IMAGE_TAG}" + +imagePullSecrets: [] +nameOverride: "" +fullnameOverride: "" + +serviceAccount: + create: true + annotations: {} + name: "" + +podAnnotations: {} +podSecurityContext: {} +securityContext: {} + +service: + type: ClusterIP + port: ${CONTAINER_PORT} + +route: + enabled: true + host: "" + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 5 + periodSeconds: 5 + +autoscaling: + enabled: false + minReplicas: 1 + maxReplicas: 5 + targetCPUUtilizationPercentage: 80 + +nodeSelector: {} +tolerations: [] +affinity: {} + +env: [] +# - name: MY_VAR +# value: "my-value" diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/imagestream.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/imagestream.yaml.template new file mode 100644 index 00000000..46572193 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/imagestream.yaml.template @@ -0,0 +1,13 
@@ +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: image + app.kubernetes.io/part-of: ${APP_NAME} +spec: + lookupPolicy: + local: false diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/route.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/route.yaml.template new file mode 100644 index 00000000..7c53d2e7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/route.yaml.template @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: route + app.kubernetes.io/part-of: ${APP_NAME} +spec: + to: + kind: Service + name: ${APP_NAME} + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/service.yaml.template b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/service.yaml.template new file mode 100644 index 00000000..7e1cf371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/service.yaml.template @@ -0,0 +1,20 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: service + app.kubernetes.io/part-of: ${APP_NAME} +spec: + selector: + app: ${APP_NAME} + ports: + - name: http + port: ${CONTAINER_PORT} + targetPort: ${CONTAINER_PORT} + protocol: TCP + type: ClusterIP + sessionAffinity: None diff --git 
a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootful.service b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootful.service new file mode 100644 index 00000000..c1e8fe8f --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootful.service @@ -0,0 +1,27 @@ +# Rootful Podman container managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootless.service b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootless.service new file mode 100644 index 00000000..ca9dc371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootless.service @@ -0,0 +1,27 @@ +# Rootless Podman container managed by systemd (user service) +# Location: ~/.config/systemd/user/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + 
+[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=default.target diff --git a/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-native.service b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-native.service new file mode 100644 index 00000000..c55cfc07 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-native.service @@ -0,0 +1,39 @@ +# Native application managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${SERVICE_USER} - User to run the service as +# ${APP_PATH} - Application install path (e.g., /opt/app-name) +# ${PORT} - Application listen port +# ${START_COMMAND} - Application start command +# +# Start command examples by language: +# Node.js: /usr/bin/node ${APP_PATH}/server.js +# Python: /usr/bin/python3 ${APP_PATH}/app.py +# Java: /usr/bin/java -jar ${APP_PATH}/app.jar +# Go: ${APP_PATH}/binary-name + +[Unit] +Description=${APP_NAME} Service +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=${SERVICE_USER} +WorkingDirectory=${APP_PATH} +Environment=PORT=${PORT} +ExecStart=${START_COMMAND} +Restart=always +RestartSec=5 + +# Security hardening +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +ReadWritePaths=${APP_PATH} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__helm-deploy/instruction.md b/evaluation/with_skills/rh-developer__helm-deploy/instruction.md new file mode 100644 index 00000000..5ea35a0f --- /dev/null +++ 
b/evaluation/with_skills/rh-developer__helm-deploy/instruction.md @@ -0,0 +1,12 @@ +# Helm Deployment Task + +You are a Red Hat developer. Plan the deployment of an application using Helm charts on OpenShift. + +## Requirements +- Evaluate or create a Helm chart structure +- Configure values for the target environment +- Address OpenShift-specific considerations + +Use MCP tools to examine the cluster. Document your methodology, chart configuration, and deployment plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__helm-deploy/solution/solve.sh b/evaluation/with_skills/rh-developer__helm-deploy/solution/solve.sh new file mode 100644 index 00000000..caf0f768 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/solution/solve.sh @@ -0,0 +1,31 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Helm Deployment Plan + +## Chart Location +Searched: ./Chart.yaml, ./chart/Chart.yaml, ./charts/*/Chart.yaml, ./helm/Chart.yaml +Found: `./chart/Chart.yaml` + +## Values Override +```yaml +replicaCount: 2 +image: + repository: image-registry.openshift-image-registry.svc:5000/myproject/myapp + tag: latest +service: + port: 8080 +resources: + limits: + memory: 512Mi +``` + +## Deploy Command +```bash +helm install myapp ./chart/ -f values-override.yaml -n myproject +``` + +## Quick Commands +helm status myapp -n myproject +helm history myapp -n myproject +helm rollback myapp 1 -n myproject +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__helm-deploy/task.toml b/evaluation/with_skills/rh-developer__helm-deploy/task.toml new file mode 100644 index 00000000..89f35c82 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__helm-deploy" +name = "rh-developer Helm Deployment Skill Evaluation" +difficulty = "medium" +category = 
"per-skill-eval" +tags = ["rh-developer", "helm-deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__helm-deploy/tests/llm_judge.py b/evaluation/with_skills/rh-developer__helm-deploy/tests/llm_judge.py new file mode 100644 index 00000000..5632c542 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/tests/llm_judge.py @@ -0,0 +1,102 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "openshift_helm_considerations", + "file": "/root/report.md", + "question": "Does the report address OpenShift-specific Helm concerns like Route vs Ingress and SecurityContextConstraints?", + "reference": "A skilled report addresses that OpenShift uses Routes and has SCC requirements that may affect Helm charts designed for vanilla Kubernetes. An unskilled report treats the chart as platform-agnostic." + }, + { + "id": "buildconfig_in_chart", + "file": "/root/report.md", + "question": "Does the report describe including an OpenShift BuildConfig template as part of the Helm chart structure, so that the chart manages the build pipeline alongside the deployment?", + "reference": "A skilled report includes a BuildConfig YAML template inside the Helm chart (e.g., templates/buildconfig.yaml) for S2I builds. An unskilled report assumes pre-built images and does not integrate build pipelines into the chart." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__helm-deploy/tests/test.sh b/evaluation/with_skills/rh-developer__helm-deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +# Quote the version specifier: unquoted, the shell parses ">" as a redirect to a file named "=0.75.0". +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__helm-deploy/tests/test_outputs.py b/evaluation/with_skills/rh-developer__helm-deploy/tests/test_outputs.py new file mode 100644 index 00000000..2f4af59c --- /dev/null +++ b/evaluation/with_skills/rh-developer__helm-deploy/tests/test_outputs.py @@ -0,0 +1,61 @@ +""" +Tests for rh-developer__helm-deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: OpenShift-Helm integration (not generic Helm knowledge). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_helm(self): + content = read_report().lower() + assert "helm" in content, "report should mention Helm" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 100, "report should have substantial content" + + +class TestSkillDependent: + def test_values_customization(self): + """Customizing values before deployment.""" + c = read_report().lower() + assert any(t in c for t in ["values", "override", "set", "customize"]) and any(t in c for t in [ + "install", "upgrade", "deploy" + ]), "should address values customization" + + def test_openshift_considerations(self): + """OpenShift-specific Helm considerations (Route, SCC).""" + c = read_report().lower() + assert any(t in c for t in ["openshift", "route", "scc", "security"]), ( + "should address OpenShift-specific Helm concerns" + ) + + def test_buildconfig_integration(self): + """OpenShift BuildConfig integration in Helm charts for S2I builds. 
+ Without skill, agents use static image references.""" + c = read_report() + assert "BuildConfig" in c or "buildconfig" in c.lower() or "build.openshift.io" in c, ( + "should address OpenShift BuildConfig integration in Helm deployment" + ) + + def test_s2i_in_helm_chart(self): + """OpenShift S2I build integration as part of the Helm chart, + so the chart manages both the build and deploy lifecycle.""" + c = read_report().lower() + assert ("s2i" in c or "source-to-image" in c or "source to image" in c) and ( + "helm" in c or "chart" in c or "template" in c + ), "should integrate S2I builds within the Helm chart structure" diff --git a/evaluation/with_skills/rh-developer__recommend-image/environment/Dockerfile b/evaluation/with_skills/rh-developer__recommend-image/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__recommend-image/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs 
+COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__recommend-image/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__recommend-image/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__recommend-image/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__recommend-image/instruction.md b/evaluation/with_skills/rh-developer__recommend-image/instruction.md new file mode 100644 index 00000000..7d5e0138 --- /dev/null +++ b/evaluation/with_skills/rh-developer__recommend-image/instruction.md @@ -0,0 +1,13 @@ +# Image Recommendation Task + +You are a Red Hat developer. Your team is choosing a container base image for a production Python application. The image must be secure, supported, and appropriately sized. + +## Requirements +- Evaluate the available base images that support the application's language and runtime +- Compare at least two candidate images on: security posture (CVE exposure, update cadence), image size, vendor support lifecycle, and compatibility with the application's dependencies +- Recommend a specific image with clear justification for why it is the best fit +- Note any trade-offs or caveats with the recommendation (e.g., larger size for better compatibility) + +Document your analysis and recommendation in `/root/report.md`. + +Use available tools to examine the environment. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__recommend-image/solution/solve.sh b/evaluation/with_skills/rh-developer__recommend-image/solution/solve.sh new file mode 100644 index 00000000..ccbb9f6c --- /dev/null +++ b/evaluation/with_skills/rh-developer__recommend-image/solution/solve.sh @@ -0,0 +1,18 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Image Recommendations + +## Use Case Assessment +Production: prefer Minimal/Runtime. 
Development: prefer Full variant. + +## 1. Python 3.11 Flask API +**Image**: `registry.access.redhat.com/ubi9/python-311` +**Variant**: Full (build tools needed for pip install) +**Verify**: `skopeo inspect docker://registry.access.redhat.com/ubi9/python-311` + +## 2. Java 17 Quarkus (pre-built JAR) +**Image**: `registry.access.redhat.com/ubi9/openjdk-17-runtime` +**Variant**: Runtime (no build tools, smaller attack surface, faster startup) +**Rationale**: Pre-built JAR doesn't need compilation tools. Runtime variant is ~60% smaller. Security: reduced attack surface. +**Verify**: `skopeo inspect docker://registry.access.redhat.com/ubi9/openjdk-17-runtime` +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__recommend-image/task.toml b/evaluation/with_skills/rh-developer__recommend-image/task.toml new file mode 100644 index 00000000..2888fbf5 --- /dev/null +++ b/evaluation/with_skills/rh-developer__recommend-image/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__recommend-image" +name = "rh-developer Image Recommendation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "recommend-image", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__recommend-image/tests/llm_judge.py b/evaluation/with_skills/rh-developer__recommend-image/tests/llm_judge.py new file mode 100644 index 00000000..1d03045e --- /dev/null +++ b/evaluation/with_skills/rh-developer__recommend-image/tests/llm_judge.py @@ -0,0 +1,102 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic 
package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "remote_image_inspection", + "file": "/root/report.md", + "question": "Does the report discuss a remote image inspection approach (such as skopeo inspect docker://) for querying image metadata without pulling the full image?", + "reference": "A skilled report discusses using skopeo or a similar remote inspection approach to verify image metadata (size, architecture, build date) without pulling. If skopeo is unavailable, the report should still mention it as the recommended tool or note that static reference data was used instead. An unskilled report only considers pulling images locally with podman/docker." + }, + { + "id": "variant_tradeoffs", + "file": "/root/report.md", + "question": "Does the report compare at least two image variant categories (e.g., Full/build-tools vs Minimal/secure vs Runtime/smallest) with explicit trade-offs for each?", + "reference": "A skilled report distinguishes image variant categories and explains trade-offs (size vs tools vs security). An unskilled report recommends one image without comparing alternatives." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__recommend-image/tests/test.sh b/evaluation/with_skills/rh-developer__recommend-image/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__recommend-image/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + 
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-developer__recommend-image/tests/test_outputs.py b/evaluation/with_skills/rh-developer__recommend-image/tests/test_outputs.py new file mode 100644 index 00000000..00dfabc3 --- /dev/null +++ b/evaluation/with_skills/rh-developer__recommend-image/tests/test_outputs.py @@ -0,0 +1,66 @@ +""" +Tests for rh-developer__recommend-image per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_image(self): + content = read_report().lower() + assert "image" in content, "report should mention container images" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 100, "report should have substantial content" + + +class TestSkillDependent: + def test_remote_image_inspection_approach(self): + """Skill teaches skopeo inspect docker:// for remote image inspection. + Without skill, agents only consider local podman/docker pull.""" + c = read_report().lower() + assert any(t in c for t in [ + "skopeo", "remote inspect", "registry inspect", + "docker://", "image metadata", "without pulling" + ]), "should discuss remote image inspection approach (e.g., skopeo, registry API)" + + def test_image_variant_categories(self): + """Skill teaches three variant categories: Full (build tools), Minimal + (smaller/secure), Runtime (smallest, no build tools). 
Without skill, + agents don't distinguish these categories.""" + c = read_report().lower() + variants = ["full", "minimal", "runtime"] + mentioned = sum(1 for v in variants if v in c) + assert mentioned >= 2, ( + "should compare image variant categories (Full, Minimal, Runtime)" + ) + + def test_security_data_awareness(self): + """Skill teaches Red Hat Security Data API for CVE/security status per image. + Without skill, agents skip security posture evaluation.""" + c = read_report().lower() + assert any(t in c for t in ["security data", "cve", "vulnerability", "security api"]) and any(t in c for t in [ + "image", "scan", "check", "posture", "red hat" + ]), "should address security/CVE posture for image selection" + + def test_ubi_registry_awareness(self): + """Skill teaches UBI images from registry.access.redhat.com.""" + c = read_report().lower() + assert any(t in c for t in ["ubi", "red hat", "registry"]) and any(t in c for t in [ + "python", "node", "java", "image" + ]), "should recommend UBI images from Red Hat registry" diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/Dockerfile b/evaluation/with_skills/rh-developer__rhel-deploy/environment/Dockerfile new file mode 100644 index 00000000..e5e4879b --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/Dockerfile @@ -0,0 +1,74 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ 
+users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhel-host": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhel-host-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. 
web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + 
"strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": 
"CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + 
"Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": 
"image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + 
"Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + 
}, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-rhel-host-mcp.py b/evaluation/with_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-rhel-host-mcp.py new file mode 100644 index 00000000..f10dd2f8 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-rhel-host-mcp.py @@ -0,0 +1,230 @@ +#!/usr/bin/env python3 +""" +Mock RHEL Host MCP Server for rh-developer rhel-deploy benchmark task. + +Simulates a RHEL 9.3 host with Podman 4.9.4 for container deployment planning. +Scenario: Deploy a Flask app container as a systemd service on port 8080. +""" + +from typing import Optional + +from fastmcp import FastMCP + +mcp = FastMCP("rhel-host") + +# Mock state +MOCK_SYSTEM_INFO = { + "os": "Red Hat Enterprise Linux 9.3 (Plow)", + "kernel": "5.14.0-362.18.1.el9_3.x86_64", + "architecture": "x86_64", + "podman_version": "podman version 4.9.4", + "selinux": "Enforcing", + "firewall": "running", +} + +MOCK_OPEN_PORTS = {8080} # Port 8080 opened for Flask app +MOCK_SERVICES = { + "flask-app": { + "name": "flask-app", + "active": "active", + "state": "running", + "enabled": True, + "description": "Flask application container", + }, + "container-flask-app": { + "name": "container-flask-app", + "active": "active", + "state": "running", + "enabled": True, + "description": "Podman container flask-app.service", + }, +} + +MOCK_PODMAN_PS = """CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +a1b2c3d4e5f6 quay.io/ubi9/python-311:latest flask run 2 hours ago Up 2 hours ago 0.0.0.0:8080->8080/tcp flask-app +""" + +MOCK_PODMAN_INSPECT = """[ + { + "Id": "a1b2c3d4e5f6", + "Name": 
"flask-app", + "State": { + "Status": "running", + "Running": true + }, + "Config": { + "Image": "quay.io/ubi9/python-311:latest", + "Cmd": ["flask", "run", "--host=0.0.0.0", "--port=8080"] + }, + "HostConfig": { + "PortBindings": { + "8080/tcp": [{"HostPort": "8080"}] + } + } + } +] +""" + + +def _match_command(cmd: str) -> Optional[str]: + """Return the handler category for a shell command, or None if it is unsupported.""" + cmd_lower = cmd.strip().lower() + if "podman pull" in cmd_lower: + return "podman_pull" + if "podman run" in cmd_lower: + return "podman_run" + if "podman ps" in cmd_lower: + return "podman_ps" + if "podman inspect" in cmd_lower: + return "podman_inspect" + if "systemctl enable" in cmd_lower: + return "systemctl_enable" + if "systemctl start" in cmd_lower: + return "systemctl_start" + if "systemctl status" in cmd_lower: + return "systemctl_status" + if "firewall-cmd" in cmd_lower: + return "firewall_cmd" + if "semanage fcontext" in cmd_lower: + return "semanage_fcontext" + if "restorecon" in cmd_lower: + return "restorecon" + return None + + +@mcp.tool +def run_command(command: str) -> dict: + """Simulate running a shell command on a RHEL host. + + Supports common deployment patterns: podman, systemctl, firewall-cmd, semanage. + Returns canned output for supported commands and a non-zero exit code for anything else. + + Args: + command: The shell command to execute (e.g. 'podman ps', 'systemctl status flask-app'). 
+ """ + kind = _match_command(command) + if kind == "podman_pull": + return { + "command": command, + "exit_code": 0, + "stdout": "Trying to pull quay.io/ubi9/python-311:latest...\nGetting image source signatures\nCopying blob sha256:...\nCopying config sha256:...\nWriting manifest to image destination\nStoring signatures\n", + "stderr": "", + } + if kind == "podman_run": + return { + "command": command, + "exit_code": 0, + "stdout": "a1b2c3d4e5f6", + "stderr": "", + } + if kind == "podman_ps": + return { + "command": command, + "exit_code": 0, + "stdout": MOCK_PODMAN_PS, + "stderr": "", + } + if kind == "podman_inspect": + return { + "command": command, + "exit_code": 0, + "stdout": MOCK_PODMAN_INSPECT, + "stderr": "", + } + if kind == "systemctl_enable": + return { + "command": command, + "exit_code": 0, + "stdout": "", + "stderr": "", + } + if kind == "systemctl_start": + return { + "command": command, + "exit_code": 0, + "stdout": "", + "stderr": "", + } + if kind == "systemctl_status": + return { + "command": command, + "exit_code": 0, + "stdout": """● flask-app.service - Flask application container + Loaded: loaded (/etc/systemd/system/flask-app.service; enabled) + Active: active (running) since Tue 2026-03-17 10:00:00 UTC; 2h ago + Main PID: 1234 (conmon) + Tasks: 8 + Memory: 128.0M + CGroup: /system.slice/flask-app.service +""", + "stderr": "", + } + if kind == "firewall_cmd": + return { + "command": command, + "exit_code": 0, + "stdout": "success\n", + "stderr": "", + } + if kind == "semanage_fcontext": + return { + "command": command, + "exit_code": 0, + "stdout": "", + "stderr": "", + } + if kind == "restorecon": + return { + "command": command, + "exit_code": 0, + "stdout": "", + "stderr": "", + } + return { + "command": command, + "exit_code": 1, + "stdout": "", + "stderr": "Error: Unknown or unsupported command. 
Supported: podman pull/run/ps/inspect, systemctl enable/start/status, firewall-cmd, semanage fcontext, restorecon.", + } + + +@mcp.tool +def get_system_info() -> dict: + """Return RHEL version, architecture, and Podman version for the target host.""" + return MOCK_SYSTEM_INFO.copy() + + +@mcp.tool +def check_service(name: str) -> dict: + """Return systemd service status for a given service name. + + Args: + name: Service name (e.g. 'flask-app', 'container-flask-app'). + """ + svc = MOCK_SERVICES.get(name) + if svc: + return {"service": name, "status": svc, "found": True} + return { + "service": name, + "found": False, + "error": f"Service '{name}' not found. Known services: {list(MOCK_SERVICES.keys())}", + } + + +@mcp.tool +def check_port(port: int) -> dict: + """Return whether a port is open in the firewall. + + Args: + port: Port number to check (e.g. 8080). + """ + open_port = port in MOCK_OPEN_PORTS + return { + "port": port, + "open": open_port, + "message": f"Port {port} is {'open' if open_port else 'closed'} in firewall.", + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/buildconfig.yaml.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/buildconfig.yaml.template new file mode 100644 index 00000000..b3294eb2 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/buildconfig.yaml.template @@ -0,0 +1,38 @@ +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: build + app.kubernetes.io/part-of: ${APP_NAME} +spec: + source: + type: Git + git: + uri: ${GIT_URL} + ref: ${GIT_BRANCH} + strategy: + type: Source + sourceStrategy: + from: + kind: DockerImage + name: ${BUILDER_IMAGE} + env: [] + output: + to: + kind: ImageStreamTag + name: ${APP_NAME}:latest + triggers: 
+ - type: ConfigChange + - type: ImageChange + runPolicy: Serial + resources: + limits: + memory: "1Gi" + cpu: "1" + requests: + memory: "512Mi" + cpu: "500m" diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/deployment.yaml.template new file mode 100644 index 00000000..eb3b481a --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: application + app.kubernetes.io/part-of: ${APP_NAME} + annotations: + image.openshift.io/triggers: | + [{"from":{"kind":"ImageStreamTag","name":"${APP_NAME}:latest"},"fieldPath":"spec.template.spec.containers[0].image"}] +spec: + replicas: ${REPLICAS} + selector: + matchLabels: + app: ${APP_NAME} + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + template: + metadata: + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + spec: + containers: + - name: ${APP_NAME} + image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/${APP_NAME}:latest + ports: + - containerPort: ${CONTAINER_PORT} + protocol: TCP + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + livenessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + env: [] + restartPolicy: Always + terminationGracePeriodSeconds: 30 diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/Chart.yaml.template 
b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/Chart.yaml.template new file mode 100644 index 00000000..1aa22dd1 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/Chart.yaml.template @@ -0,0 +1,13 @@ +apiVersion: v2 +name: ${APP_NAME} +description: ${APP_DESCRIPTION} +type: application +version: 0.1.0 +appVersion: "${APP_VERSION}" +keywords: + - ${LANGUAGE} + - ${FRAMEWORK} + - openshift +maintainers: + - name: ${MAINTAINER_NAME} + email: ${MAINTAINER_EMAIL} diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/NOTES.txt.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/NOTES.txt.template new file mode 100644 index 00000000..154e628d --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/NOTES.txt.template @@ -0,0 +1,32 @@ +Congratulations! Your application {{ include "${APP_NAME}.fullname" . }} has been deployed. + +{{- if .Values.route.enabled }} + +Access your application at: +{{- if .Values.route.host }} + https://{{ .Values.route.host }} +{{- else }} + Run: oc get route {{ include "${APP_NAME}.fullname" . }} -o jsonpath='{.spec.host}' +{{- end }} + +{{- else }} + +Your application is available internally at: + {{ include "${APP_NAME}.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.service.port }} + +To expose it externally, create a Route or set route.enabled=true. + +{{- end }} + +Useful commands: + # View pods + oc get pods -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} + + # View logs + oc logs -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . 
}} -f + + # Upgrade release + helm upgrade {{ .Release.Name }} ./{{ .Chart.Name }} -f values.yaml + + # Uninstall release + helm uninstall {{ .Release.Name }} diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/_helpers.tpl.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/_helpers.tpl.template new file mode 100644 index 00000000..15873b10 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/_helpers.tpl.template @@ -0,0 +1,60 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "${APP_NAME}.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. +*/}} +{{- define "${APP_NAME}.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "${APP_NAME}.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "${APP_NAME}.labels" -}} +helm.sh/chart: {{ include "${APP_NAME}.chart" . }} +{{ include "${APP_NAME}.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "${APP_NAME}.selectorLabels" -}} +app.kubernetes.io/name: {{ include "${APP_NAME}.name" . 
}} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "${APP_NAME}.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "${APP_NAME}.fullname" .) .Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/deployment.yaml.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/deployment.yaml.template new file mode 100644 index 00000000..a6cbd868 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + selector: + matchLabels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "${APP_NAME}.serviceAccountName" . 
}} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 8 }} + containers: + - name: {{ .Chart.Name }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + ports: + - name: http + containerPort: {{ .Values.service.port }} + protocol: TCP + livenessProbe: + {{- toYaml .Values.livenessProbe | nindent 12 }} + readinessProbe: + {{- toYaml .Values.readinessProbe | nindent 12 }} + resources: + {{- toYaml .Values.resources | nindent 12 }} + {{- with .Values.env }} + env: + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/route.yaml.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/route.yaml.template new file mode 100644 index 00000000..e2bab29a --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/route.yaml.template @@ -0,0 +1,24 @@ +{{- if .Values.route.enabled }} +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if .Values.route.host }} + host: {{ .Values.route.host }} + {{- end }} + to: + kind: Service + name: {{ include "${APP_NAME}.fullname" . 
}} + weight: 100 + port: + targetPort: http + {{- with .Values.route.tls }} + tls: + termination: {{ .termination }} + insecureEdgeTerminationPolicy: {{ .insecureEdgeTerminationPolicy }} + {{- end }} + wildcardPolicy: None +{{- end }} diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/service.yaml.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/service.yaml.template new file mode 100644 index 00000000..837bc888 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/service.yaml.template @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + - port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + name: http + selector: + {{- include "${APP_NAME}.selectorLabels" . | nindent 4 }} diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/values.yaml.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/values.yaml.template new file mode 100644 index 00000000..1cca6017 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/helm/values.yaml.template @@ -0,0 +1,67 @@ +# Default values for ${APP_NAME} +replicaCount: 1 + +image: + repository: ${IMAGE_REPOSITORY} + pullPolicy: IfNotPresent + tag: "${IMAGE_TAG}" + +imagePullSecrets: [] +nameOverride: "" +fullnameOverride: "" + +serviceAccount: + create: true + annotations: {} + name: "" + +podAnnotations: {} +podSecurityContext: {} +securityContext: {} + +service: + type: ClusterIP + port: ${CONTAINER_PORT} + +route: + enabled: true + host: "" + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: 
"500m" + +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 5 + periodSeconds: 5 + +autoscaling: + enabled: false + minReplicas: 1 + maxReplicas: 5 + targetCPUUtilizationPercentage: 80 + +nodeSelector: {} +tolerations: [] +affinity: {} + +env: [] +# - name: MY_VAR +# value: "my-value" diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/imagestream.yaml.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/imagestream.yaml.template new file mode 100644 index 00000000..46572193 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/imagestream.yaml.template @@ -0,0 +1,13 @@ +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: image + app.kubernetes.io/part-of: ${APP_NAME} +spec: + lookupPolicy: + local: false diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/route.yaml.template b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/route.yaml.template new file mode 100644 index 00000000..7c53d2e7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/route.yaml.template @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: route + app.kubernetes.io/part-of: ${APP_NAME} +spec: + to: + kind: Service + name: ${APP_NAME} + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/service.yaml.template 
b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/service.yaml.template new file mode 100644 index 00000000..7e1cf371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/service.yaml.template @@ -0,0 +1,20 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: service + app.kubernetes.io/part-of: ${APP_NAME} +spec: + selector: + app: ${APP_NAME} + ports: + - name: http + port: ${CONTAINER_PORT} + targetPort: ${CONTAINER_PORT} + protocol: TCP + type: ClusterIP + sessionAffinity: None diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootful.service b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootful.service new file mode 100644 index 00000000..c1e8fe8f --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootful.service @@ -0,0 +1,27 @@ +# Rootful Podman container managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootless.service 
b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootless.service new file mode 100644 index 00000000..ca9dc371 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootless.service @@ -0,0 +1,27 @@ +# Rootless Podman container managed by systemd (user service) +# Location: ~/.config/systemd/user/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=default.target diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-native.service b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-native.service new file mode 100644 index 00000000..c55cfc07 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-native.service @@ -0,0 +1,39 @@ +# Native application managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${SERVICE_USER} - User to run the service as +# ${APP_PATH} - Application install path (e.g., /opt/app-name) +# ${PORT} - Application listen port +# ${START_COMMAND} - Application start command +# +# Start command examples by language: +# Node.js: /usr/bin/node ${APP_PATH}/server.js +# Python: /usr/bin/python3 ${APP_PATH}/app.py +# Java: /usr/bin/java -jar 
${APP_PATH}/app.jar +# Go: ${APP_PATH}/binary-name + +[Unit] +Description=${APP_NAME} Service +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=${SERVICE_USER} +WorkingDirectory=${APP_PATH} +Environment=PORT=${PORT} +ExecStart=${START_COMMAND} +Restart=always +RestartSec=5 + +# Security hardening +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +ReadWritePaths=${APP_PATH} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/instruction.md b/evaluation/with_skills/rh-developer__rhel-deploy/instruction.md new file mode 100644 index 00000000..b7c3a70e --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/instruction.md @@ -0,0 +1,12 @@ +# RHEL Deployment Task + +You are a Red Hat developer. Plan the deployment of a containerized application on RHEL using Podman and systemd. + +## Requirements +- Configure the container to run as a systemd service +- Address security hardening (SELinux, privilege restrictions) +- Include volume mounts and networking configuration + +Use available tools to examine the environment. Document your methodology, configuration, and deployment plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
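The systemd templates earlier in this diff list `${APP_NAME}`, `${PORT}`, and `${IMAGE}` under "Variables to replace" but do not say how the substitution is done. The sketch below shows one possible way, using Python's standard `string.Template`, whose placeholder syntax happens to match. The shortened unit text and the `render_unit` helper are illustrative assumptions, not part of this change set.

```python
from string import Template

# Hypothetical, shortened copy of the rootless Podman unit template;
# the real template in this diff has more directives.
ROOTLESS_UNIT = Template(
    "[Unit]\n"
    "Description=${APP_NAME} Container\n"
    "\n"
    "[Service]\n"
    "ExecStart=/usr/bin/podman run --name ${APP_NAME} -p ${PORT}:${PORT} --rm ${IMAGE}\n"
    "\n"
    "[Install]\n"
    "WantedBy=default.target\n"
)


def render_unit(app_name: str, port: int, image: str) -> str:
    """Fill every ${...} placeholder; substitute() raises KeyError if one is left unset."""
    return ROOTLESS_UNIT.substitute(APP_NAME=app_name, PORT=port, IMAGE=image)


if __name__ == "__main__":
    # Example values only; they mirror the names used in solve.sh.
    print(render_unit("flask-app", 8080, "flask-app:latest"))
```

`substitute()` is used rather than `safe_substitute()` so that a forgotten variable fails loudly instead of leaving a literal `${...}` in the unit file.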
diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/solution/solve.sh b/evaluation/with_skills/rh-developer__rhel-deploy/solution/solve.sh new file mode 100644 index 00000000..cf537860 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/solution/solve.sh @@ -0,0 +1,43 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# RHEL Deployment Plan + +## Rootless Podman Setup +```bash +sudo useradd -m appuser +sudo loginctl enable-linger appuser +``` + +## Container Run +```bash +podman run -d --name flask-app -p 8080:5000 -v /opt/app-data:/data:z flask-app:latest +``` + +## Systemd Service +Path: `~/.config/systemd/user/flask-app.service` +```ini +[Unit] +Description=Flask App Container +[Service] +ExecStart=/usr/bin/podman run --rm --name flask-app -p 8080:5000 -v /opt/app-data:/data:Z flask-app:latest +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +[Install] +WantedBy=default.target +``` + +## Firewall +```bash +sudo firewall-cmd --permanent --add-port=8080/tcp +sudo firewall-cmd --reload +``` + +## SELinux +```bash +sudo semanage port -a -t http_port_t -p tcp 8080 +sudo semanage fcontext -a -t container_file_t '/opt/app-data(/.*)?' 
+sudo restorecon -Rv /opt/app-data +``` +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/task.toml b/evaluation/with_skills/rh-developer__rhel-deploy/task.toml new file mode 100644 index 00000000..0ac61da9 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__rhel-deploy" +name = "rh-developer RHEL Deployment Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "rhel-deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/tests/llm_judge.py b/evaluation/with_skills/rh-developer__rhel-deploy/tests/llm_judge.py new file mode 100644 index 00000000..5d7ba0df --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "selinux_volume_labels", + "file": "/root/report.md", + "question": "Does the report explain SELinux volume labels :z (shared, multi-container) and :Z (private) for Podman bind mounts?", + "reference": "A skilled report uses :z or :Z suffixes on volume mounts and explains the difference. An unskilled report skips SELinux mount context." 
+ }, + { + "id": "rootless_systemd", + "file": "/root/report.md", + "question": "Does the report address rootless systemd service configuration (~/.config/systemd/user/) and loginctl enable-linger?", + "reference": "A skilled report shows the rootless systemd path and explains enable-linger for services to survive logout. An unskilled report only shows rootful /etc/systemd/system/ paths." + }, + { + "id": "semanage_fcontext_restorecon", + "file": "/root/report.md", + "question": "Does the report use semanage fcontext + restorecon for setting SELinux file contexts on application directories?", + "reference": "A skilled report uses 'semanage fcontext -a -t bin_t' plus 'restorecon -Rv' for app files. An unskilled report skips file-level SELinux context management." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/tests/test.sh b/evaluation/with_skills/rh-developer__rhel-deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__rhel-deploy/tests/test_outputs.py b/evaluation/with_skills/rh-developer__rhel-deploy/tests/test_outputs.py new file mode 100644 index 00000000..b4a1c092 --- /dev/null +++ b/evaluation/with_skills/rh-developer__rhel-deploy/tests/test_outputs.py @@ -0,0 +1,98 @@ +""" +Tests for rh-developer__rhel-deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_rhel_or_podman(self): + content = read_report().lower() + assert "rhel" in content or "podman" in content, "report should mention RHEL or Podman" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_selinux_volume_labels(self): + """Skill teaches SELinux volume labels: :z = shared (relabeled for multi-container), + :Z = private. Without skill, agents skip SELinux mount context.""" + c = read_report() + assert ":z" in c or ":Z" in c or "selinux" in c.lower(), ( + "should address SELinux volume labels (:z shared, :Z private)" + ) + + def test_rootless_systemd_path(self): + """Skill teaches rootless systemd service location ~/.config/systemd/user/ + vs /etc/systemd/system/ for rootful. Without skill, agents only know rootful.""" + c = read_report() + assert ".config/systemd/user" in c or "rootless" in c.lower(), ( + "should address rootless systemd path (~/.config/systemd/user/)" + ) + + def test_enable_linger(self): + """Skill teaches loginctl enable-linger required for rootless user services + to survive logout. 
Without skill, agents miss this requirement.""" + c = read_report().lower() + assert "enable-linger" in c or "loginctl" in c or "linger" in c, ( + "should mention loginctl enable-linger for rootless services" + ) + + def test_semanage_fcontext(self): + """Skill teaches semanage fcontext + restorecon for setting SELinux context + on application files. Without skill, agents skip file context management.""" + c = read_report().lower() + assert ("semanage fcontext" in c or "semanage" in c) and ( + "restorecon" in c or "fcontext" in c + ), "should use semanage fcontext + restorecon for file SELinux context" + + def test_firewall_port(self): + """Skill teaches firewall-cmd for opening application ports.""" + c = read_report().lower() + assert "firewall-cmd" in c or ("firewall" in c and "port" in c), ( + "should address firewall port configuration" + ) + + def test_systemd_hardening_directives(self): + """Docs teach systemd hardening directives: NoNewPrivileges=true, + ProtectSystem=strict, ReadWritePaths. Without docs, agents create basic + unit files without security hardening.""" + c = read_report() + assert any(t in c for t in [ + "NoNewPrivileges", "ProtectSystem", "ReadWritePaths", + "PrivateTmp", "ProtectHome", + ]) or "hardening" in c.lower(), ( + "should include systemd hardening directives (NoNewPrivileges, ProtectSystem)" + ) + + def test_container_security_practices(self): + """Skill teaches defence-in-depth for containers: dropping capabilities, + resource limits, read-only root, security options. 
Without skill, + agents deploy containers with default security settings.""" + c = read_report().lower() + practices = sum(1 for t in [ + "cap-drop", "cap_drop", "capability", + "--read-only", "read-only root", + "resource limit", "memory", "cpus", + "no-new-privileges", "security-opt", + ] if t in c) + assert practices >= 2, ( + "should address at least 2 container security practices " + "(capability dropping, resource limits, read-only root, security options)" + ) diff --git a/evaluation/with_skills/rh-developer__s2i-build/environment/Dockerfile b/evaluation/with_skills/rh-developer__s2i-build/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__s2i-build/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs 
 /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__s2i-build/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__s2i-build/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__s2i-build/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed at task integration-test, logs in step-test container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
 "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
 "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__s2i-build/instruction.md b/evaluation/with_skills/rh-developer__s2i-build/instruction.md new file mode 100644 index 00000000..107967b9 --- /dev/null +++ b/evaluation/with_skills/rh-developer__s2i-build/instruction.md @@ -0,0 +1,12 @@ +# S2I Build Configuration Task + +You are a Red Hat developer. Configure a Source-to-Image (S2I) build for a Python web application. + +## Requirements +- Select the appropriate builder image +- Configure the build process and entry point +- Address application startup configuration + +Use MCP tools to examine the cluster. Document your methodology, configuration, and build plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-developer__s2i-build/solution/solve.sh b/evaluation/with_skills/rh-developer__s2i-build/solution/solve.sh new file mode 100644 index 00000000..a25acec6 --- /dev/null +++ b/evaluation/with_skills/rh-developer__s2i-build/solution/solve.sh @@ -0,0 +1,60 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# S2I Build Configuration + +## Problem +Python Flask app uses `main.py` as entry point, not the default `app.py`. + +## Solution +1. Create ImageStream for output image +2. Create BuildConfig with `APP_MODULE=main:app` in `sourceStrategy.env` +3. 
Ensure `gunicorn` is in `requirements.txt` + +### ImageStream +```yaml +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: flask-app + labels: + app: flask-app +spec: + lookupPolicy: + local: false +``` + +### BuildConfig +```yaml +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: flask-app +spec: + source: + type: Git + git: + uri: https://github.com/example/flask-app + strategy: + type: Source + sourceStrategy: + from: + kind: ImageStreamTag + name: python:3.11-ubi9 + namespace: openshift + env: + - name: APP_MODULE + value: "main:app" + output: + to: + kind: ImageStreamTag + name: flask-app:latest +``` + +### S2I Build Phases +- **Assemble**: Install dependencies from requirements.txt (including gunicorn), compile assets. Customizable via `.s2i/bin/assemble`. +- **Run**: Start the application using gunicorn with APP_MODULE. Customizable via `.s2i/bin/run`. + +### Why APP_MODULE is needed +S2I Python startup sequence: app.sh → gunicorn+APP_MODULE → app.py → ERROR +Since entry is main.py (not app.py), gunicorn must be installed and APP_MODULE must point to main:app. 
+REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__s2i-build/task.toml b/evaluation/with_skills/rh-developer__s2i-build/task.toml new file mode 100644 index 00000000..8dedc143 --- /dev/null +++ b/evaluation/with_skills/rh-developer__s2i-build/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__s2i-build" +name = "rh-developer S2I Build Configuration Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "s2i-build", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__s2i-build/tests/llm_judge.py b/evaluation/with_skills/rh-developer__s2i-build/tests/llm_judge.py new file mode 100644 index 00000000..5fbc562a --- /dev/null +++ b/evaluation/with_skills/rh-developer__s2i-build/tests/llm_judge.py @@ -0,0 +1,114 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "app_module_in_buildconfig", + "file": "/root/report.md", + "question": "Does the report specify that APP_MODULE should be set in the BuildConfig's sourceStrategy.env section (not as a generic environment variable), using the module:callable format (e.g., app:app or main:app)?", + "reference": "A skilled report places APP_MODULE in sourceStrategy.env of the BuildConfig YAML, using the module:callable format. An unskilled report mentions APP_MODULE generically without specifying its placement in sourceStrategy.env." 
+ }, + { + "id": "s2i_build_phases", + "file": "/root/report.md", + "question": "Does the report explain S2I build phases (assemble for dependency installation and compilation, run for application startup) and how they can be customized via .s2i/bin/ scripts?", + "reference": "A skilled report explains the assemble and run phases and mentions .s2i/bin/assemble or .s2i/bin/run for customization. An unskilled report treats S2I as a monolithic process." + }, + { + "id": "gunicorn_dependency", + "file": "/root/report.md", + "question": "Does the report explicitly state that gunicorn must be in requirements.txt specifically BECAUSE the S2I Python builder uses gunicorn to serve the application specified by APP_MODULE?", + "reference": "A skilled report identifies gunicorn as a required dependency for Python S2I with APP_MODULE. An unskilled report doesn't link gunicorn to the entry point mechanism." + }, + { + "id": "imagestream_as_separate_resource", + "file": "/root/report.md", + "question": "Does the report include a standalone ImageStream YAML manifest (with apiVersion: image.openshift.io/v1 and kind: ImageStream) as a separate resource definition, rather than only referencing ImageStreamTag within the BuildConfig output section?", + "reference": "A skilled report defines the ImageStream as its own YAML resource with apiVersion: image.openshift.io/v1, kind: ImageStream, and lookupPolicy configuration, created as a prerequisite before the BuildConfig. An unskilled report only references ImageStreamTag as an output target in the BuildConfig but does not show the ImageStream resource definition." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__s2i-build/tests/test.sh b/evaluation/with_skills/rh-developer__s2i-build/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__s2i-build/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-developer__s2i-build/tests/test_outputs.py b/evaluation/with_skills/rh-developer__s2i-build/tests/test_outputs.py new file mode 100644 index 00000000..ec2af10d --- /dev/null +++ b/evaluation/with_skills/rh-developer__s2i-build/tests/test_outputs.py @@ -0,0 +1,84 @@ +""" +Tests for rh-developer__s2i-build per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_s2i(self): + content = read_report().lower() + assert "s2i" in content or "source-to-image" in content, ( + "report should mention S2I" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_app_module_format(self): + """Skill teaches APP_MODULE env var format module:app (e.g. main:app) for + non-default Python entry points. Without skill, agents don't know this format.""" + c = read_report() + assert "APP_MODULE" in c or "app_module" in c.lower(), ( + "should reference APP_MODULE env var for Python S2I entry point" + ) + + def test_module_colon_app_syntax(self): + """Skill teaches the module:app syntax (e.g., main:app, wsgi:application). 
+ Without skill, agents don't know the colon-separated format.""" + c = read_report() + assert any(t in c for t in ["main:app", "wsgi:app", "module:app", ":app", ":application"]) or ( + "APP_MODULE" in c and ":" in c + ), "should show module:app format for APP_MODULE" + + def test_s2i_build_phases(self): + """Skill teaches S2I build phases: assemble (install deps, compile) and + run (start app). Without skill, agents treat S2I as a black box.""" + c = read_report().lower() + assert ("assemble" in c and ("run" in c or "start" in c)) or ( + "build phase" in c or "build step" in c or "build process" in c + ), "should explain S2I build phases (assemble and run)" + + def test_buildconfig_imagestream(self): + """Skill teaches creating ImageStream + BuildConfig with source/builder/output.""" + c = read_report().lower() + assert any(t in c for t in ["buildconfig", "imagestream", "build config"]) and any(t in c for t in [ + "source", "builder", "output" + ]), "should define BuildConfig/ImageStream" + + def test_gunicorn_requirement(self): + """Skill teaches gunicorn must be in requirements.txt for APP_MODULE.""" + c = read_report().lower() + assert "gunicorn" in c and any(t in c for t in [ + "requirements", "pip", "install", "wsgi", "app_module" + ]), "should address gunicorn requirement for S2I Python" + + def test_standalone_imagestream_yaml(self): + """Skill teaches creating ImageStream as a separate resource with + image.openshift.io/v1 API group and lookupPolicy. 
Without skill, + agents reference ImageStreamTag in BuildConfig but don't define + the ImageStream resource itself.""" + c = read_report() + has_is_api = "image.openshift.io" in c + has_lookup = "lookupPolicy" in c + assert has_is_api or has_lookup, ( + "should define ImageStream resource with image.openshift.io API" + ) + diff --git a/evaluation/with_skills/rh-developer__validate-environment/environment/Dockerfile b/evaluation/with_skills/rh-developer__validate-environment/environment/Dockerfile new file mode 100644 index 00000000..1cbfefcf --- /dev/null +++ b/evaluation/with_skills/rh-developer__validate-environment/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers 
/root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-developer__validate-environment/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/with_skills/rh-developer__validate-environment/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/with_skills/rh-developer__validate-environment/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-developer__validate-environment/instruction.md b/evaluation/with_skills/rh-developer__validate-environment/instruction.md new file mode 100644 index 00000000..b9024f98 --- /dev/null +++ b/evaluation/with_skills/rh-developer__validate-environment/instruction.md @@ -0,0 +1,13 @@ +# Environment Validation Task + +You are a Red Hat developer. Before deploying a new application, you need to confirm the OpenShift environment is ready and properly configured. + +## Requirements +- Verify cluster connectivity: confirm you can reach the API server and authenticate successfully +- Check namespace readiness: does the target namespace exist, and do you have permissions to create deployments, services, and routes in it? +- Verify resource availability: are there sufficient CPU and memory quotas remaining for a new deployment? +- Produce a readiness checklist with pass/fail status for each check and an overall go/no-go recommendation + +Document your validation results and readiness assessment in `/root/report.md`. + +Use MCP tools to examine the cluster. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-developer__validate-environment/solution/solve.sh b/evaluation/with_skills/rh-developer__validate-environment/solution/solve.sh new file mode 100644 index 00000000..3cb34892 --- /dev/null +++ b/evaluation/with_skills/rh-developer__validate-environment/solution/solve.sh @@ -0,0 +1,36 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Environment Validation Report + +## Validation Scope: All +(Options: All, OpenShift, RHEL/Containers, Minimal) + +### Tool Availability +| Tool | Status | Version | +|------|--------|---------| +| git | OK | 2.43.0 | +| curl | OK | 8.5.0 | +| jq | OK | 1.7.1 | +| oc | OK | 4.14.0 | +| helm | OK | 3.14.0 | +| podman | OK | 4.9.0 | +| skopeo | OK | 1.14.0 | +| ssh | OK | OpenSSH 9.6 | + +Status indicators: OK (working), MISSING (not in PATH), WARN (optional missing). + +### OpenShift Permissions (oc auth can-i) +| Resource | Action | Status | +|----------|--------|--------| +| deployments | create | OK | +| buildconfigs | create | OK | +| imagestreams | create | OK | + +### Connectivity +- Cluster: Connected (`oc whoami` → admin) +- Podman info: `podman info --format '{{.Host.OS}} {{.Host.Arch}}'` → linux amd64 + +### Ready for +/detect-project, /s2i-build, /deploy, /helm-deploy, /containerize-deploy + +REPORT_EOF diff --git a/evaluation/with_skills/rh-developer__validate-environment/task.toml b/evaluation/with_skills/rh-developer__validate-environment/task.toml new file mode 100644 index 00000000..10df7267 --- /dev/null +++ b/evaluation/with_skills/rh-developer__validate-environment/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__validate-environment" +name = "rh-developer Environment Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "validate-environment", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = 
"${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-developer__validate-environment/tests/llm_judge.py b/evaluation/with_skills/rh-developer__validate-environment/tests/llm_judge.py new file mode 100644 index 00000000..3545ef59 --- /dev/null +++ b/evaluation/with_skills/rh-developer__validate-environment/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "skopeo_validation", + "file": "/root/report.md", + "question": "Does the report validate skopeo as a required tool in the environment?", + "reference": "A skilled report checks that skopeo is installed and available, as it is needed for image recommendation workflows. An unskilled report skips skopeo validation." + }, + { + "id": "rbac_permission_checks", + "file": "/root/report.md", + "question": "Does the report verify that the user has permissions to create deployments, buildconfigs, or imagestreams in the target namespace?", + "reference": "A skilled report checks create permissions for deployments, buildconfigs, and imagestreams — either via 'oc auth can-i' commands or via MCP/API queries that verify the same permissions. An unskilled report only checks identity (oc whoami) without verifying specific resource permissions." + }, + { + "id": "structured_validation_report", + "file": "/root/report.md", + "question": "Does the report present environment validation results in a structured pass/fail format with remediation for failures?", + "reference": "A skilled report presents each check as pass/fail with an overall go/no-go assessment and remediation steps. An unskilled report lists tools without structured evaluation." 
+ } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + 
print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-developer__validate-environment/tests/test.sh b/evaluation/with_skills/rh-developer__validate-environment/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-developer__validate-environment/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo 
"════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward 
(pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-developer__validate-environment/tests/test_outputs.py b/evaluation/with_skills/rh-developer__validate-environment/tests/test_outputs.py new file mode 100644 index 00000000..8f62b808 --- /dev/null +++ b/evaluation/with_skills/rh-developer__validate-environment/tests/test_outputs.py @@ -0,0 +1,67 @@ +""" +Tests for rh-developer__validate-environment per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_environment(self): + content = read_report().lower() + assert any(t in content for t in ["environment", "cluster", "ready", "validation"]), ( + "report should mention environment validation" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 100, "report should have substantial content" + + +class TestSkillDependent: + def test_skopeo_as_required_tool(self): + """Skill teaches skopeo is a required dependency for image recommendation flows. + Without skill, agents skip skopeo in environment validation.""" + c = read_report().lower() + assert "skopeo" in c, ( + "should validate skopeo as a required tool" + ) + + def test_oc_auth_can_i_checks(self): + """Skill teaches oc auth can-i create deployments/buildconfigs/imagestreams + for permission checks. 
Without skill, agents only check oc whoami.""" + c = read_report().lower() + has_permission_method = ("auth can-i" in c or "can-i" in c or "permission" in c) + has_resource_type = any(t in c for t in [ + "deployment", "buildconfig", "imagestream", "create" + ]) + assert has_permission_method and has_resource_type, ( + "should verify create permissions for deployments/buildconfigs/imagestreams" + ) + + def test_tool_version_checks(self): + """Skill teaches checking version/availability of oc, helm, podman, git.""" + c = read_report().lower() + tools = ["oc", "helm", "podman", "git", "skopeo"] + mentioned = sum(1 for t in tools if t in c) + assert mentioned >= 3, "should validate multiple CLI tools" + + def test_structured_pass_fail(self): + """Skill teaches presenting results as pass/fail per check.""" + c = read_report().lower() + assert any(t in c for t in ["pass", "fail", "missing", "go", "no-go", "available"]) and any(t in c for t in [ + "tool", "check", "oc", "helm", "result" + ]), "should provide structured pass/fail validation report" diff --git a/evaluation/with_skills/rh-sre__cve-impact/environment/Dockerfile b/evaluation/with_skills/rh-sre__cve-impact/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-impact/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs 
/root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__cve-impact/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__cve-impact/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-impact/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{40+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__cve-impact/instruction.md b/evaluation/with_skills/rh-sre__cve-impact/instruction.md new file mode 100644 index 00000000..00b38e1d --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-impact/instruction.md @@ -0,0 +1,14 @@ +# CVE Impact Analysis Task + +You are a Red Hat SRE. A critical vulnerability has been announced, and management needs to know how many of your systems are affected before deciding on emergency patching. 
+ +## Requirements +- Query your fleet to identify all systems affected by the CVE +- Break down the impact by environment (production vs staging vs development) and by RHEL version +- Report total affected system count, and flag any high-criticality systems (e.g., customer-facing, compliance-regulated) +- If results span multiple pages, ensure you capture the complete picture +- Provide a risk summary: severity, exposure scope, and recommended urgency level + +Document your methodology, impact analysis, and risk assessment in `/root/report.md`. + +Use MCP tools to query vulnerability data. If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously; do not stop to ask for user confirmation or input at any checkpoint. Use reasonable defaults (e.g., fetch all available data) and proceed through every step to produce the final report. diff --git a/evaluation/with_skills/rh-sre__cve-impact/solution/solve.sh b/evaluation/with_skills/rh-sre__cve-impact/solution/solve.sh new file mode 100644 index 00000000..fbbfb891 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-impact/solution/solve.sh @@ -0,0 +1,15 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# CVE Impact Analysis + +## CVE-2024-12345 +- Severity: Critical (CVSS 9.8) +- Affected systems: 6 +- Patched: 2 +- Vulnerable: 4 + +## Pagination +Used limit=100 per page and system_uuid for system-level queries. Fetching only the first page often returns 0 remediatable CVEs, since a system may have 1700+ CVEs (~18 API calls). Recommend "all pages" for remediatable queries. 
+ +## Data parsed using cve-response-parser.py +REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__cve-impact/task.toml b/evaluation/with_skills/rh-sre__cve-impact/task.toml new file mode 100644 index 00000000..1ef53278 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-impact/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__cve-impact" +name = "rh-sre CVE Impact Analysis Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "cve-impact", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__cve-impact/tests/llm_judge.py b/evaluation/with_skills/rh-sre__cve-impact/tests/llm_judge.py new file mode 100644 index 00000000..91bf4254 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-impact/tests/llm_judge.py @@ -0,0 +1,94 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "comprehensive_cve_coverage", "file": "/root/report.md", "question": "Does the report analyze multiple CVEs from the fleet inventory (at least 3 distinct CVE IDs) with per-CVE detail, rather than focusing only on a single CVE?", "reference": "A skilled report comprehensively assesses all CVEs affecting the fleet (CVE-2024-12345, 54321, 98765, 11111, 22222) with individual analysis. 
An unskilled report often focuses only on the primary CVE-2024-12345."}, + {"id": "multi_environment_breakdown", "file": "/root/report.md", "question": "Does the report break down CVE impact across at least 3 distinct environment tiers (e.g., production, staging, development, QA, legacy) with per-environment system counts or status?", "reference": "A skilled report categorizes affected systems by environment tier (production, staging, dev, QA, legacy) with counts per environment. An unskilled report provides aggregate totals without environment-level detail."}, + {"id": "risk_assessment", "file": "/root/report.md", "question": "Does the report provide a risk assessment that considers multiple factors such as CVSS score, affected system count, and environment criticality?", "reference": "A skilled report includes a multi-factor risk assessment. An unskilled report gives generic severity ratings without combining multiple factors."}, + {"id": "operational_priority_ranking", "file": "/root/report.md", "question": "Does the report assign explicit operational priority tiers (like P0/P1/P2 or Priority 1/2/3) to CVEs with associated SLA timeframes (e.g., within 24 hours, within 7 days), going beyond just severity labels?", "reference": "A skilled report maps CVE severity to operational priority tiers (P0=immediate/24h, P1=7 days, P2=30 days) with concrete remediation deadlines. An unskilled report uses only vendor severity labels (Critical/Important/Moderate) without operational priority mapping."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__cve-impact/tests/test.sh b/evaluation/with_skills/rh-sre__cve-impact/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-impact/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__cve-impact/tests/test_outputs.py b/evaluation/with_skills/rh-sre__cve-impact/tests/test_outputs.py new file mode 100644 index 00000000..d5edc006 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-impact/tests/test_outputs.py @@ -0,0 +1,92 @@ +""" +Tests for rh-sre__cve-impact per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_cve(self): + content = read_report().lower() + assert "cve" in content, "report should mention CVEs" + + def test_mentions_impact(self): + content = read_report().lower() + assert any(t in content for t in ["impact", "affected", "system", "fleet"]), ( + "report should discuss impact" + ) + + +class TestSkillDependent: + def test_full_cve_coverage(self): + """Skill teaches comprehensive fleet-wide CVE assessment across all CVEs. + Without skill, agents often focus only on the primary CVE.""" + c = read_report() + cve_ids = ["CVE-2024-12345", "CVE-2024-54321", "CVE-2024-98765", + "CVE-2024-11111", "CVE-2024-22222"] + found = sum(1 for cve in cve_ids if cve in c) + assert found >= 3, ( + f"should analyze multiple CVEs from fleet (found {found}/5); " + "skill teaches comprehensive multi-CVE assessment" + ) + + def test_prioritized_remediation_order(self): + """Skill teaches prioritizing CVEs with explicit priority ranking + (P0/P1/P2 or similar ordered tiers). 
Without skill, agents list by + severity without operational priority ranking.""" + c = read_report() + has_priority = any(t in c for t in [ + "P0", "P1", "P2", "Priority 0", "Priority 1", "Priority 2", + ]) or any(t in c.lower() for t in [ + "priority order", "remediation priority", "remediation order", + "triage priority", "priority ranking", "prioritized order", + ]) + assert has_priority, ( + "should assign explicit priority ranking (P0/P1/P2 or equivalent) to CVEs" + ) + + def test_multi_environment_breakdown(self): + """Skill teaches breaking down impact by environment (prod/staging/dev/QA/legacy). + Without skill, agents report aggregate counts without per-environment detail.""" + c = read_report().lower() + envs = ["production", "staging", "development", "qa", "legacy", "dev"] + found = sum(1 for e in envs if e in c) + assert found >= 3, ( + f"should break down impact across multiple environments (found {found}); " + "skill teaches per-environment categorization" + ) + + def test_risk_assessment_structure(self): + """Skill: Risk assessment with CVSS, affected count, environment criticality.""" + c = read_report().lower() + has_risk = any(t in c for t in ["risk", "priority", "urgency", "criticality"]) + has_factors = any(t in c for t in ["cvss", "affect", "severity", "count", "staging", "criticality"]) + assert has_risk and has_factors, ( + "should provide risk assessment with multiple factors (skill: Step 5)" + ) + + def test_classification_methodology(self): + """Skill teaches using classification criteria/methodology for CVE interpretation. 
+ Without skill, agents classify severity ad-hoc.""" + c = read_report().lower() + assert any(t in c for t in [ + "classification", "methodology", "criteria", + "vulnerability-logic", "cvss-scoring", + "scoring framework", "risk framework", + ]) or ("consult" in c and "reference" in c), ( + "should reference classification methodology for CVE interpretation" + ) diff --git a/evaluation/with_skills/rh-sre__cve-validation/environment/Dockerfile b/evaluation/with_skills/rh-sre__cve-validation/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-validation/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__cve-validation/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__cve-validation/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 
00000000..fe5d549c --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-validation/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + 
tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": 
f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": 
["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + 
sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": 
"sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__cve-validation/instruction.md b/evaluation/with_skills/rh-sre__cve-validation/instruction.md new file mode 100644 index 00000000..27325f5c --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-validation/instruction.md @@ -0,0 +1,12 @@ +# CVE Validation Task + +You are a Red Hat SRE. Validate a set of CVEs to determine which are real, applicable, and remediable on your fleet. + +## Requirements +- Validate CVE identifiers and severity +- Determine which CVEs have available fixes or advisories +- Classify CVEs by remediation status + +Use MCP tools to query vulnerability data. Document your methodology, validation results, and classification in `/root/report.md`. 
+ +If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously — do not stop after preliminary steps like MCP validation. Proceed through CVE querying, validation, classification, and report generation without waiting for user input. diff --git a/evaluation/with_skills/rh-sre__cve-validation/solution/solve.sh b/evaluation/with_skills/rh-sre__cve-validation/solution/solve.sh new file mode 100644 index 00000000..f4350508 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-validation/solution/solve.sh @@ -0,0 +1,14 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# CVE Validation Report + +## CVE-2024-12345 +- Format: Valid (^CVE-\d{4}-\d{4,7}$) +- Advisory available: Yes (advisory_available, advisories_list) +- Do NOT use rules[] for remediation decision +- Remediation status: automated_remediation_available +- Validation status: valid +- Severity: Critical (Red Hat) +- Affected packages: httpd 2.4.37-1.el8 → 2.4.37-2.el8 +- Priority: P0 (24 hours) +REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__cve-validation/task.toml b/evaluation/with_skills/rh-sre__cve-validation/task.toml new file mode 100644 index 00000000..98d08db5 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-validation/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__cve-validation" +name = "rh-sre CVE Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "cve-validation", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__cve-validation/tests/llm_judge.py 
b/evaluation/with_skills/rh-sre__cve-validation/tests/llm_judge.py new file mode 100644 index 00000000..f0df9c9c --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-validation/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "advisory_not_rules", "file": "/root/report.md", "question": "Does the report use advisory_available or advisories_list (not rules[]) to determine remediation availability?", "reference": "A skilled report checks advisory_available/advisories_list for remediation status. An unskilled report incorrectly uses rules[] which is the Advisor engine."}, + {"id": "format_validation", "file": "/root/report.md", "question": "Does the report validate CVE format and accept 4-7 digit sequence numbers?", "reference": "A skilled report accepts CVE IDs with 4-7 digit sequences. An unskilled report may reject valid CVEs with non-5-digit sequences."}, + {"id": "structured_output", "file": "/root/report.md", "question": "Does the report output validation_status and remediation availability in a structured format?", "reference": "A skilled report presents clear validation_status and automated_remediation_available fields."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__cve-validation/tests/test.sh b/evaluation/with_skills/rh-sre__cve-validation/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-validation/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__cve-validation/tests/test_outputs.py b/evaluation/with_skills/rh-sre__cve-validation/tests/test_outputs.py new file mode 100644 index 00000000..21b9262c --- /dev/null +++ b/evaluation/with_skills/rh-sre__cve-validation/tests/test_outputs.py @@ -0,0 +1,81 @@ +""" +Tests for rh-sre__cve-validation per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_cve(self): + content = read_report().lower() + assert "cve" in content, "report should mention CVEs" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_format_then_api_validation(self): + """Skill: Validate format (regex) first; if valid, ALWAYS call get_cve; do not reject on year/sequence.""" + c = read_report().lower() + has_format = any(t in c for t in ["regex", "pattern", "cve-", "cve-format", "year/sequence"]) + has_api_call = any(t in c for t in ["get_cve", "call", "api", "retrieve", "fetch"]) + assert has_format and has_api_call, ( + "should validate format then call get_cve (skill: do NOT reject on year/sequence before API)" + ) + + def test_advisory_available_not_rules(self): + """Skill teaches that remediation is determined by advisory_available/advisories_list/remediation field, NOT by rules[].""" + c = read_report().lower() + assert any(t in c for t in ["advisory_available", "advisories_list"]), ( + "should use advisory_available or advisories_list for remediation (skill: rules[] is wrong)" + ) + + def 
test_cve_regex_acceptance(self): + """Skill teaches CVE sequence is 4-7 digits (not always 5).""" + c = read_report().lower() + assert any(t in c for t in ["4,7", "4-7", "4-7 digit", "4 to 7", "regex"]), ( + "should accept CVE sequence 4-7 digits (skill: not always 5 digits)" + ) + + def test_validation_status_output(self): + """Skill: Return validation_status and remediation_status.automated_remediation_available.""" + c = read_report().lower() + has_status = any(t in c for t in ["validation_status", "valid", "invalid", "not_remediable"]) + has_remediation_flag = any(t in c for t in ["automated_remediation", "automated", "manual", "remediat"]) + assert has_status and has_remediation_flag, ( + "should output validation_status and remediation availability" + ) + + def test_affected_packages_with_versions(self): + """Skill: Identify affected packages with current and fixed versions.""" + c = read_report().lower() + has_packages = any(t in c for t in ["package", "affected", "component"]) + has_versions = any(t in c for t in ["version", "fixed", "current", "el8", "el9"]) + assert has_packages and has_versions, ( + "should identify packages with version info (skill: for playbook-generator)" + ) + + def test_remediation_field_value(self): + """Docs teach remediation==2 means automated remediation available. 
+ Without docs, agents don't know the numeric remediation field semantics.""" + c = read_report().lower() + assert any(t in c for t in [ + "remediation==2", "remediation=2", "remediation field", "remediation value", + "automated remediation", + ]), "should interpret remediation field value (2=automated)" diff --git a/evaluation/with_skills/rh-sre__execution-summary/environment/Dockerfile b/evaluation/with_skills/rh-sre__execution-summary/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__execution-summary/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__execution-summary/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__execution-summary/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ 
b/evaluation/with_skills/rh-sre__execution-summary/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + 
systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": 
f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + 
systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, 
+ "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no 
explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": 
"web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers (IDs follow generate_mock_systems: app-prod-01 is sys-015) + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers (app-stg-01 is sys-038) + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__execution-summary/instruction.md b/evaluation/with_skills/rh-sre__execution-summary/instruction.md new file mode 100644 index 00000000..5521bb63 --- /dev/null +++ b/evaluation/with_skills/rh-sre__execution-summary/instruction.md @@ -0,0 +1,15 @@ +# Execution Summary Task + +You are a Red Hat SRE. Your team just completed an emergency remediation of a critical CVE across your managed fleet. Management needs a structured post-incident execution summary. + +## Scenario +A critical kernel vulnerability was announced. Your team used automation tools to identify affected systems, generate remediation playbooks, execute patching, and verify the fix. Now you need to document what was done. 
+ +## Requirements +- Use MCP tools to query the current state of the fleet, identify which systems were affected, and gather evidence of remediation actions taken +- Produce an execution summary that includes: what was done, which tools and automation were used, the sequence of steps, results and verification outcomes, and any remaining gaps +- Structure the summary so it can be reviewed by management and used for future incident response improvement + +Document the full execution summary, including your methodology and the tools used, in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__execution-summary/solution/solve.sh b/evaluation/with_skills/rh-sre__execution-summary/solution/solve.sh new file mode 100644 index 00000000..68891309 --- /dev/null +++ b/evaluation/with_skills/rh-sre__execution-summary/solution/solve.sh @@ -0,0 +1,13 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Execution Summary + +**** EXECUTION SUMMARY START **** +Agents: None +Skills: rh-sre:fleet-inventory,rh-sre:cve-impact +Tools: lightspeed-mcp:get_host_details,lightspeed-mcp:get_cves +Docs: docs/references/cvss-scoring.md,docs/insights/vulnerability-logic.md +**** EXECUTION SUMMARY END **** + +This summary shows all agents, skills, tools, and documentation used during the workflow. 
+REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__execution-summary/task.toml b/evaluation/with_skills/rh-sre__execution-summary/task.toml new file mode 100644 index 00000000..a983e99f --- /dev/null +++ b/evaluation/with_skills/rh-sre__execution-summary/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__execution-summary" +name = "rh-sre Execution Summary Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "execution-summary", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__execution-summary/tests/llm_judge.py b/evaluation/with_skills/rh-sre__execution-summary/tests/llm_judge.py new file mode 100644 index 00000000..c426d912 --- /dev/null +++ b/evaluation/with_skills/rh-sre__execution-summary/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "tool_purpose_attribution", "file": "/root/report.md", "question": "Does the report pair each tool or MCP call with an explicit purpose statement explaining why it was invoked, rather than just listing tools used or describing actions narratively?", "reference": "A skilled report explicitly pairs each tool (e.g., get_host_details, get_cves) with a 'Purpose:' or 'used to' statement. 
An unskilled report lists tools in a table or mentions them inline without structured per-tool purpose mapping."}, + {"id": "tool_purpose_categorization", "file": "/root/report.md", "question": "Does the report organize MCP tools into named functional categories (e.g., 'Vulnerability Assessment Tools', 'Fleet Management Tools', 'Remediation Tools') rather than listing all tools in a single flat list?", "reference": "A skilled report groups tools by their functional purpose into labeled sub-sections (e.g., 'Assessment Tools: get_cves, get_cve', 'Fleet Tools: get_host_details'). An unskilled report lists all tools sequentially without categorizing them by function."}, + {"id": "methodology_phases", "file": "/root/report.md", "question": "Does the report organize the execution into at least two named methodology phases (e.g., 'Data Collection', 'Evidence Gathering', 'Triage', 'Verification') as distinct sections or headings?", "reference": "A skilled report structures execution into named phases as headings or sections. An unskilled report presents a flat narrative or simple numbered list without phase-level organization."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
 [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__execution-summary/tests/test.sh b/evaluation/with_skills/rh-sre__execution-summary/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__execution-summary/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__execution-summary/tests/test_outputs.py b/evaluation/with_skills/rh-sre__execution-summary/tests/test_outputs.py new file mode 100644 index 00000000..6cd1228a --- /dev/null +++ b/evaluation/with_skills/rh-sre__execution-summary/tests/test_outputs.py @@ -0,0 +1,55 @@ +""" +Tests for rh-sre__execution-summary per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: skill-specific patterns (not generic report quality). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['summary', 'execution', 'remediation']), ( + "report should mention execution summary or remediation" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 300, "execution summary should be substantial" + + +class TestSkillDependent: + def test_methodology_phases(self): + """Structuring execution into methodology phases + (data collection, evidence gathering, etc.).""" + c = read_report().lower() + phase_terms = [ + "data collection", "evidence gathering", "discovery", + "triage", "assessment", "verification", + "phase 1", "phase 2", "step 1", "step 2", + ] + found = sum(1 for t in phase_terms if t in c) + assert found >= 2, ( + f"should organize execution into methodology phases (found {found})" + ) + + def test_docs_from_consulted(self): + """Extract docs from 'I consulted' statements; path from docs/ or skills/ onwards.""" + c = read_report().lower() + has_docs = any(t in c for t in ["docs/", "skills/", "consult", "documentation"]) + assert has_docs, ( + "should list documentation consulted" + ) diff --git 
a/evaluation/with_skills/rh-sre__fleet-inventory/environment/Dockerfile b/evaluation/with_skills/rh-sre__fleet-inventory/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__fleet-inventory/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__fleet-inventory/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__fleet-inventory/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/with_skills/rh-sre__fleet-inventory/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. 
Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 
20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i 
in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + 
"tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else 
"9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": 
["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", 
"display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": 
 "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers (sys-015..sys-020) + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers (sys-038..sys-040) + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently."
 + ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed.
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__fleet-inventory/instruction.md b/evaluation/with_skills/rh-sre__fleet-inventory/instruction.md new file mode 100644 index 00000000..3074bb9c --- /dev/null +++ b/evaluation/with_skills/rh-sre__fleet-inventory/instruction.md @@ -0,0 +1,17 @@ +# Fleet Inventory Task + +You are a Red Hat SRE. Your manager has asked for a current snapshot of all RHEL systems in your managed fleet ahead of an upcoming compliance audit. + +## Scenario +The compliance team needs to know exactly what systems you manage, their RHEL versions, patch levels, and any outstanding vulnerability exposure. They need this by end of day. 
+ +## Requirements +- Query the fleet to enumerate all managed RHEL systems +- For each system, report: hostname, RHEL version, last check-in date, and patch status +- Identify which systems have outstanding CVEs, grouped by severity +- Flag any systems that are stale (not checking in) or running unsupported RHEL versions +- Summarize the fleet's overall health and compliance readiness + +Document your methodology, findings, and fleet assessment in `/root/report.md`. + +Use MCP tools to query data. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__fleet-inventory/solution/solve.sh b/evaluation/with_skills/rh-sre__fleet-inventory/solution/solve.sh new file mode 100644 index 00000000..dc994408 --- /dev/null +++ b/evaluation/with_skills/rh-sre__fleet-inventory/solution/solve.sh @@ -0,0 +1,25 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Fleet Inventory Report + +## Systems Summary +| Hostname | RHEL | Environment | Status | Last Seen | +|----------|------|-------------|--------|-----------| +| web-01 | 9.3 | Production | Active | 2024-01-15 | +| db-01 | 9.3 | Production | Active | 2024-01-15 | +| dev-01 | 8.9 | Development | Stale | 2024-01-01 | + +## Data Source +Queried via `get_host_details` with pagination. Key fields: rhel_version, tags, stale, last_seen. + +## CVE-Affected Systems +Use `get_cve_systems` with cve_id (CVE-YYYY-NNNNN). Check remediation_available flag. + +## Status Interpretation +- **Vulnerable**: CVE affects system, patch not applied → suggest /remediation +- **Patched**: Previously affected, now remediated → no action +- **Not Affected**: Exclude from affected count + +## Next Steps +For CVE remediation, transition to /remediation skill. 
+REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__fleet-inventory/task.toml b/evaluation/with_skills/rh-sre__fleet-inventory/task.toml new file mode 100644 index 00000000..cff6fe66 --- /dev/null +++ b/evaluation/with_skills/rh-sre__fleet-inventory/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__fleet-inventory" +name = "rh-sre Fleet Inventory Query Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "fleet-inventory", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__fleet-inventory/tests/llm_judge.py b/evaluation/with_skills/rh-sre__fleet-inventory/tests/llm_judge.py new file mode 100644 index 00000000..977611c9 --- /dev/null +++ b/evaluation/with_skills/rh-sre__fleet-inventory/tests/llm_judge.py @@ -0,0 +1,92 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "system_id_for_remediation", "file": "/root/report.md", "question": "Does the report track individual system identifiers (system_uuid, system_id, or host UUID) and link them to specific remediation follow-up actions, rather than just listing hostnames?", "reference": "A skilled report captures system UUIDs or identifiers to enable programmatic remediation API calls. 
An unskilled report lists hostnames or display names without machine-usable identifiers for follow-up."}, + {"id": "classification_methodology", "file": "/root/report.md", "question": "Does the report reference a classification methodology, classification criteria, or vulnerability classification framework for interpreting CVE status, rather than using ad-hoc severity labeling?", "reference": "A skilled report consults or references CVE classification criteria or methodology documentation before interpreting vulnerability data. An unskilled report classifies CVEs based on general knowledge without referencing established criteria."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__fleet-inventory/tests/test.sh b/evaluation/with_skills/rh-sre__fleet-inventory/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__fleet-inventory/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__fleet-inventory/tests/test_outputs.py b/evaluation/with_skills/rh-sre__fleet-inventory/tests/test_outputs.py new file mode 100644 index 00000000..f8c232d0 --- /dev/null +++ b/evaluation/with_skills/rh-sre__fleet-inventory/tests/test_outputs.py @@ -0,0 +1,67 @@ +""" +Tests for rh-sre__fleet-inventory per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['system', 'host', 'fleet', 'inventory']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_system_identifier_tracking(self): + """Skill teaches tracking system identifiers for follow-up actions. 
+ Without skill, agents list systems without identifiers for remediation.""" + c = read_report().lower() + assert any(t in c for t in [ + "system id", "system_id", "system_uuid", "uuid", "identifier", + ]) and any(t in c for t in [ + "remediat", "follow-up", "subsequent", "action", "track", + ]), ( + "should track system identifiers for follow-up remediation actions" + ) + + def test_remediation_transition_offer(self): + """Skill: Offer transition to a remediation workflow for CVE remediation.""" + c = read_report().lower() + assert any(t in c for t in [ + "next step", "remediate", "playbook", + "remediation workflow", "remediation action", + ]), "should offer next steps for remediation" + + def test_classification_criteria_reference(self): + """Skill/docs teach consulting classification criteria or reference + documentation before interpreting vulnerability data. Without skill, + agents classify CVEs based on general knowledge alone.""" + c = read_report().lower() + assert any(t in c for t in [ + "classification criteria", "classification methodology", + "vulnerability classification", "cve classification", + ]) or ( + "classification" in c and any(t in c for t in [ + "consult", "reference", "methodology", "criteria", + ]) + ), "should reference CVE classification criteria or methodology" diff --git a/evaluation/with_skills/rh-sre__job-template-creator/environment/Dockerfile b/evaluation/with_skills/rh-sre__job-template-creator/environment/Dockerfile new file mode 100644 index 00000000..d5c9e7b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/environment/Dockerfile @@ -0,0 +1,56 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s 
/root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + }, \ + "aap-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-aap-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-aap-mcp.py b/evaluation/with_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-aap-mcp.py new file mode 100644 index 00000000..d8ae4fd5 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-aap-mcp.py @@ -0,0 +1,1048 @@ +#!/usr/bin/env python3 +""" +Mock AAP (Ansible Automation Platform) MCP Server + +Simulates the AAP MCP gateway for per-skill evaluation tasks. 
Implements +the full set of tools used by rh-sre skills: + - job_templates_list / job_templates_retrieve + - projects_list + - job_templates_launch_retrieve + - jobs_retrieve / jobs_stdout_retrieve + - jobs_job_events_list / jobs_job_host_summaries_list + - jobs_relaunch_retrieve + - inventories_list / hosts_list + +Data mirrors a realistic AAP deployment: + - 6 job templates (3 remediation, 1 compliance, 1 patching, 1 reporting) + - 3 projects (remediation, compliance, reporting) + - 3 inventories (production 30 hosts, staging 15 hosts, all-managed 63 hosts) + - 6 recent jobs with varied statuses + +Follows the same mock-server pattern as mock-lightspeed-mcp.py. +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +mcp = FastMCP("aap-mcp") + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +def _ts(delta: timedelta) -> str: + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +# --------------------------------------------------------------------------- +# Mock data: Projects +# --------------------------------------------------------------------------- + +MOCK_PROJECTS = [ + { + "id": 6, + "type": "project", + "name": "Remediation Playbooks", + "description": "CVE and security remediation playbooks managed via Git", + "scm_type": "git", + "scm_url": "https://github.com/org/remediation-playbooks.git", + "scm_branch": "main", + "scm_revision": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2", + "status": "successful", + "last_job_run": _ts(timedelta(hours=2)), + "last_update_failed": False, + "created": _ts(timedelta(days=90)), + "modified": _ts(timedelta(hours=2)), + }, + { + "id": 7, + "type": "project", + "name": "Compliance Checks", + "description": "STIG and CIS compliance scanning playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/compliance-playbooks.git", + "scm_branch": "main", + "scm_revision": 
"b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3", + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "last_update_failed": False, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 8, + "type": "project", + "name": "Fleet Reporting", + "description": "System inventory and health reporting playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/fleet-reports.git", + "scm_branch": "main", + "scm_revision": "c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", + "status": "successful", + "last_job_run": _ts(timedelta(days=3)), + "last_update_failed": False, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=3)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Inventories & Hosts +# --------------------------------------------------------------------------- + +MOCK_INVENTORIES = [ + { + "id": 1, + "type": "inventory", + "name": "Production Systems", + "description": "All production RHEL systems across data centers", + "total_hosts": 30, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 5, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 2, + "type": "inventory", + "name": "Staging Systems", + "description": "Pre-production staging environment", + "total_hosts": 15, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 3, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=300)), + "modified": _ts(timedelta(days=7)), + }, + { + "id": 3, + "type": "inventory", + "name": "All Managed Systems", + "description": "Complete fleet: production, staging, development, QA, legacy", + "total_hosts": 63, + "has_active_failures": True, + "hosts_with_active_failures": 2, + 
"total_groups": 8, + "groups_with_active_failures": 1, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(hours=6)), + }, +] + + +def _generate_hosts(inventory_id: int) -> list[dict]: + """Generate realistic hosts for an inventory.""" + hosts: list[dict] = [] + if inventory_id == 1: + roles = ["web", "db", "app", "lb", "monitoring", "cache"] + for i, role in enumerate(roles): + for j in range(5 if role in ("web", "app") else 4 if role == "db" else 3 if role == "monitoring" else 2): + hosts.append({ + "id": len(hosts) + 1, + "type": "host", + "name": f"{role}-{j+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "production", "role": "{role}"}}', + }) + if len(hosts) >= 30: + break + if len(hosts) >= 30: + break + elif inventory_id == 2: + for i in range(15): + role = ["web", "db", "app"][i % 3] + hosts.append({ + "id": 100 + i, + "type": "host", + "name": f"{role}-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging", "role": "{role}"}}', + }) + elif inventory_id == 3: + for i in range(30): + hosts.append({ + "id": 200 + i, + "type": "host", + "name": f"host-{i+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": i in (45, 58), + "variables": f'{{"rhel_version": "9.3", "environment": "production"}}', + }) + for i in range(15): + hosts.append({ + "id": 230 + i, + "type": "host", + "name": f"host-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging"}}', + }) + for i in range(10): + hosts.append({ + "id": 245 + i, + "type": "host", + "name": f"dev-{i+1:02d}.dev.example.com", + "inventory": 
inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "8.9", "environment": "development"}}', + }) + for i in range(5): + hosts.append({ + "id": 255 + i, + "type": "host", + "name": f"qa-{i+1:02d}.qa.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.2", "environment": "qa"}}', + }) + for i in range(3): + hosts.append({ + "id": 260 + i, + "type": "host", + "name": f"legacy-{i+1:02d}.corp.example.com", + "inventory": inventory_id, + "enabled": i < 2, + "has_active_failures": i == 2, + "variables": f'{{"rhel_version": "7.9", "environment": "legacy"}}', + }) + return hosts + + +# --------------------------------------------------------------------------- +# Mock data: Job Templates +# --------------------------------------------------------------------------- + +MOCK_JOB_TEMPLATES = [ + { + "id": 10, + "type": "job_template", + "name": "CVE Remediation - Kernel Update", + "description": "Kernel update with boom snapshot for rollback safety", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "successful", + "last_job_run": _ts(timedelta(hours=4)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1001, "status": "successful", "finished": _ts(timedelta(hours=4))}, + }, + "created": _ts(timedelta(days=60)), + "modified": _ts(timedelta(days=2)), + }, + { + "id": 11, + "type": "job_template", + "name": "CVE Remediation - Package 
Update", + "description": "General package update for CVE remediation with needs-restarting check", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "check", + "verbosity": 1, + "timeout": 1800, + "forks": 10, + "status": "successful", + "last_job_run": _ts(timedelta(hours=12)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1005, "status": "successful", "finished": _ts(timedelta(hours=12))}, + }, + "created": _ts(timedelta(days=45)), + "modified": _ts(timedelta(days=5)), + }, + { + "id": 12, + "type": "job_template", + "name": "CVE Remediation - Generic", + "description": "Generic CVE remediation template for ad-hoc patches", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-remediation.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "never updated", + "last_job_run": None, + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + }, + "created": _ts(timedelta(days=30)), + "modified": _ts(timedelta(days=30)), + }, + { + "id": 20, + "type": "job_template", + "name": "Compliance Check - STIG", + "description": "Run STIG compliance scan across fleet", + "inventory": 3, + "project": 7, + 
"playbook": "playbooks/compliance/check-all.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": False, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 7200, + "forks": 20, + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "summary_fields": { + "project": {"id": 7, "name": "Compliance Checks", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 2, "name": "compliance-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1010, "status": "successful", "finished": _ts(timedelta(days=1))}, + }, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=14)), + }, + { + "id": 25, + "type": "job_template", + "name": "Emergency Patching", + "description": "Emergency patch application — NO become enabled (misconfigured)", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": False, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 600, + "forks": 25, + "status": "failed", + "last_job_run": _ts(timedelta(days=7)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1020, "status": "failed", "finished": _ts(timedelta(days=7))}, + }, + "created": _ts(timedelta(days=200)), + "modified": _ts(timedelta(days=200)), + }, + { + "id": 30, + "type": "job_template", + "name": "Fleet Health Report", + "description": "Generate fleet health and inventory report", + "inventory": 3, + "project": 8, + "playbook": 
"playbooks/reporting/fleet-health.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": True, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 1800, + "forks": 30, + "status": "successful", + "last_job_run": _ts(timedelta(hours=6)), + "summary_fields": { + "project": {"id": 8, "name": "Fleet Reporting", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1025, "status": "successful", "finished": _ts(timedelta(hours=6))}, + }, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=14)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Jobs (recent runs) +# --------------------------------------------------------------------------- + +PROD_HOSTS = [ + "web-01.prod.example.com", + "web-02.prod.example.com", + "db-01.prod.example.com", + "db-02.prod.example.com", + "app-01.prod.example.com", + "app-02.prod.example.com", +] + +MOCK_JOBS = [ + { + "id": 1001, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "check", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=4, minutes=30)), + "finished": _ts(timedelta(hours=4)), + "elapsed": 1800.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1002, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "run", + "status": 
"successful", + "failed": False, + "started": _ts(timedelta(hours=3, minutes=45)), + "finished": _ts(timedelta(hours=3)), + "elapsed": 2700.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1005, + "type": "job", + "name": "CVE Remediation - Package Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=12, minutes=20)), + "finished": _ts(timedelta(hours=12)), + "elapsed": 1200.0, + "job_template": 11, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "limit": "", + "extra_vars": '{"target_cve": "CVE-2024-54321"}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 11, "name": "CVE Remediation - Package Update"}, + }, + }, + { + "id": 1010, + "type": "job", + "name": "Compliance Check - STIG", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(days=1, hours=2)), + "finished": _ts(timedelta(days=1)), + "elapsed": 7200.0, + "job_template": 20, + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 20, "name": "Compliance Check - STIG"}, + }, + }, + { + "id": 1020, + "type": "job", + "name": "Emergency Patching", + "job_type": "run", + "status": "failed", + "failed": True, + "started": _ts(timedelta(days=7, hours=1)), + "finished": _ts(timedelta(days=7)), + "elapsed": 3600.0, + "job_template": 25, + "inventory": 1, + "project": 6, + "playbook": 
"playbooks/remediation/emergency-patch.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 25, "name": "Emergency Patching"}, + }, + }, + { + "id": 1025, + "type": "job", + "name": "Fleet Health Report", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=6, minutes=30)), + "finished": _ts(timedelta(hours=6)), + "elapsed": 1800.0, + "job_template": 30, + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 30, "name": "Fleet Health Report"}, + }, + }, +] + +_next_job_id = 2000 + + +# --------------------------------------------------------------------------- +# Mock stdout generators +# --------------------------------------------------------------------------- + +def _generate_stdout(job: dict) -> str: + """Generate realistic Ansible playbook stdout for a job.""" + playbook_name = job.get("name", "Unknown") + job_type = job.get("job_type", "run") + status = job.get("status", "successful") + limit = job.get("limit", "") + hosts = limit.split(",") if limit else PROD_HOSTS[:3] + hosts = [h.strip() for h in hosts if h.strip()] + extra_vars = job.get("extra_vars", "{}") + mode = " (CHECK MODE)" if job_type == "check" else "" + + lines = [] + lines.append(f"PLAY [{playbook_name}] *****") + lines.append("") + + lines.append(f"TASK [Gathering Facts{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}]") + lines.append("") + + if "kernel" in playbook_name.lower(): + lines.append(f"TASK [Create boom snapshot for rollback{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}] => {{\"msg\": \"boom create --title pre-remediation-CVE-2024-12345\"}}") + lines.append("") + + lines.append(f"TASK [Check disk space for kernel update{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}] => {{\"msg\": \"Disk 
space OK: 45% used\"}}") + lines.append("") + + lines.append(f"TASK [Update kernel package{mode}] *****") + for h in hosts: + result = "changed" if status == "successful" else "fatal" + if result == "changed": + lines.append(f'changed: [{h}] => {{"msg": "kernel-5.14.0-362.24.1.el9_3 -> kernel-5.14.0-362.24.2.el9_3"}}') + else: + lines.append(f'fatal: [{h}]: FAILED! => {{"msg": "Permission denied", "rc": 1}}') + lines.append("") + + lines.append(f"TASK [Check if reboot is needed (needs-restarting -r){mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"rc": 1, "msg": "Reboot is required to fully utilize updates."}}') + lines.append("") + + elif "package" in playbook_name.lower(): + lines.append(f"TASK [Update target packages for CVE remediation{mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"msg": "httpd-2.4.53-7.el9 -> httpd-2.4.57-8.el9"}}') + lines.append("") + + lines.append(f"TASK [Restart affected services{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append(f"TASK [Verify service health{mode}] *****") + for h in hosts: + lines.append(f'ok: [{h}] => {{"msg": "Service httpd is running"}}') + lines.append("") + + elif "emergency" in playbook_name.lower() and status == "failed": + lines.append(f"TASK [Apply emergency patch{mode}] *****") + for h in hosts: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Missing sudo password (become_enabled not set)", "rc": 1}}') + lines.append("") + lines.append("NO MORE HOSTS LEFT *****") + lines.append("") + + else: + lines.append(f"TASK [Execute playbook tasks{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append("PLAY RECAP *****") + for h in hosts: + if status == "successful": + ok_count = random.randint(3, 6) + changed_count = random.randint(1, 3) + lines.append(f"{h:<45} : ok={ok_count} changed={changed_count} unreachable=0 failed=0 skipped=0 rescued=0 ignored=0") + elif status == "failed": + lines.append(f"{h:<45} : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0") + lines.append("") + + return "\n".join(lines) + + +def _generate_events(job: dict) -> list[dict]: + """Generate realistic Ansible task events for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + events: list[dict] = [] + eid = 1 + + task_names = ["Gathering Facts"] + if "kernel" in job.get("name", "").lower(): + task_names += [ + "Create boom snapshot for rollback", + "Check disk space for kernel update", + "Update kernel package", + "Check if reboot is needed (needs-restarting -r)", + ] + elif "package" in job.get("name", "").lower(): + task_names += [ + "Update target packages for CVE remediation", + "Restart affected services", + "Verify service health", + ] + else: + task_names += ["Execute playbook tasks"] + + for task_name in task_names: + for host in hosts: + is_failed = job.get("status") == "failed" and task_name != "Gathering Facts" + events.append({ + "id": eid, + "type": "job_event", + "event": "runner_on_ok" if not is_failed else "runner_on_failed", + "task": task_name, + "host": host, + "host_name": host, + "play": job.get("name", ""), + "changed": task_name != "Gathering Facts" and not is_failed, + "failed": is_failed, + "event_data": { + "task": task_name, + "host": 
host, + "res": { + "changed": task_name != "Gathering Facts" and not is_failed, + "msg": "Task completed" if not is_failed else "Permission denied", + }, + }, + "created": _ts(timedelta(hours=4, minutes=30 - eid)), + }) + eid += 1 + + return events + + +def _generate_host_summaries(job: dict) -> list[dict]: + """Generate per-host summaries for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + summaries: list[dict] = [] + + for i, host in enumerate(hosts): + is_failed = job.get("status") == "failed" + summaries.append({ + "id": i + 1, + "type": "job_host_summary", + "host": i + 1, + "host_name": host, + "ok": 1 if is_failed else random.randint(3, 6), + "changed": 0 if is_failed else random.randint(1, 3), + "dark": 0, + "failures": 1 if is_failed else 0, + "skipped": 0, + "processed": 1, + "failed": is_failed, + }) + + return summaries + + +# --------------------------------------------------------------------------- +# MCP Tools: Job Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def job_templates_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available job templates in AAP. + + Args: + page_size: Number of results per page (default 10, max 200). + search: Optional search string to filter templates by name. + """ + results = MOCK_JOB_TEMPLATES + if search: + s = search.lower() + results = [t for t in results if s in t["name"].lower() or s in t.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_retrieve(id: str) -> dict: + """Retrieve detailed information about a specific job template. + + Args: + id: Job template ID (as string). 
+ """ + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. Job template {id} does not exist."} + return template + + +@mcp.tool() +def projects_list( + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List available projects in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter projects by name. + """ + results = MOCK_PROJECTS + if search: + s = search.lower() + results = [p for p in results if s in p["name"].lower() or s in p.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_launch_retrieve( + id: str, + requestBody: Optional[dict] = None, +) -> dict: + """Launch a job from a job template. + + Args: + id: Job template ID to launch. + requestBody: Optional launch parameters including job_type ('run' or 'check'), + extra_vars (dict), and limit (comma-separated host list). + """ + global _next_job_id + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. 
Job template {id} does not exist."} + + body = requestBody or {} + job_type = body.get("job_type", template.get("job_type", "run")) + + if not template.get("ask_job_type_on_launch") and job_type != template.get("job_type"): + return { + "error": f"Cannot override job_type: ask_job_type_on_launch is disabled on template {id}", + } + + job_id = _next_job_id + _next_job_id += 1 + + new_job = { + "id": job_id, + "type": "job", + "name": template["name"], + "job_type": job_type, + "status": "pending", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": None, + "elapsed": 0.0, + "job_template": tid, + "inventory": template["inventory"], + "project": template["project"], + "playbook": template["playbook"], + "limit": body.get("limit", ""), + "extra_vars": str(body.get("extra_vars", {})), + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": tid, "name": template["name"]}, + }, + } + MOCK_JOBS.append(new_job) + + # Simulate job completion after launch + new_job["status"] = "successful" + new_job["finished"] = _ts(timedelta(seconds=-300)) + new_job["elapsed"] = 300.0 + + return { + "job": job_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{job_id}/", + "related": { + "stdout": f"/api/controller/v2/jobs/{job_id}/stdout/", + "job_events": f"/api/controller/v2/jobs/{job_id}/job_events/", + "job_host_summaries": f"/api/controller/v2/jobs/{job_id}/job_host_summaries/", + }, + } + + +@mcp.tool() +def jobs_retrieve(id: int) -> dict: + """Get the status and details of a job run. + + Args: + id: Job ID to retrieve. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return job + + +@mcp.tool() +def jobs_list(page_size: int = 10) -> dict: + """List recent job runs. + + Args: + page_size: Number of results to return. 
+ """ + results = sorted(MOCK_JOBS, key=lambda j: j.get("started", ""), reverse=True) + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def jobs_stdout_retrieve(id: int, format: str = "txt") -> dict: + """Get the stdout (console output) from a job run. + + Args: + id: Job ID. + format: Output format ('txt' or 'json'). Default 'txt'. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return { + "content": _generate_stdout(job), + "range": {"start": 0, "end": 1}, + } + + +@mcp.tool() +def jobs_job_events_list(id: int, page_size: int = 50) -> dict: + """Get task-level events for a job run. + + Args: + id: Job ID. + page_size: Number of events to return. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + events = _generate_events(job) + return { + "count": len(events), + "next": None, + "previous": None, + "results": events[:page_size], + } + + +@mcp.tool() +def jobs_job_host_summaries_list(id: int) -> dict: + """Get per-host execution summaries for a job run. + + Args: + id: Job ID. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + summaries = _generate_host_summaries(job) + return { + "count": len(summaries), + "next": None, + "previous": None, + "results": summaries, + } + + +@mcp.tool() +def jobs_relaunch_retrieve( + id: int, + hosts: str = "all", + job_type: str = "run", +) -> dict: + """Relaunch a previously completed or failed job. + + Args: + id: Original job ID to relaunch. + hosts: Which hosts to target ('all' or 'failed'). + job_type: Job type for relaunch ('run' or 'check'). 
+ """ + global _next_job_id + original = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not original: + return {"detail": f"Not found. Job {id} does not exist."} + + new_id = _next_job_id + _next_job_id += 1 + + new_job = { + **original, + "id": new_id, + "job_type": job_type, + "status": "successful", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": _ts(timedelta(seconds=-300)), + "elapsed": 300.0, + "launch_type": "relaunch", + } + MOCK_JOBS.append(new_job) + + return { + "job": new_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{new_id}/", + } + + +# --------------------------------------------------------------------------- +# MCP Tools: Inventory Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def inventories_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available inventories in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter inventories. + """ + results = MOCK_INVENTORIES + if search: + s = search.lower() + results = [inv for inv in results if s in inv["name"].lower() or s in inv.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def hosts_list( + inventory_id: Optional[int] = None, + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List hosts in an inventory. + + Args: + inventory_id: Filter by inventory ID. If not provided, lists hosts from all inventories. + page_size: Number of results per page. + search: Optional search string to filter hosts by name. 
+ """ + inv_id = inventory_id or 1 + hosts = _generate_hosts(inv_id) + if search: + s = search.lower() + hosts = [h for h in hosts if s in h["name"].lower()] + return { + "count": len(hosts), + "next": None if len(hosts) <= page_size else "/api/controller/v2/hosts/?page=2", + "previous": None, + "results": hosts[:page_size], + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools.
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers (sys-015..sys-020 in MOCK_SYSTEMS) + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers (sys-038..sys-040 in MOCK_SYSTEMS) + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently."
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-045-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__job-template-creator/instruction.md b/evaluation/with_skills/rh-sre__job-template-creator/instruction.md new file mode 100644 index 00000000..77c24f2b --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/instruction.md @@ -0,0 +1,17 @@ +# Job Template Creator Task + +You are a Red Hat SRE. A remediation playbook has been written for a critical CVE, and you need to set up an Ansible Automation Platform job template so the team can run it against affected systems. + +## Scenario +The security team delivered a remediation playbook for CVE-2026-1234. You need to create a job template in AAP that the operations team can use to run this playbook against production hosts. 
+ +## Requirements +- Check which projects and inventories are available in AAP +- Determine the correct project, inventory, and credentials for the remediation playbook +- Document the job template configuration: name, playbook path, inventory, project, credentials, and execution settings (privilege escalation, variable prompts, limit prompts) +- Explain any decisions about template settings (e.g., why `become` is enabled, whether to prompt for variables at launch) +- If template creation requires manual steps (e.g., via the AAP Web UI), document those steps clearly + +Document your methodology, plan, and configuration in `/root/report.md`. + +Use MCP tools to query AAP. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__job-template-creator/solution/solve.sh b/evaluation/with_skills/rh-sre__job-template-creator/solution/solve.sh new file mode 100644 index 00000000..ec9c5b02 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/solution/solve.sh @@ -0,0 +1,19 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Job Template Creation + +## Template Fields +- Inventory: production-systems +- Project: remediation-playbooks +- Playbook: playbooks/remediation/cve-2024-12345.yml +- Credentials: machine-credential +- become_enabled: true + +## Prompt on Launch +- Job Type (REQUIRED for dry-run + run) +- Variables +- Limit + +## Note +No job_templates_create API in AAP MCP. Create via Web UI. Execute mcp-aap-validator before operations. 
+REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__job-template-creator/task.toml b/evaluation/with_skills/rh-sre__job-template-creator/task.toml new file mode 100644 index 00000000..bc2620fa --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__job-template-creator" +name = "rh-sre AAP Job Template Creation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "job-template-creator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__job-template-creator/tests/llm_judge.py b/evaluation/with_skills/rh-sre__job-template-creator/tests/llm_judge.py new file mode 100644 index 00000000..54c93ce1 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "no_create_tool", "file": "/root/report.md", "question": "Does the report acknowledge that AAP MCP has no create/update tools and template creation must be done via Web UI?", "reference": "A skilled report notes the MCP limitation and directs to Web UI. An unskilled report attempts to create templates via API."}, + {"id": "playbook_path_and_git", "file": "/root/report.md", "question": "Does the report require the playbook to be in a Git repo with proper path convention before template creation?", "reference": "A skilled report follows playbooks/remediation/ path convention. 
An unskilled report skips Git integration."}, + {"id": "launch_configuration", "file": "/root/report.md", "question": "Does the report configure prompt-on-launch for job type and privilege escalation?", "reference": "A skilled report enables prompt-on-launch and become_enabled. An unskilled report skips these configuration details."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__job-template-creator/tests/test.sh b/evaluation/with_skills/rh-sre__job-template-creator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__job-template-creator/tests/test_outputs.py b/evaluation/with_skills/rh-sre__job-template-creator/tests/test_outputs.py new file mode 100644 index 00000000..53140085 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-creator/tests/test_outputs.py @@ -0,0 +1,98 @@ +""" +Tests for rh-sre__job-template-creator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['job template', 'template', 'ansible']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_git_before_template(self): + """Skill: Playbook must be in Git repo before template creation; AAP syncs from project.""" + c = read_report().lower() + has_git = any(t in c for t in ["git", "commit", "push", "repository", "sync"]) + has_project = any(t in c for t in ["project", "scm", "sync"]) + assert has_git or has_project, ( + "should add playbook to Git before template (skill: Phase 1)" + ) + + def test_manual_creation_required(self): + """Skill teaches template creation requires manual steps (e.g., Web UI) + because the automation API is read-only for templates.""" + c = read_report().lower() + assert any(t in c for t in [ + "web ui", "manual", "read-only", "cannot create", + "no create", "gui", "interface", + ]), "should acknowledge template creation requires manual steps" + + def 
test_playbook_path_convention(self): + """Skill teaches following a consistent directory structure or location + convention for remediation playbooks.""" + c = read_report().lower() + assert any(t in c for t in [ + "playbook path", "remediation playbook", "playbook location", + "playbook directory", "playbook structure", + ]), "should follow a playbook path convention for remediation" + + def test_privilege_escalation_required(self): + """Skill: become_enabled required for remediation (package updates).""" + c = read_report().lower() + assert any(t in c for t in ["privilege", "become", "sudo", "escalat", "root"]), ( + "should require privilege escalation (skill: required for package updates)" + ) + + def test_launch_prompts(self): + """Skill: Prompt on Launch for Job Type, Variables, Limit.""" + c = read_report().lower() + assert any(t in c for t in ["launch", "prompt", "variable", "limit", "job type"]), ( + "should configure prompt on launch (skill: Phase 4)" + ) + + def test_configurable_variables(self): + """Docs teach configuring variables for CVE targeting, remediation mode, + and post-remediation verification. Without docs, agents skip variable design.""" + c = read_report().lower() + concepts = sum(1 for t in [ + "target_cve", "cve", "remediation_mode", "mode", + "verify_after", "verification", "extra_var", "extra var", + "variable", "parameter", + ] if t in c) + assert concepts >= 3, ( + "should define configurable variables for CVE targeting, " + "remediation mode, and verification" + ) + + def test_version_control_sync(self): + """Skill teaches AAP projects sync playbooks from version control. 
+ Without skill, agents describe playbook management without + version-control-backed project sync.""" + c = read_report().lower() + assert any(t in c for t in [ + "scm", "source control", "version control", + "repository sync", "git-backed", "git sync", + ]), "should reference version control sync for AAP project playbooks" diff --git a/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/Dockerfile b/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/Dockerfile new file mode 100644 index 00000000..d5c9e7b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/Dockerfile @@ -0,0 +1,56 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + }, \ + "aap-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-aap-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git 
a/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-aap-mcp.py b/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-aap-mcp.py new file mode 100644 index 00000000..d8ae4fd5 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-aap-mcp.py @@ -0,0 +1,1048 @@ +#!/usr/bin/env python3 +""" +Mock AAP (Ansible Automation Platform) MCP Server + +Simulates the AAP MCP gateway for per-skill evaluation tasks. Implements +the full set of tools used by rh-sre skills: + - job_templates_list / job_templates_retrieve + - projects_list + - job_templates_launch_retrieve + - jobs_retrieve / jobs_stdout_retrieve + - jobs_job_events_list / jobs_job_host_summaries_list + - jobs_relaunch_retrieve + - inventories_list / hosts_list + +Data mirrors a realistic AAP deployment: + - 6 job templates (3 remediation, 1 compliance, 1 patching, 1 reporting) + - 3 projects (remediation, compliance, reporting) + - 3 inventories (production 30 hosts, staging 15 hosts, all-managed 63 hosts) + - 12 recent jobs with varied statuses + +Follows the same mock-server pattern as mock-lightspeed-mcp.py. 
+""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +mcp = FastMCP("aap-mcp") + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +def _ts(delta: timedelta) -> str: + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +# --------------------------------------------------------------------------- +# Mock data: Projects +# --------------------------------------------------------------------------- + +MOCK_PROJECTS = [ + { + "id": 6, + "type": "project", + "name": "Remediation Playbooks", + "description": "CVE and security remediation playbooks managed via Git", + "scm_type": "git", + "scm_url": "https://github.com/org/remediation-playbooks.git", + "scm_branch": "main", + "scm_revision": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2", + "status": "successful", + "last_job_run": _ts(timedelta(hours=2)), + "last_update_failed": False, + "created": _ts(timedelta(days=90)), + "modified": _ts(timedelta(hours=2)), + }, + { + "id": 7, + "type": "project", + "name": "Compliance Checks", + "description": "STIG and CIS compliance scanning playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/compliance-playbooks.git", + "scm_branch": "main", + "scm_revision": "b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3", + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "last_update_failed": False, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 8, + "type": "project", + "name": "Fleet Reporting", + "description": "System inventory and health reporting playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/fleet-reports.git", + "scm_branch": "main", + "scm_revision": "c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", + "status": "successful", + "last_job_run": _ts(timedelta(days=3)), + "last_update_failed": False, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=3)), + }, +] + +# 
--------------------------------------------------------------------------- +# Mock data: Inventories & Hosts +# --------------------------------------------------------------------------- + +MOCK_INVENTORIES = [ + { + "id": 1, + "type": "inventory", + "name": "Production Systems", + "description": "All production RHEL systems across data centers", + "total_hosts": 30, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 5, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 2, + "type": "inventory", + "name": "Staging Systems", + "description": "Pre-production staging environment", + "total_hosts": 15, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 3, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=300)), + "modified": _ts(timedelta(days=7)), + }, + { + "id": 3, + "type": "inventory", + "name": "All Managed Systems", + "description": "Complete fleet: production, staging, development, QA, legacy", + "total_hosts": 63, + "has_active_failures": True, + "hosts_with_active_failures": 2, + "total_groups": 8, + "groups_with_active_failures": 1, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(hours=6)), + }, +] + + +def _generate_hosts(inventory_id: int) -> list[dict]: + """Generate realistic hosts for an inventory.""" + hosts: list[dict] = [] + if inventory_id == 1: + roles = ["web", "db", "app", "lb", "monitoring", "cache"] + for i, role in enumerate(roles): + for j in range(5 if role in ("web", "app") else 4 if role == "db" else 3 if role == "monitoring" else 2): + hosts.append({ + "id": len(hosts) + 1, + "type": "host", + "name": f"{role}-{j+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + 
"has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "production", "role": "{role}"}}', + }) + if len(hosts) >= 30: + break + if len(hosts) >= 30: + break + elif inventory_id == 2: + for i in range(15): + role = ["web", "db", "app"][i % 3] + hosts.append({ + "id": 100 + i, + "type": "host", + "name": f"{role}-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging", "role": "{role}"}}', + }) + elif inventory_id == 3: + for i in range(30): + hosts.append({ + "id": 200 + i, + "type": "host", + "name": f"host-{i+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": i in (45, 58), + "variables": f'{{"rhel_version": "9.3", "environment": "production"}}', + }) + for i in range(15): + hosts.append({ + "id": 230 + i, + "type": "host", + "name": f"host-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging"}}', + }) + for i in range(10): + hosts.append({ + "id": 245 + i, + "type": "host", + "name": f"dev-{i+1:02d}.dev.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "8.9", "environment": "development"}}', + }) + for i in range(5): + hosts.append({ + "id": 255 + i, + "type": "host", + "name": f"qa-{i+1:02d}.qa.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.2", "environment": "qa"}}', + }) + for i in range(3): + hosts.append({ + "id": 260 + i, + "type": "host", + "name": f"legacy-{i+1:02d}.corp.example.com", + "inventory": inventory_id, + "enabled": i < 2, + "has_active_failures": i == 2, + "variables": f'{{"rhel_version": "7.9", "environment": "legacy"}}', + }) + return hosts + + +# 
--------------------------------------------------------------------------- +# Mock data: Job Templates +# --------------------------------------------------------------------------- + +MOCK_JOB_TEMPLATES = [ + { + "id": 10, + "type": "job_template", + "name": "CVE Remediation - Kernel Update", + "description": "Kernel update with boom snapshot for rollback safety", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "successful", + "last_job_run": _ts(timedelta(hours=4)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1001, "status": "successful", "finished": _ts(timedelta(hours=4))}, + }, + "created": _ts(timedelta(days=60)), + "modified": _ts(timedelta(days=2)), + }, + { + "id": 11, + "type": "job_template", + "name": "CVE Remediation - Package Update", + "description": "General package update for CVE remediation with needs-restarting check", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "check", + "verbosity": 1, + "timeout": 1800, + "forks": 10, + "status": "successful", + "last_job_run": _ts(timedelta(hours=12)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, 
"name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1005, "status": "successful", "finished": _ts(timedelta(hours=12))}, + }, + "created": _ts(timedelta(days=45)), + "modified": _ts(timedelta(days=5)), + }, + { + "id": 12, + "type": "job_template", + "name": "CVE Remediation - Generic", + "description": "Generic CVE remediation template for ad-hoc patches", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-remediation.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "never updated", + "last_job_run": None, + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + }, + "created": _ts(timedelta(days=30)), + "modified": _ts(timedelta(days=30)), + }, + { + "id": 20, + "type": "job_template", + "name": "Compliance Check - STIG", + "description": "Run STIG compliance scan across fleet", + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": False, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 7200, + "forks": 20, + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "summary_fields": { + "project": {"id": 7, "name": "Compliance Checks", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 2, "name": "compliance-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1010, "status": "successful", "finished": _ts(timedelta(days=1))}, + }, + 
"created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=14)), + }, + { + "id": 25, + "type": "job_template", + "name": "Emergency Patching", + "description": "Emergency patch application — NO become enabled (misconfigured)", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": False, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 600, + "forks": 25, + "status": "failed", + "last_job_run": _ts(timedelta(days=7)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1020, "status": "failed", "finished": _ts(timedelta(days=7))}, + }, + "created": _ts(timedelta(days=200)), + "modified": _ts(timedelta(days=200)), + }, + { + "id": 30, + "type": "job_template", + "name": "Fleet Health Report", + "description": "Generate fleet health and inventory report", + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": True, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 1800, + "forks": 30, + "status": "successful", + "last_job_run": _ts(timedelta(hours=6)), + "summary_fields": { + "project": {"id": 8, "name": "Fleet Reporting", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1025, "status": "successful", "finished": _ts(timedelta(hours=6))}, + }, + "created": _ts(timedelta(days=120)), + 
"modified": _ts(timedelta(days=14)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Jobs (recent runs) +# --------------------------------------------------------------------------- + +PROD_HOSTS = [ + "web-01.prod.example.com", + "web-02.prod.example.com", + "db-01.prod.example.com", + "db-02.prod.example.com", + "app-01.prod.example.com", + "app-02.prod.example.com", +] + +MOCK_JOBS = [ + { + "id": 1001, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "check", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=4, minutes=30)), + "finished": _ts(timedelta(hours=4)), + "elapsed": 1800.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1002, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=3, minutes=45)), + "finished": _ts(timedelta(hours=3)), + "elapsed": 2700.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1005, + "type": "job", + "name": "CVE Remediation - Package Update", + "job_type": "run", + "status": "successful", + "failed": False, + 
"started": _ts(timedelta(hours=12, minutes=20)), + "finished": _ts(timedelta(hours=12)), + "elapsed": 1200.0, + "job_template": 11, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "limit": "", + "extra_vars": '{"target_cve": "CVE-2024-54321"}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 11, "name": "CVE Remediation - Package Update"}, + }, + }, + { + "id": 1010, + "type": "job", + "name": "Compliance Check - STIG", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(days=1, hours=2)), + "finished": _ts(timedelta(days=1)), + "elapsed": 7200.0, + "job_template": 20, + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 20, "name": "Compliance Check - STIG"}, + }, + }, + { + "id": 1020, + "type": "job", + "name": "Emergency Patching", + "job_type": "run", + "status": "failed", + "failed": True, + "started": _ts(timedelta(days=7, hours=1)), + "finished": _ts(timedelta(days=7)), + "elapsed": 3600.0, + "job_template": 25, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 25, "name": "Emergency Patching"}, + }, + }, + { + "id": 1025, + "type": "job", + "name": "Fleet Health Report", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=6, minutes=30)), + "finished": _ts(timedelta(hours=6)), + "elapsed": 1800.0, + "job_template": 30, + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 30, "name": "Fleet Health Report"}, + }, + }, +] + +_next_job_id = 2000 + + +# 
--------------------------------------------------------------------------- +# Mock stdout generators +# --------------------------------------------------------------------------- + +def _generate_stdout(job: dict) -> str: + """Generate realistic Ansible playbook stdout for a job.""" + playbook_name = job.get("name", "Unknown") + job_type = job.get("job_type", "run") + status = job.get("status", "successful") + limit = job.get("limit", "") + hosts = limit.split(",") if limit else PROD_HOSTS[:3] + hosts = [h.strip() for h in hosts if h.strip()] + extra_vars = job.get("extra_vars", "{}") + mode = " (CHECK MODE)" if job_type == "check" else "" + + lines = [] + lines.append(f"PLAY [{playbook_name}] *****") + lines.append("") + + lines.append(f"TASK [Gathering Facts{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}]") + lines.append("") + + if "kernel" in playbook_name.lower(): + lines.append(f"TASK [Create boom snapshot for rollback{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}] => {{\"msg\": \"boom create --title pre-remediation-CVE-2024-12345\"}}") + lines.append("") + + lines.append(f"TASK [Check disk space for kernel update{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}] => {{\"msg\": \"Disk space OK: 45% used\"}}") + lines.append("") + + lines.append(f"TASK [Update kernel package{mode}] *****") + for h in hosts: + result = "changed" if status == "successful" else "fatal" + if result == "changed": + lines.append(f'changed: [{h}] => {{"msg": "kernel-5.14.0-362.24.1.el9_3 -> kernel-5.14.0-362.24.2.el9_3"}}') + else: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Permission denied", "rc": 1}}') + lines.append("") + + lines.append(f"TASK [Check if reboot is needed (needs-restarting -r){mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"rc": 1, "msg": "Reboot is required to fully utilize updates."}}') + lines.append("") + + elif "package" in playbook_name.lower(): + lines.append(f"TASK [Update target packages for CVE remediation{mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"msg": "httpd-2.4.53-7.el9 -> httpd-2.4.57-8.el9"}}') + lines.append("") + + lines.append(f"TASK [Restart affected services{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append(f"TASK [Verify service health{mode}] *****") + for h in hosts: + lines.append(f'ok: [{h}] => {{"msg": "Service httpd is running"}}') + lines.append("") + + elif "emergency" in playbook_name.lower() and status == "failed": + lines.append(f"TASK [Apply emergency patch{mode}] *****") + for h in hosts: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Missing sudo password (become_enabled not set)", "rc": 1}}') + lines.append("") + lines.append("NO MORE HOSTS LEFT *****") + lines.append("") + + else: + lines.append(f"TASK [Execute playbook tasks{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append("PLAY RECAP *****") + for h in hosts: + if status == "successful": + ok_count = random.randint(3, 6) + changed_count = random.randint(1, 3) + lines.append(f"{h:<45} : ok={ok_count} changed={changed_count} unreachable=0 failed=0 skipped=0 rescued=0 ignored=0") + elif status == "failed": + lines.append(f"{h:<45} : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0") + lines.append("") + + return "\n".join(lines) + + +def _generate_events(job: dict) -> list[dict]: + """Generate realistic Ansible task events for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + events: list[dict] = [] + eid = 1 + + task_names = ["Gathering Facts"] + if "kernel" in job.get("name", "").lower(): + task_names += [ + "Create boom snapshot for rollback", + "Check disk space for kernel update", + "Update kernel package", + "Check if reboot is needed (needs-restarting -r)", + ] + elif "package" in job.get("name", "").lower(): + task_names += [ + "Update target packages for CVE remediation", + "Restart affected services", + "Verify service health", + ] + else: + task_names += ["Execute playbook tasks"] + + for task_name in task_names: + for host in hosts: + is_failed = job.get("status") == "failed" and task_name != "Gathering Facts" + events.append({ + "id": eid, + "type": "job_event", + "event": "runner_on_ok" if not is_failed else "runner_on_failed", + "task": task_name, + "host": host, + "host_name": host, + "play": job.get("name", ""), + "changed": task_name != "Gathering Facts" and not is_failed, + "failed": is_failed, + "event_data": { + "task": task_name, + "host": 
host, + "res": { + "changed": task_name != "Gathering Facts" and not is_failed, + "msg": "Task completed" if not is_failed else "Permission denied", + }, + }, + "created": _ts(timedelta(hours=4, minutes=30 - eid)), + }) + eid += 1 + + return events + + +def _generate_host_summaries(job: dict) -> list[dict]: + """Generate per-host summaries for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + summaries: list[dict] = [] + + for i, host in enumerate(hosts): + is_failed = job.get("status") == "failed" + summaries.append({ + "id": i + 1, + "type": "job_host_summary", + "host": i + 1, + "host_name": host, + "ok": 1 if is_failed else random.randint(3, 6), + "changed": 0 if is_failed else random.randint(1, 3), + "dark": 0, + "failures": 1 if is_failed else 0, + "skipped": 0, + "processed": 1, + "failed": is_failed, + }) + + return summaries + + +# --------------------------------------------------------------------------- +# MCP Tools: Job Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def job_templates_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available job templates in AAP. + + Args: + page_size: Number of results per page (default 10, max 200). + search: Optional search string to filter templates by name. + """ + results = MOCK_JOB_TEMPLATES + if search: + s = search.lower() + results = [t for t in results if s in t["name"].lower() or s in t.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_retrieve(id: str) -> dict: + """Retrieve detailed information about a specific job template. + + Args: + id: Job template ID (as string). 
+ """ + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. Job template {id} does not exist."} + return template + + +@mcp.tool() +def projects_list( + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List available projects in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter projects by name. + """ + results = MOCK_PROJECTS + if search: + s = search.lower() + results = [p for p in results if s in p["name"].lower() or s in p.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_launch_retrieve( + id: str, + requestBody: Optional[dict] = None, +) -> dict: + """Launch a job from a job template. + + Args: + id: Job template ID to launch. + requestBody: Optional launch parameters including job_type ('run' or 'check'), + extra_vars (dict), and limit (comma-separated host list). + """ + global _next_job_id + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. 
Job template {id} does not exist."} + + body = requestBody or {} + job_type = body.get("job_type", template.get("job_type", "run")) + + if not template.get("ask_job_type_on_launch") and job_type != template.get("job_type"): + return { + "error": f"Cannot override job_type: ask_job_type_on_launch is disabled on template {id}", + } + + job_id = _next_job_id + _next_job_id += 1 + + new_job = { + "id": job_id, + "type": "job", + "name": template["name"], + "job_type": job_type, + "status": "pending", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": None, + "elapsed": 0.0, + "job_template": tid, + "inventory": template["inventory"], + "project": template["project"], + "playbook": template["playbook"], + "limit": body.get("limit", ""), + "extra_vars": str(body.get("extra_vars", {})), + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": tid, "name": template["name"]}, + }, + } + MOCK_JOBS.append(new_job) + + # Simulate job completion after launch + new_job["status"] = "successful" + new_job["finished"] = _ts(timedelta(seconds=-300)) + new_job["elapsed"] = 300.0 + + return { + "job": job_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{job_id}/", + "related": { + "stdout": f"/api/controller/v2/jobs/{job_id}/stdout/", + "job_events": f"/api/controller/v2/jobs/{job_id}/job_events/", + "job_host_summaries": f"/api/controller/v2/jobs/{job_id}/job_host_summaries/", + }, + } + + +@mcp.tool() +def jobs_retrieve(id: int) -> dict: + """Get the status and details of a job run. + + Args: + id: Job ID to retrieve. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return job + + +@mcp.tool() +def jobs_list(page_size: int = 10) -> dict: + """List recent job runs. + + Args: + page_size: Number of results to return. 
+ """ + results = sorted(MOCK_JOBS, key=lambda j: j.get("started", ""), reverse=True) + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def jobs_stdout_retrieve(id: int, format: str = "txt") -> dict: + """Get the stdout (console output) from a job run. + + Args: + id: Job ID. + format: Output format ('txt' or 'json'). Default 'txt'. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return { + "content": _generate_stdout(job), + "range": {"start": 0, "end": 1}, + } + + +@mcp.tool() +def jobs_job_events_list(id: int, page_size: int = 50) -> dict: + """Get task-level events for a job run. + + Args: + id: Job ID. + page_size: Number of events to return. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + events = _generate_events(job) + return { + "count": len(events), + "next": None, + "previous": None, + "results": events[:page_size], + } + + +@mcp.tool() +def jobs_job_host_summaries_list(id: int) -> dict: + """Get per-host execution summaries for a job run. + + Args: + id: Job ID. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + summaries = _generate_host_summaries(job) + return { + "count": len(summaries), + "next": None, + "previous": None, + "results": summaries, + } + + +@mcp.tool() +def jobs_relaunch_retrieve( + id: int, + hosts: str = "all", + job_type: str = "run", +) -> dict: + """Relaunch a previously completed or failed job. + + Args: + id: Original job ID to relaunch. + hosts: Which hosts to target ('all' or 'failed'). + job_type: Job type for relaunch ('run' or 'check'). 
+ """ + global _next_job_id + original = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not original: + return {"detail": f"Not found. Job {id} does not exist."} + + new_id = _next_job_id + _next_job_id += 1 + + new_job = { + **original, + "id": new_id, + "job_type": job_type, + "status": "successful", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": _ts(timedelta(seconds=-300)), + "elapsed": 300.0, + "launch_type": "relaunch", + } + MOCK_JOBS.append(new_job) + + return { + "job": new_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{new_id}/", + } + + +# --------------------------------------------------------------------------- +# MCP Tools: Inventory Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def inventories_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available inventories in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter inventories. + """ + results = MOCK_INVENTORIES + if search: + s = search.lower() + results = [inv for inv in results if s in inv["name"].lower() or s in inv.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def hosts_list( + inventory_id: Optional[int] = None, + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List hosts in an inventory. + + Args: + inventory_id: Filter by inventory ID. If not provided, defaults to inventory 1 (Production Systems). + page_size: Number of results per page. + search: Optional search string to filter hosts by name. 
+ """ + inv_id = inventory_id or 1 + hosts = _generate_hosts(inv_id) + if search: + s = search.lower() + hosts = [h for h in hosts if s in h["name"].lower()] + return { + "count": len(hosts), + "next": None if len(hosts) <= page_size else "/api/controller/v2/hosts/?page=2", + "previous": None, + "results": hosts[:page_size], + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{40+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-045-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by FQDN substring (any * is stripped, not expanded). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__job-template-remediation-validator/instruction.md b/evaluation/with_skills/rh-sre__job-template-remediation-validator/instruction.md new file mode 100644 index 00000000..55b78ca1 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/instruction.md @@ -0,0 +1,18 @@ +# Job Template Validation Task + +You are a Red Hat SRE. Before running a CVE remediation playbook through AAP, you need to verify that the job template is correctly configured and safe to execute. + +## Scenario +The team wants to use an existing AAP job template to remediate a critical vulnerability. Before giving the green light, you need to confirm the template meets all requirements for a safe remediation run. 
+ +## Requirements +- Retrieve the job template configuration from AAP +- Verify required fields are set: inventory, project, playbook, credentials, and privilege escalation +- Check recommended settings: whether the template prompts for variables, limit, and inventory at launch +- Verify the referenced project and inventory actually exist in AAP +- Produce a pass/warn/fail assessment for each configuration item +- Summarize whether the template is ready for production remediation use + +Document your methodology, validation results, and assessment in `/root/report.md`. + +Use MCP tools to query AAP. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__job-template-remediation-validator/solution/solve.sh b/evaluation/with_skills/rh-sre__job-template-remediation-validator/solution/solve.sh new file mode 100644 index 00000000..6e9ff39d --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Job Template Validation + +## Required Checks +| Field | Expected | Status | +|-------|----------|--------| +| ask_job_type_on_launch | true | ✅ | +| become_enabled | true | ✅ | +| credentials | present | ✅ | +| inventory | present | ✅ | +| project | present | ✅ | +| playbook | present | ✅ | + +## Recommended +- ask_variables_on_launch: true +- ask_limit_on_launch: true + +## Overall +✓ PASSED - Template ready for remediation playbook execution. 
+REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__job-template-remediation-validator/task.toml b/evaluation/with_skills/rh-sre__job-template-remediation-validator/task.toml new file mode 100644 index 00000000..2b6428ba --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__job-template-remediation-validator" +name = "rh-sre Job Template Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "job-template-remediation-validator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/llm_judge.py b/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/llm_judge.py new file mode 100644 index 00000000..106f21c9 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "ask_job_type_required", "file": "/root/report.md", "question": "Does the report require ask_job_type_on_launch: true for dual check/run mode support?", "reference": "A skilled report requires this for dry-run vs run flexibility. 
An unskilled report doesn't validate this field."}, + {"id": "become_and_credentials", "file": "/root/report.md", "question": "Does the report validate both become_enabled and credentials (checking summary_fields.credentials or credentials array)?", "reference": "A skilled report checks both credential locations. An unskilled report checks only one."}, + {"id": "required_vs_recommended", "file": "/root/report.md", "question": "Does the report distinguish required fields (inventory, project, playbook, credentials, become, ask_job_type) from recommended (ask_variables, ask_limit)?", "reference": "A skilled report categorizes validation checks by priority. An unskilled report treats all checks equally."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/test.sh b/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/test_outputs.py b/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/test_outputs.py new file mode 100644 index 00000000..b39c5886 --- /dev/null +++ b/evaluation/with_skills/rh-sre__job-template-remediation-validator/tests/test_outputs.py @@ -0,0 +1,63 @@ +""" +Tests for rh-sre__job-template-remediation-validator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['valid', 'job template', 'check']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_ask_job_type_on_launch(self): + """Skill teaches ask_job_type_on_launch: true is required for check vs run modes.""" + c = read_report().lower() + assert any(t in c for t in ["ask_job_type", "ask_job_type_on_launch"]), ( + "should require ask_job_type_on_launch (skill: for check vs run)" + ) + + def test_credentials_check_both_fields(self): + """Skill teaches credentials may be in summary_fields.credentials OR credentials array.""" + c = read_report().lower() + assert any(t in c for t in ["summary_fields", "credentials array", "both"]), ( + "should check credentials in summary_fields or credentials array (skill-specific)" + ) + + def test_become_enabled_required(self): + """Skill: become_enabled required for package updates.""" + c = 
read_report().lower() + assert any(t in c for t in ["become", "privilege", "escalat", "sudo"]), ( + "should require privilege escalation (skill: required for remediation)" + ) + + def test_required_vs_recommended(self): + """Skill: Distinguish required (inventory, project, playbook, credentials, become, ask_job_type) vs recommended (ask_variables, ask_limit).""" + c = read_report().lower() + has_required = any(t in c for t in ["required", "must", "inventory", "project", "playbook"]) + has_recommended = any(t in c for t in ["recommended", "warn", "variable", "limit"]) + assert has_required and has_recommended, ( + "should distinguish required vs recommended checks (skill: Phase 2 vs 3)" + ) diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/Dockerfile b/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/Dockerfile new file mode 100644 index 00000000..d5c9e7b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/Dockerfile @@ -0,0 +1,56 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + 
"args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + }, \ + "aap-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-aap-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-aap-mcp.py b/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-aap-mcp.py new file mode 100644 index 00000000..d8ae4fd5 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-aap-mcp.py @@ -0,0 +1,1048 @@ +#!/usr/bin/env python3 +""" +Mock AAP (Ansible Automation Platform) MCP Server + +Simulates the AAP MCP gateway for per-skill evaluation tasks. Implements +the full set of tools used by rh-sre skills: + - job_templates_list / job_templates_retrieve + - projects_list + - job_templates_launch_retrieve + - jobs_retrieve / jobs_stdout_retrieve + - jobs_job_events_list / jobs_job_host_summaries_list + - jobs_relaunch_retrieve + - inventories_list / hosts_list + +Data mirrors a realistic AAP deployment: + - 6 job templates (3 remediation, 1 compliance, 1 patching, 1 reporting) + - 3 projects (remediation, compliance, reporting) + - 3 inventories (production 30 hosts, staging 15 hosts, all-managed 63 hosts) + - 12 recent jobs with varied statuses + +Follows the same mock-server pattern as mock-lightspeed-mcp.py. 
+""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +mcp = FastMCP("aap-mcp") + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +def _ts(delta: timedelta) -> str: + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +# --------------------------------------------------------------------------- +# Mock data: Projects +# --------------------------------------------------------------------------- + +MOCK_PROJECTS = [ + { + "id": 6, + "type": "project", + "name": "Remediation Playbooks", + "description": "CVE and security remediation playbooks managed via Git", + "scm_type": "git", + "scm_url": "https://github.com/org/remediation-playbooks.git", + "scm_branch": "main", + "scm_revision": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2", + "status": "successful", + "last_job_run": _ts(timedelta(hours=2)), + "last_update_failed": False, + "created": _ts(timedelta(days=90)), + "modified": _ts(timedelta(hours=2)), + }, + { + "id": 7, + "type": "project", + "name": "Compliance Checks", + "description": "STIG and CIS compliance scanning playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/compliance-playbooks.git", + "scm_branch": "main", + "scm_revision": "b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3", + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "last_update_failed": False, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 8, + "type": "project", + "name": "Fleet Reporting", + "description": "System inventory and health reporting playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/fleet-reports.git", + "scm_branch": "main", + "scm_revision": "c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", + "status": "successful", + "last_job_run": _ts(timedelta(days=3)), + "last_update_failed": False, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=3)), + }, +] + +# 
--------------------------------------------------------------------------- +# Mock data: Inventories & Hosts +# --------------------------------------------------------------------------- + +MOCK_INVENTORIES = [ + { + "id": 1, + "type": "inventory", + "name": "Production Systems", + "description": "All production RHEL systems across data centers", + "total_hosts": 30, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 5, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 2, + "type": "inventory", + "name": "Staging Systems", + "description": "Pre-production staging environment", + "total_hosts": 15, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 3, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=300)), + "modified": _ts(timedelta(days=7)), + }, + { + "id": 3, + "type": "inventory", + "name": "All Managed Systems", + "description": "Complete fleet: production, staging, development, QA, legacy", + "total_hosts": 63, + "has_active_failures": True, + "hosts_with_active_failures": 2, + "total_groups": 8, + "groups_with_active_failures": 1, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(hours=6)), + }, +] + + +def _generate_hosts(inventory_id: int) -> list[dict]: + """Generate realistic hosts for an inventory.""" + hosts: list[dict] = [] + if inventory_id == 1: + roles = ["web", "db", "app", "lb", "monitoring", "cache"] + for i, role in enumerate(roles): + for j in range(5 if role in ("web", "app") else 4 if role == "db" else 3 if role == "monitoring" else 2): + hosts.append({ + "id": len(hosts) + 1, + "type": "host", + "name": f"{role}-{j+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + 
"has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "production", "role": "{role}"}}', + }) + if len(hosts) >= 30: + break + if len(hosts) >= 30: + break + elif inventory_id == 2: + for i in range(15): + role = ["web", "db", "app"][i % 3] + hosts.append({ + "id": 100 + i, + "type": "host", + "name": f"{role}-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging", "role": "{role}"}}', + }) + elif inventory_id == 3: + for i in range(30): + hosts.append({ + "id": 200 + i, + "type": "host", + "name": f"host-{i+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": i in (45, 58), + "variables": f'{{"rhel_version": "9.3", "environment": "production"}}', + }) + for i in range(15): + hosts.append({ + "id": 230 + i, + "type": "host", + "name": f"host-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging"}}', + }) + for i in range(10): + hosts.append({ + "id": 245 + i, + "type": "host", + "name": f"dev-{i+1:02d}.dev.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "8.9", "environment": "development"}}', + }) + for i in range(5): + hosts.append({ + "id": 255 + i, + "type": "host", + "name": f"qa-{i+1:02d}.qa.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.2", "environment": "qa"}}', + }) + for i in range(3): + hosts.append({ + "id": 260 + i, + "type": "host", + "name": f"legacy-{i+1:02d}.corp.example.com", + "inventory": inventory_id, + "enabled": i < 2, + "has_active_failures": i == 2, + "variables": f'{{"rhel_version": "7.9", "environment": "legacy"}}', + }) + return hosts + + +# 
--------------------------------------------------------------------------- +# Mock data: Job Templates +# --------------------------------------------------------------------------- + +MOCK_JOB_TEMPLATES = [ + { + "id": 10, + "type": "job_template", + "name": "CVE Remediation - Kernel Update", + "description": "Kernel update with boom snapshot for rollback safety", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "successful", + "last_job_run": _ts(timedelta(hours=4)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1001, "status": "successful", "finished": _ts(timedelta(hours=4))}, + }, + "created": _ts(timedelta(days=60)), + "modified": _ts(timedelta(days=2)), + }, + { + "id": 11, + "type": "job_template", + "name": "CVE Remediation - Package Update", + "description": "General package update for CVE remediation with needs-restarting check", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "check", + "verbosity": 1, + "timeout": 1800, + "forks": 10, + "status": "successful", + "last_job_run": _ts(timedelta(hours=12)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, 
"name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1005, "status": "successful", "finished": _ts(timedelta(hours=12))}, + }, + "created": _ts(timedelta(days=45)), + "modified": _ts(timedelta(days=5)), + }, + { + "id": 12, + "type": "job_template", + "name": "CVE Remediation - Generic", + "description": "Generic CVE remediation template for ad-hoc patches", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-remediation.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "never updated", + "last_job_run": None, + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + }, + "created": _ts(timedelta(days=30)), + "modified": _ts(timedelta(days=30)), + }, + { + "id": 20, + "type": "job_template", + "name": "Compliance Check - STIG", + "description": "Run STIG compliance scan across fleet", + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": False, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 7200, + "forks": 20, + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "summary_fields": { + "project": {"id": 7, "name": "Compliance Checks", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 2, "name": "compliance-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1010, "status": "successful", "finished": _ts(timedelta(days=1))}, + }, + 
"created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=14)), + }, + { + "id": 25, + "type": "job_template", + "name": "Emergency Patching", + "description": "Emergency patch application — NO become enabled (misconfigured)", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": False, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 600, + "forks": 25, + "status": "failed", + "last_job_run": _ts(timedelta(days=7)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1020, "status": "failed", "finished": _ts(timedelta(days=7))}, + }, + "created": _ts(timedelta(days=200)), + "modified": _ts(timedelta(days=200)), + }, + { + "id": 30, + "type": "job_template", + "name": "Fleet Health Report", + "description": "Generate fleet health and inventory report", + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": True, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 1800, + "forks": 30, + "status": "successful", + "last_job_run": _ts(timedelta(hours=6)), + "summary_fields": { + "project": {"id": 8, "name": "Fleet Reporting", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1025, "status": "successful", "finished": _ts(timedelta(hours=6))}, + }, + "created": _ts(timedelta(days=120)), + 
"modified": _ts(timedelta(days=14)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Jobs (recent runs) +# --------------------------------------------------------------------------- + +PROD_HOSTS = [ + "web-01.prod.example.com", + "web-02.prod.example.com", + "db-01.prod.example.com", + "db-02.prod.example.com", + "app-01.prod.example.com", + "app-02.prod.example.com", +] + +MOCK_JOBS = [ + { + "id": 1001, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "check", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=4, minutes=30)), + "finished": _ts(timedelta(hours=4)), + "elapsed": 1800.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1002, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=3, minutes=45)), + "finished": _ts(timedelta(hours=3)), + "elapsed": 2700.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1005, + "type": "job", + "name": "CVE Remediation - Package Update", + "job_type": "run", + "status": "successful", + "failed": False, + 
"started": _ts(timedelta(hours=12, minutes=20)), + "finished": _ts(timedelta(hours=12)), + "elapsed": 1200.0, + "job_template": 11, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "limit": "", + "extra_vars": '{"target_cve": "CVE-2024-54321"}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 11, "name": "CVE Remediation - Package Update"}, + }, + }, + { + "id": 1010, + "type": "job", + "name": "Compliance Check - STIG", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(days=1, hours=2)), + "finished": _ts(timedelta(days=1)), + "elapsed": 7200.0, + "job_template": 20, + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 20, "name": "Compliance Check - STIG"}, + }, + }, + { + "id": 1020, + "type": "job", + "name": "Emergency Patching", + "job_type": "run", + "status": "failed", + "failed": True, + "started": _ts(timedelta(days=7, hours=1)), + "finished": _ts(timedelta(days=7)), + "elapsed": 3600.0, + "job_template": 25, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 25, "name": "Emergency Patching"}, + }, + }, + { + "id": 1025, + "type": "job", + "name": "Fleet Health Report", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=6, minutes=30)), + "finished": _ts(timedelta(hours=6)), + "elapsed": 1800.0, + "job_template": 30, + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 30, "name": "Fleet Health Report"}, + }, + }, +] + +_next_job_id = 2000 + + +# 
--------------------------------------------------------------------------- +# Mock stdout generators +# --------------------------------------------------------------------------- + +def _generate_stdout(job: dict) -> str: + """Generate realistic Ansible playbook stdout for a job.""" + playbook_name = job.get("name", "Unknown") + job_type = job.get("job_type", "run") + status = job.get("status", "successful") + limit = job.get("limit", "") + hosts = limit.split(",") if limit else PROD_HOSTS[:3] + hosts = [h.strip() for h in hosts if h.strip()] + extra_vars = job.get("extra_vars", "{}") + mode = " (CHECK MODE)" if job_type == "check" else "" + + lines = [] + lines.append(f"PLAY [{playbook_name}] *****") + lines.append("") + + lines.append(f"TASK [Gathering Facts{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}]") + lines.append("") + + if "kernel" in playbook_name.lower(): + lines.append(f"TASK [Create boom snapshot for rollback{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}] => {{\"msg\": \"boom create --title pre-remediation-CVE-2024-12345\"}}") + lines.append("") + + lines.append(f"TASK [Check disk space for kernel update{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}] => {{\"msg\": \"Disk space OK: 45% used\"}}") + lines.append("") + + lines.append(f"TASK [Update kernel package{mode}] *****") + for h in hosts: + result = "changed" if status == "successful" else "fatal" + if result == "changed": + lines.append(f'changed: [{h}] => {{"msg": "kernel-5.14.0-362.24.1.el9_3 -> kernel-5.14.0-362.24.2.el9_3"}}') + else: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Permission denied", "rc": 1}}') + lines.append("") + + lines.append(f"TASK [Check if reboot is needed (needs-restarting -r){mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"rc": 1, "msg": "Reboot is required to fully utilize updates."}}') + lines.append("") + + elif "package" in playbook_name.lower(): + lines.append(f"TASK [Update target packages for CVE remediation{mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"msg": "httpd-2.4.53-7.el9 -> httpd-2.4.57-8.el9"}}') + lines.append("") + + lines.append(f"TASK [Restart affected services{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append(f"TASK [Verify service health{mode}] *****") + for h in hosts: + lines.append(f'ok: [{h}] => {{"msg": "Service httpd is running"}}') + lines.append("") + + elif "emergency" in playbook_name.lower() and status == "failed": + lines.append(f"TASK [Apply emergency patch{mode}] *****") + for h in hosts: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Missing sudo password (become_enabled not set)", "rc": 1}}') + lines.append("") + lines.append("NO MORE HOSTS LEFT *****") + lines.append("") + + else: + lines.append(f"TASK [Execute playbook tasks{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append("PLAY RECAP *****") + for h in hosts: + if status == "successful": + ok_count = random.randint(3, 6) + changed_count = random.randint(1, 3) + lines.append(f"{h:<45} : ok={ok_count} changed={changed_count} unreachable=0 failed=0 skipped=0 rescued=0 ignored=0") + elif status == "failed": + lines.append(f"{h:<45} : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0") + lines.append("") + + return "\n".join(lines) + + +def _generate_events(job: dict) -> list[dict]: + """Generate realistic Ansible task events for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + events: list[dict] = [] + eid = 1 + + task_names = ["Gathering Facts"] + if "kernel" in job.get("name", "").lower(): + task_names += [ + "Create boom snapshot for rollback", + "Check disk space for kernel update", + "Update kernel package", + "Check if reboot is needed (needs-restarting -r)", + ] + elif "package" in job.get("name", "").lower(): + task_names += [ + "Update target packages for CVE remediation", + "Restart affected services", + "Verify service health", + ] + else: + task_names += ["Execute playbook tasks"] + + for task_name in task_names: + for host in hosts: + is_failed = job.get("status") == "failed" and task_name != "Gathering Facts" + events.append({ + "id": eid, + "type": "job_event", + "event": "runner_on_ok" if not is_failed else "runner_on_failed", + "task": task_name, + "host": host, + "host_name": host, + "play": job.get("name", ""), + "changed": task_name != "Gathering Facts" and not is_failed, + "failed": is_failed, + "event_data": { + "task": task_name, + "host": 
host, + "res": { + "changed": task_name != "Gathering Facts" and not is_failed, + "msg": "Task completed" if not is_failed else "Permission denied", + }, + }, + "created": _ts(timedelta(hours=4, minutes=30 - eid)), + }) + eid += 1 + + return events + + +def _generate_host_summaries(job: dict) -> list[dict]: + """Generate per-host summaries for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + summaries: list[dict] = [] + + for i, host in enumerate(hosts): + is_failed = job.get("status") == "failed" + summaries.append({ + "id": i + 1, + "type": "job_host_summary", + "host": i + 1, + "host_name": host, + "ok": 1 if is_failed else random.randint(3, 6), + "changed": 0 if is_failed else random.randint(1, 3), + "dark": 0, + "failures": 1 if is_failed else 0, + "skipped": 0, + "processed": 1, + "failed": is_failed, + }) + + return summaries + + +# --------------------------------------------------------------------------- +# MCP Tools: Job Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def job_templates_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available job templates in AAP. + + Args: + page_size: Number of results per page (default 10, max 200). + search: Optional search string to filter templates by name. + """ + results = MOCK_JOB_TEMPLATES + if search: + s = search.lower() + results = [t for t in results if s in t["name"].lower() or s in t.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_retrieve(id: str) -> dict: + """Retrieve detailed information about a specific job template. + + Args: + id: Job template ID (as string). 
+ """ + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. Job template {id} does not exist."} + return template + + +@mcp.tool() +def projects_list( + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List available projects in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter projects by name. + """ + results = MOCK_PROJECTS + if search: + s = search.lower() + results = [p for p in results if s in p["name"].lower() or s in p.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_launch_retrieve( + id: str, + requestBody: Optional[dict] = None, +) -> dict: + """Launch a job from a job template. + + Args: + id: Job template ID to launch. + requestBody: Optional launch parameters including job_type ('run' or 'check'), + extra_vars (dict), and limit (comma-separated host list). + """ + global _next_job_id + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. 
Job template {id} does not exist."} + + body = requestBody or {} + job_type = body.get("job_type", template.get("job_type", "run")) + + if not template.get("ask_job_type_on_launch") and job_type != template.get("job_type"): + return { + "error": f"Cannot override job_type: ask_job_type_on_launch is disabled on template {id}", + } + + job_id = _next_job_id + _next_job_id += 1 + + new_job = { + "id": job_id, + "type": "job", + "name": template["name"], + "job_type": job_type, + "status": "pending", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": None, + "elapsed": 0.0, + "job_template": tid, + "inventory": template["inventory"], + "project": template["project"], + "playbook": template["playbook"], + "limit": body.get("limit", ""), + "extra_vars": str(body.get("extra_vars", {})), + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": tid, "name": template["name"]}, + }, + } + MOCK_JOBS.append(new_job) + + # Simulate job completion after launch + new_job["status"] = "successful" + new_job["finished"] = _ts(timedelta(seconds=-300)) + new_job["elapsed"] = 300.0 + + return { + "job": job_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{job_id}/", + "related": { + "stdout": f"/api/controller/v2/jobs/{job_id}/stdout/", + "job_events": f"/api/controller/v2/jobs/{job_id}/job_events/", + "job_host_summaries": f"/api/controller/v2/jobs/{job_id}/job_host_summaries/", + }, + } + + +@mcp.tool() +def jobs_retrieve(id: int) -> dict: + """Get the status and details of a job run. + + Args: + id: Job ID to retrieve. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return job + + +@mcp.tool() +def jobs_list(page_size: int = 10) -> dict: + """List recent job runs. + + Args: + page_size: Number of results to return. 
+ """ + results = sorted(MOCK_JOBS, key=lambda j: j.get("started", ""), reverse=True) + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def jobs_stdout_retrieve(id: int, format: str = "txt") -> dict: + """Get the stdout (console output) from a job run. + + Args: + id: Job ID. + format: Output format ('txt' or 'json'). Default 'txt'. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return { + "content": _generate_stdout(job), + "range": {"start": 0, "end": 1}, + } + + +@mcp.tool() +def jobs_job_events_list(id: int, page_size: int = 50) -> dict: + """Get task-level events for a job run. + + Args: + id: Job ID. + page_size: Number of events to return. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + events = _generate_events(job) + return { + "count": len(events), + "next": None, + "previous": None, + "results": events[:page_size], + } + + +@mcp.tool() +def jobs_job_host_summaries_list(id: int) -> dict: + """Get per-host execution summaries for a job run. + + Args: + id: Job ID. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + summaries = _generate_host_summaries(job) + return { + "count": len(summaries), + "next": None, + "previous": None, + "results": summaries, + } + + +@mcp.tool() +def jobs_relaunch_retrieve( + id: int, + hosts: str = "all", + job_type: str = "run", +) -> dict: + """Relaunch a previously completed or failed job. + + Args: + id: Original job ID to relaunch. + hosts: Which hosts to target ('all' or 'failed'). + job_type: Job type for relaunch ('run' or 'check'). 
+ """ + global _next_job_id + original = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not original: + return {"detail": f"Not found. Job {id} does not exist."} + + new_id = _next_job_id + _next_job_id += 1 + + new_job = { + **original, + "id": new_id, + "job_type": job_type, + "status": "successful", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": _ts(timedelta(seconds=-300)), + "elapsed": 300.0, + "launch_type": "relaunch", + } + MOCK_JOBS.append(new_job) + + return { + "job": new_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{new_id}/", + } + + +# --------------------------------------------------------------------------- +# MCP Tools: Inventory Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def inventories_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available inventories in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter inventories. + """ + results = MOCK_INVENTORIES + if search: + s = search.lower() + results = [inv for inv in results if s in inv["name"].lower() or s in inv.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def hosts_list( + inventory_id: Optional[int] = None, + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List hosts in an inventory. + + Args: + inventory_id: Filter by inventory ID. If not provided, defaults to inventory 1 (Production Systems). + page_size: Number of results per page. + search: Optional search string to filter hosts by name. 
+ """ + inv_id = inventory_id or 1 + hosts = _generate_hosts(inv_id) + if search: + s = search.lower() + hosts = [h for h in hosts if s in h["name"].lower()] + return { + "count": len(hosts), + "next": None if len(hosts) <= page_size else "/api/controller/v2/hosts/?page=2", + "previous": None, + "results": hosts[:page_size], + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/instruction.md b/evaluation/with_skills/rh-sre__mcp-aap-validator/instruction.md new file mode 100644 index 00000000..54d1a0e6 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/instruction.md @@ -0,0 +1,16 @@ +# AAP Connectivity Check Task + +You are a Red Hat SRE. Before starting a remediation workflow that depends on Ansible Automation Platform, you need to verify that the AAP integration is working correctly. + +## Scenario +You are about to run a remediation workflow that uses AAP to execute playbooks. First, you need to confirm that the AAP connection is healthy and that you can access the necessary resources. 
+ +## Requirements +- Test connectivity to the AAP server by querying job templates and inventories +- Verify that the response is valid and contains expected data +- If any connection fails, document the error and provide troubleshooting guidance (credentials, network, SSL, permissions) +- Report the overall AAP readiness status: which capabilities are available and which are not + +Document your methodology, connectivity check results, and troubleshooting findings in `/root/report.md`. + +Use MCP tools to interact with AAP. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/solution/solve.sh b/evaluation/with_skills/rh-sre__mcp-aap-validator/solution/solve.sh new file mode 100644 index 00000000..88542def --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/solution/solve.sh @@ -0,0 +1,25 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# AAP MCP Validation + +## Test Calls +- `job_templates_list(page_size: 10)` from aap-mcp-job-management ✅ +- `inventories_list(page_size: 10)` from aap-mcp-inventory-management ✅ + +## Result +| Server | Outcome | +|--------|---------| +| aap-mcp-job-management | ✅ PASSED | +| aap-mcp-inventory-management | ✅ PASSED | + +## Diagnostics +| Code | Meaning | +|------|---------| +| 401 | Token expired or invalid → regenerate in AAP Web UI → Users → Tokens | +| 403 | Insufficient RBAC (need Job Templates, Inventories) | +| 404 | Wrong URL — AAP_MCP_SERVER must point to MCP gateway, not main AAP UI | + +## Environment +- AAP_MCP_SERVER: Set (must point to MCP gateway) +- AAP_API_TOKEN: Set +REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/task.toml b/evaluation/with_skills/rh-sre__mcp-aap-validator/task.toml new file mode 100644 index 00000000..aad389ea --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + 
+[metadata] +id = "rh-sre__mcp-aap-validator" +name = "rh-sre AAP MCP Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "mcp-aap-validator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/llm_judge.py b/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/llm_judge.py new file mode 100644 index 00000000..474598a6 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "gateway_vs_ui_url", "file": "/root/report.md", "question": "Does the report note that AAP_MCP_SERVER must point to the MCP gateway endpoint, not the main AAP UI URL, and that 404 indicates wrong URL?", "reference": "A skilled report explains the gateway/UI URL distinction and maps 404 to wrong URL. An unskilled report doesn't distinguish these endpoints."}, + {"id": "both_servers_tested", "file": "/root/report.md", "question": "Does the report test both job_templates_list and inventories_list for AAP MCP validation?", "reference": "A skilled report validates both MCP servers. An unskilled report tests only one."}, + {"id": "structured_outcome", "file": "/root/report.md", "question": "Does the report present per-server validation outcomes (PASSED/FAILED/PARTIAL) in table format?", "reference": "A skilled report uses structured table with per-server status. 
An unskilled report uses unstructured text."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/test.sh b/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/test_outputs.py b/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/test_outputs.py new file mode 100644 index 00000000..615713b5 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-aap-validator/tests/test_outputs.py @@ -0,0 +1,66 @@ +""" +Tests for rh-sre__mcp-aap-validator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['aap', 'mcp', 'valid', 'connect']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_both_servers_tested(self): + """Skill: Test BOTH job_templates_list (job-management) AND inventories_list (inventory-management).""" + c = read_report().lower() + has_job = any(t in c for t in ["job_template", "job template", "job-management"]) + has_inv = any(t in c for t in ["inventor", "inventory-management"]) + assert has_job and has_inv, ( + "should test both AAP MCP servers (skill: job-management + inventory-management)" + ) + + def test_mcp_gateway_not_ui(self): + """Skill teaches AAP_MCP_SERVER must point to MCP gateway endpoint, not main AAP UI URL.""" + c = read_report().lower() + assert ("gateway" in c and "mcp" in c) or "aap_mcp_server" in c, ( + "should note AAP_MCP_SERVER must point to MCP gateway, not UI (skill: wrong URL = 404)" + ) + + def test_404_wrong_url(self): + """Skill teaches 
HTTP 404 = wrong AAP_MCP_SERVER URL.""" + c = read_report().lower() + assert "404" in c and any(t in c for t in ["url", "wrong"]), ( + "should explain 404 indicates wrong URL (skill: troubleshooting)" + ) + + def test_table_format(self): + """Skill: Output table with Server | Outcome (PASSED/FAILED/PARTIAL).""" + content = read_report() + c = content.lower() + has_table = "|" in content + has_outcome = any(t in c for t in ["passed", "failed", "partial", "job_templates_list", "inventories_list"]) + assert has_table or has_outcome, ( + "should use table format with outcome (skill: Report Format)" + ) diff --git a/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/environment/Dockerfile b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > 
/root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) 
--------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 
22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + 
"display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i 
in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + 
"PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/instruction.md b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/instruction.md new file mode 100644 index 00000000..37d450b8 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/instruction.md @@ -0,0 +1,16 @@ +# Lightspeed Connectivity Check Task + +You are a Red Hat SRE. Before querying CVE data or generating remediation playbooks, you need to verify that the Red Hat Insights/Lightspeed integration is working correctly. + +## Scenario +You are about to start a CVE investigation that depends on querying vulnerability data from Red Hat Insights. First, you need to confirm the Lightspeed connection is healthy and returning valid data. 
+ +## Requirements +- Test connectivity to the Lightspeed service by querying CVE data +- Verify the response is valid and contains expected vulnerability information +- If the connection fails, document the error and provide troubleshooting guidance (expired tokens, credentials, network issues, server availability) +- Report the overall Lightspeed readiness status + +Document your methodology, connectivity check results, and troubleshooting findings in `/root/report.md`. + +Use MCP tools to interact with the Lightspeed service. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/solution/solve.sh b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/solution/solve.sh new file mode 100644 index 00000000..8336f1ee --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/solution/solve.sh @@ -0,0 +1,29 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Lightspeed MCP Validation + +## Test: Call vulnerability__get_cves with no parameters +- Do NOT pass `limit` parameter (serialization issue: `limit` → `limit_`) +- Default limit=10 is applied automatically + +## Result +| Server | Outcome | +|--------|---------| +| lightspeed-mcp | ✅ PASSED | + +## Failure Root Causes (when connection fails) +- **Credentials**: LIGHTSPEED_CLIENT_ID or LIGHTSPEED_CLIENT_SECRET not set or invalid +- **Expired credentials**: Red Hat Console tokens may have expired +- **Server not running**: MCP server/container may be stopped +- **Network**: Firewall or proxy blocking console.redhat.com +- **Configuration**: .mcp.json misconfigured or server not registered + +## Troubleshooting +1. Verify env vars: LIGHTSPEED_CLIENT_ID, LIGHTSPEED_CLIENT_SECRET (never echo values) +2. Check credentials at: https://console.redhat.com/settings/integrations +3. 
Restart MCP server or host after config changes + +## Environment +- LIGHTSPEED_CLIENT_ID: Set +- LIGHTSPEED_CLIENT_SECRET: Set +REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/task.toml b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/task.toml new file mode 100644 index 00000000..1e356701 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__mcp-lightspeed-validator" +name = "rh-sre Lightspeed MCP Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "mcp-lightspeed-validator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/llm_judge.py b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/llm_judge.py new file mode 100644 index 00000000..905e9250 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "no_params_get_cves", "file": "/root/report.md", "question": "Does the report call get_cves with no parameters (due to limit_ serialization bug)?", "reference": "A skilled report avoids passing limit parameter. 
An unskilled report passes limit which may break the call."}, + {"id": "credential_handling", "file": "/root/report.md", "question": "Does the report reference LIGHTSPEED_CLIENT_ID/CLIENT_SECRET env vars and warn against echoing credentials?", "reference": "A skilled report identifies the correct env vars and warns about credential exposure. An unskilled report doesn't know the specific variable names."}, + {"id": "validation_structure", "file": "/root/report.md", "question": "Does the report present Lightspeed MCP validation in structured table format?", "reference": "A skilled report uses table with PASSED/FAILED outcome. An unskilled report uses unstructured text."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/test.sh b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/test_outputs.py b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/test_outputs.py new file mode 100644 index 00000000..05e6bf9b --- /dev/null +++ b/evaluation/with_skills/rh-sre__mcp-lightspeed-validator/tests/test_outputs.py @@ -0,0 +1,64 @@ +""" +Tests for rh-sre__mcp-lightspeed-validator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['lightspeed', 'mcp', 'valid']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_get_cves_no_params(self): + """Skill: Call vulnerability__get_cves with NO parameters (limit causes limit_ serialization bug).""" + c = read_report().lower() + assert any(t in c for t in ["no param", "without param", "limit_"]), ( + "should call get_cves without parameters (skill: passing limit breaks some clients)" + ) + + def test_lightspeed_credentials(self): + """Skill: LIGHTSPEED_CLIENT_ID + LIGHTSPEED_CLIENT_SECRET are the env vars.""" + c = read_report().lower() + assert any(t in c for t in ["lightspeed_client_id", "client_id", "client_secret"]), ( + "should reference Lightspeed credential env vars (skill: LIGHTSPEED_CLIENT_ID/SECRET)" + ) + + def test_never_echo_credentials(self): + """Skill: Never echo or log credential values.""" + c = read_report().lower() + 
has_security = any(t in c for t in ["never echo", "do not echo", "redact", "sensitive", "protect"]) + assert has_security or "credential" in c, ( + "should address credential handling (skill: never echo values)" + ) + + def test_table_format(self): + """Skill: Output table with Server | Outcome.""" + c = read_report().lower() + has_table = "|" in read_report() + has_outcome = any(t in c for t in ["passed", "failed", "get_cves", "lightspeed"]) + assert has_table or has_outcome, ( + "should use table format (skill: Report Format)" + ) diff --git a/evaluation/with_skills/rh-sre__playbook-executor/environment/Dockerfile b/evaluation/with_skills/rh-sre__playbook-executor/environment/Dockerfile new file mode 100644 index 00000000..d5c9e7b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/environment/Dockerfile @@ -0,0 +1,56 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + }, \ + "aap-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-aap-mcp.py"] \ + } \ + 
} \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-aap-mcp.py b/evaluation/with_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-aap-mcp.py new file mode 100644 index 00000000..d8ae4fd5 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-aap-mcp.py @@ -0,0 +1,1048 @@ +#!/usr/bin/env python3 +""" +Mock AAP (Ansible Automation Platform) MCP Server + +Simulates the AAP MCP gateway for per-skill evaluation tasks. Implements +the full set of tools used by rh-sre skills: + - job_templates_list / job_templates_retrieve + - projects_list + - job_templates_launch_retrieve + - jobs_list / jobs_retrieve / jobs_stdout_retrieve + - jobs_job_events_list / jobs_job_host_summaries_list + - jobs_relaunch_retrieve + - inventories_list / hosts_list + +Data mirrors a realistic AAP deployment: + - 6 job templates (3 remediation, 1 compliance, 1 patching, 1 reporting) + - 3 projects (remediation, compliance, reporting) + - 3 inventories (production 30 hosts, staging 15 hosts, all-managed 63 hosts) + - 6 recent jobs with varied statuses + +Follows the same mock-server pattern as mock-lightspeed-mcp.py. 
+""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +mcp = FastMCP("aap-mcp") + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +def _ts(delta: timedelta) -> str: + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +# --------------------------------------------------------------------------- +# Mock data: Projects +# --------------------------------------------------------------------------- + +MOCK_PROJECTS = [ + { + "id": 6, + "type": "project", + "name": "Remediation Playbooks", + "description": "CVE and security remediation playbooks managed via Git", + "scm_type": "git", + "scm_url": "https://github.com/org/remediation-playbooks.git", + "scm_branch": "main", + "scm_revision": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2", + "status": "successful", + "last_job_run": _ts(timedelta(hours=2)), + "last_update_failed": False, + "created": _ts(timedelta(days=90)), + "modified": _ts(timedelta(hours=2)), + }, + { + "id": 7, + "type": "project", + "name": "Compliance Checks", + "description": "STIG and CIS compliance scanning playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/compliance-playbooks.git", + "scm_branch": "main", + "scm_revision": "b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3", + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "last_update_failed": False, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 8, + "type": "project", + "name": "Fleet Reporting", + "description": "System inventory and health reporting playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/fleet-reports.git", + "scm_branch": "main", + "scm_revision": "c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", + "status": "successful", + "last_job_run": _ts(timedelta(days=3)), + "last_update_failed": False, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=3)), + }, +] + +# 
--------------------------------------------------------------------------- +# Mock data: Inventories & Hosts +# --------------------------------------------------------------------------- + +MOCK_INVENTORIES = [ + { + "id": 1, + "type": "inventory", + "name": "Production Systems", + "description": "All production RHEL systems across data centers", + "total_hosts": 30, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 5, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 2, + "type": "inventory", + "name": "Staging Systems", + "description": "Pre-production staging environment", + "total_hosts": 15, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 3, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=300)), + "modified": _ts(timedelta(days=7)), + }, + { + "id": 3, + "type": "inventory", + "name": "All Managed Systems", + "description": "Complete fleet: production, staging, development, QA, legacy", + "total_hosts": 63, + "has_active_failures": True, + "hosts_with_active_failures": 2, + "total_groups": 8, + "groups_with_active_failures": 1, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(hours=6)), + }, +] + + +def _generate_hosts(inventory_id: int) -> list[dict]: + """Generate realistic hosts for an inventory.""" + hosts: list[dict] = [] + if inventory_id == 1: + roles = ["web", "db", "app", "lb", "monitoring", "cache"] + for i, role in enumerate(roles): + for j in range(5 if role in ("web", "app") else 4 if role == "db" else 3 if role == "monitoring" else 2): + hosts.append({ + "id": len(hosts) + 1, + "type": "host", + "name": f"{role}-{j+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + 
"has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "production", "role": "{role}"}}', + }) + if len(hosts) >= 30: + break + if len(hosts) >= 30: + break + elif inventory_id == 2: + for i in range(15): + role = ["web", "db", "app"][i % 3] + hosts.append({ + "id": 100 + i, + "type": "host", + "name": f"{role}-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging", "role": "{role}"}}', + }) + elif inventory_id == 3: + for i in range(30): + hosts.append({ + "id": 200 + i, + "type": "host", + "name": f"host-{i+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": i in (45, 58), + "variables": f'{{"rhel_version": "9.3", "environment": "production"}}', + }) + for i in range(15): + hosts.append({ + "id": 230 + i, + "type": "host", + "name": f"host-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging"}}', + }) + for i in range(10): + hosts.append({ + "id": 245 + i, + "type": "host", + "name": f"dev-{i+1:02d}.dev.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "8.9", "environment": "development"}}', + }) + for i in range(5): + hosts.append({ + "id": 255 + i, + "type": "host", + "name": f"qa-{i+1:02d}.qa.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.2", "environment": "qa"}}', + }) + for i in range(3): + hosts.append({ + "id": 260 + i, + "type": "host", + "name": f"legacy-{i+1:02d}.corp.example.com", + "inventory": inventory_id, + "enabled": i < 2, + "has_active_failures": i == 2, + "variables": f'{{"rhel_version": "7.9", "environment": "legacy"}}', + }) + return hosts + + +# 
--------------------------------------------------------------------------- +# Mock data: Job Templates +# --------------------------------------------------------------------------- + +MOCK_JOB_TEMPLATES = [ + { + "id": 10, + "type": "job_template", + "name": "CVE Remediation - Kernel Update", + "description": "Kernel update with boom snapshot for rollback safety", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "successful", + "last_job_run": _ts(timedelta(hours=4)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1001, "status": "successful", "finished": _ts(timedelta(hours=4))}, + }, + "created": _ts(timedelta(days=60)), + "modified": _ts(timedelta(days=2)), + }, + { + "id": 11, + "type": "job_template", + "name": "CVE Remediation - Package Update", + "description": "General package update for CVE remediation with needs-restarting check", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "check", + "verbosity": 1, + "timeout": 1800, + "forks": 10, + "status": "successful", + "last_job_run": _ts(timedelta(hours=12)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, 
"name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1005, "status": "successful", "finished": _ts(timedelta(hours=12))}, + }, + "created": _ts(timedelta(days=45)), + "modified": _ts(timedelta(days=5)), + }, + { + "id": 12, + "type": "job_template", + "name": "CVE Remediation - Generic", + "description": "Generic CVE remediation template for ad-hoc patches", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-remediation.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "never updated", + "last_job_run": None, + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + }, + "created": _ts(timedelta(days=30)), + "modified": _ts(timedelta(days=30)), + }, + { + "id": 20, + "type": "job_template", + "name": "Compliance Check - STIG", + "description": "Run STIG compliance scan across fleet", + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": False, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 7200, + "forks": 20, + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "summary_fields": { + "project": {"id": 7, "name": "Compliance Checks", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 2, "name": "compliance-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1010, "status": "successful", "finished": _ts(timedelta(days=1))}, + }, + 
"created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=14)), + }, + { + "id": 25, + "type": "job_template", + "name": "Emergency Patching", + "description": "Emergency patch application — NO become enabled (misconfigured)", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": False, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 600, + "forks": 25, + "status": "failed", + "last_job_run": _ts(timedelta(days=7)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1020, "status": "failed", "finished": _ts(timedelta(days=7))}, + }, + "created": _ts(timedelta(days=200)), + "modified": _ts(timedelta(days=200)), + }, + { + "id": 30, + "type": "job_template", + "name": "Fleet Health Report", + "description": "Generate fleet health and inventory report", + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": True, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 1800, + "forks": 30, + "status": "successful", + "last_job_run": _ts(timedelta(hours=6)), + "summary_fields": { + "project": {"id": 8, "name": "Fleet Reporting", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1025, "status": "successful", "finished": _ts(timedelta(hours=6))}, + }, + "created": _ts(timedelta(days=120)), + 
"modified": _ts(timedelta(days=14)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Jobs (recent runs) +# --------------------------------------------------------------------------- + +PROD_HOSTS = [ + "web-01.prod.example.com", + "web-02.prod.example.com", + "db-01.prod.example.com", + "db-02.prod.example.com", + "app-01.prod.example.com", + "app-02.prod.example.com", +] + +MOCK_JOBS = [ + { + "id": 1001, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "check", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=4, minutes=30)), + "finished": _ts(timedelta(hours=4)), + "elapsed": 1800.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1002, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=3, minutes=45)), + "finished": _ts(timedelta(hours=3)), + "elapsed": 2700.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1005, + "type": "job", + "name": "CVE Remediation - Package Update", + "job_type": "run", + "status": "successful", + "failed": False, + 
"started": _ts(timedelta(hours=12, minutes=20)), + "finished": _ts(timedelta(hours=12)), + "elapsed": 1200.0, + "job_template": 11, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "limit": "", + "extra_vars": '{"target_cve": "CVE-2024-54321"}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 11, "name": "CVE Remediation - Package Update"}, + }, + }, + { + "id": 1010, + "type": "job", + "name": "Compliance Check - STIG", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(days=1, hours=2)), + "finished": _ts(timedelta(days=1)), + "elapsed": 7200.0, + "job_template": 20, + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 20, "name": "Compliance Check - STIG"}, + }, + }, + { + "id": 1020, + "type": "job", + "name": "Emergency Patching", + "job_type": "run", + "status": "failed", + "failed": True, + "started": _ts(timedelta(days=7, hours=1)), + "finished": _ts(timedelta(days=7)), + "elapsed": 3600.0, + "job_template": 25, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 25, "name": "Emergency Patching"}, + }, + }, + { + "id": 1025, + "type": "job", + "name": "Fleet Health Report", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=6, minutes=30)), + "finished": _ts(timedelta(hours=6)), + "elapsed": 1800.0, + "job_template": 30, + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 30, "name": "Fleet Health Report"}, + }, + }, +] + +_next_job_id = 2000 + + +# 
--------------------------------------------------------------------------- +# Mock stdout generators +# --------------------------------------------------------------------------- + +def _generate_stdout(job: dict) -> str: + """Generate realistic Ansible playbook stdout for a job.""" + playbook_name = job.get("name", "Unknown") + job_type = job.get("job_type", "run") + status = job.get("status", "successful") + limit = job.get("limit", "") + hosts = limit.split(",") if limit else PROD_HOSTS[:3] + hosts = [h.strip() for h in hosts if h.strip()] + extra_vars = job.get("extra_vars", "{}") + mode = " (CHECK MODE)" if job_type == "check" else "" + + lines = [] + lines.append(f"PLAY [{playbook_name}] *****") + lines.append("") + + lines.append(f"TASK [Gathering Facts{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}]") + lines.append("") + + if "kernel" in playbook_name.lower(): + lines.append(f"TASK [Create boom snapshot for rollback{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}] => {{\"msg\": \"boom create --title pre-remediation-CVE-2024-12345\"}}") + lines.append("") + + lines.append(f"TASK [Check disk space for kernel update{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}] => {{\"msg\": \"Disk space OK: 45% used\"}}") + lines.append("") + + lines.append(f"TASK [Update kernel package{mode}] *****") + for h in hosts: + result = "changed" if status == "successful" else "fatal" + if result == "changed": + lines.append(f'changed: [{h}] => {{"msg": "kernel-5.14.0-362.24.1.el9_3 -> kernel-5.14.0-362.24.2.el9_3"}}') + else: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Permission denied", "rc": 1}}') + lines.append("") + + lines.append(f"TASK [Check if reboot is needed (needs-restarting -r){mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"rc": 1, "msg": "Reboot is required to fully utilize updates."}}') + lines.append("") + + elif "package" in playbook_name.lower(): + lines.append(f"TASK [Update target packages for CVE remediation{mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"msg": "httpd-2.4.53-7.el9 -> httpd-2.4.57-8.el9"}}') + lines.append("") + + lines.append(f"TASK [Restart affected services{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append(f"TASK [Verify service health{mode}] *****") + for h in hosts: + lines.append(f'ok: [{h}] => {{"msg": "Service httpd is running"}}') + lines.append("") + + elif "emergency" in playbook_name.lower() and status == "failed": + lines.append(f"TASK [Apply emergency patch{mode}] *****") + for h in hosts: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Missing sudo password (become_enabled not set)", "rc": 1}}') + lines.append("") + lines.append("NO MORE HOSTS LEFT *****") + lines.append("") + + else: + lines.append(f"TASK [Execute playbook tasks{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append("PLAY RECAP *****") + for h in hosts: + if status == "successful": + ok_count = random.randint(3, 6) + changed_count = random.randint(1, 3) + lines.append(f"{h:<45} : ok={ok_count} changed={changed_count} unreachable=0 failed=0 skipped=0 rescued=0 ignored=0") + elif status == "failed": + lines.append(f"{h:<45} : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0") + lines.append("") + + return "\n".join(lines) + + +def _generate_events(job: dict) -> list[dict]: + """Generate realistic Ansible task events for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + events: list[dict] = [] + eid = 1 + + task_names = ["Gathering Facts"] + if "kernel" in job.get("name", "").lower(): + task_names += [ + "Create boom snapshot for rollback", + "Check disk space for kernel update", + "Update kernel package", + "Check if reboot is needed (needs-restarting -r)", + ] + elif "package" in job.get("name", "").lower(): + task_names += [ + "Update target packages for CVE remediation", + "Restart affected services", + "Verify service health", + ] + else: + task_names += ["Execute playbook tasks"] + + for task_name in task_names: + for host in hosts: + is_failed = job.get("status") == "failed" and task_name != "Gathering Facts" + events.append({ + "id": eid, + "type": "job_event", + "event": "runner_on_ok" if not is_failed else "runner_on_failed", + "task": task_name, + "host": host, + "host_name": host, + "play": job.get("name", ""), + "changed": task_name != "Gathering Facts" and not is_failed, + "failed": is_failed, + "event_data": { + "task": task_name, + "host": 
host, + "res": { + "changed": task_name != "Gathering Facts" and not is_failed, + "msg": "Task completed" if not is_failed else "Permission denied", + }, + }, + "created": _ts(timedelta(hours=4, minutes=30 - eid)), + }) + eid += 1 + + return events + + +def _generate_host_summaries(job: dict) -> list[dict]: + """Generate per-host summaries for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + summaries: list[dict] = [] + + for i, host in enumerate(hosts): + is_failed = job.get("status") == "failed" + summaries.append({ + "id": i + 1, + "type": "job_host_summary", + "host": i + 1, + "host_name": host, + "ok": 1 if is_failed else random.randint(3, 6), + "changed": 0 if is_failed else random.randint(1, 3), + "dark": 0, + "failures": 1 if is_failed else 0, + "skipped": 0, + "processed": 1, + "failed": is_failed, + }) + + return summaries + + +# --------------------------------------------------------------------------- +# MCP Tools: Job Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def job_templates_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available job templates in AAP. + + Args: + page_size: Number of results per page (default 10, max 200). + search: Optional search string to filter templates by name. + """ + results = MOCK_JOB_TEMPLATES + if search: + s = search.lower() + results = [t for t in results if s in t["name"].lower() or s in t.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_retrieve(id: str) -> dict: + """Retrieve detailed information about a specific job template. + + Args: + id: Job template ID (as string). 
+ """ + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. Job template {id} does not exist."} + return template + + +@mcp.tool() +def projects_list( + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List available projects in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter projects by name. + """ + results = MOCK_PROJECTS + if search: + s = search.lower() + results = [p for p in results if s in p["name"].lower() or s in p.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_launch_retrieve( + id: str, + requestBody: Optional[dict] = None, +) -> dict: + """Launch a job from a job template. + + Args: + id: Job template ID to launch. + requestBody: Optional launch parameters including job_type ('run' or 'check'), + extra_vars (dict), and limit (comma-separated host list). + """ + global _next_job_id + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. 
Job template {id} does not exist."} + + body = requestBody or {} + job_type = body.get("job_type", template.get("job_type", "run")) + + if not template.get("ask_job_type_on_launch") and job_type != template.get("job_type"): + return { + "error": f"Cannot override job_type: ask_job_type_on_launch is disabled on template {id}", + } + + job_id = _next_job_id + _next_job_id += 1 + + new_job = { + "id": job_id, + "type": "job", + "name": template["name"], + "job_type": job_type, + "status": "pending", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": None, + "elapsed": 0.0, + "job_template": tid, + "inventory": template["inventory"], + "project": template["project"], + "playbook": template["playbook"], + "limit": body.get("limit", ""), + "extra_vars": str(body.get("extra_vars", {})), + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": tid, "name": template["name"]}, + }, + } + MOCK_JOBS.append(new_job) + + # Simulate job completion after launch + new_job["status"] = "successful" + new_job["finished"] = _ts(timedelta(seconds=-300)) + new_job["elapsed"] = 300.0 + + return { + "job": job_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{job_id}/", + "related": { + "stdout": f"/api/controller/v2/jobs/{job_id}/stdout/", + "job_events": f"/api/controller/v2/jobs/{job_id}/job_events/", + "job_host_summaries": f"/api/controller/v2/jobs/{job_id}/job_host_summaries/", + }, + } + + +@mcp.tool() +def jobs_retrieve(id: int) -> dict: + """Get the status and details of a job run. + + Args: + id: Job ID to retrieve. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return job + + +@mcp.tool() +def jobs_list(page_size: int = 10) -> dict: + """List recent job runs. + + Args: + page_size: Number of results to return. 
+ """ + results = sorted(MOCK_JOBS, key=lambda j: j.get("started", ""), reverse=True) + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def jobs_stdout_retrieve(id: int, format: str = "txt") -> dict: + """Get the stdout (console output) from a job run. + + Args: + id: Job ID. + format: Output format ('txt' or 'json'). Default 'txt'. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return { + "content": _generate_stdout(job), + "range": {"start": 0, "end": 1}, + } + + +@mcp.tool() +def jobs_job_events_list(id: int, page_size: int = 50) -> dict: + """Get task-level events for a job run. + + Args: + id: Job ID. + page_size: Number of events to return. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + events = _generate_events(job) + return { + "count": len(events), + "next": None, + "previous": None, + "results": events[:page_size], + } + + +@mcp.tool() +def jobs_job_host_summaries_list(id: int) -> dict: + """Get per-host execution summaries for a job run. + + Args: + id: Job ID. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + summaries = _generate_host_summaries(job) + return { + "count": len(summaries), + "next": None, + "previous": None, + "results": summaries, + } + + +@mcp.tool() +def jobs_relaunch_retrieve( + id: int, + hosts: str = "all", + job_type: str = "run", +) -> dict: + """Relaunch a previously completed or failed job. + + Args: + id: Original job ID to relaunch. + hosts: Which hosts to target ('all' or 'failed'). + job_type: Job type for relaunch ('run' or 'check'). 
+ """ + global _next_job_id + original = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not original: + return {"detail": f"Not found. Job {id} does not exist."} + + new_id = _next_job_id + _next_job_id += 1 + + new_job = { + **original, + "id": new_id, + "job_type": job_type, + "status": "successful", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": _ts(timedelta(seconds=-300)), + "elapsed": 300.0, + "launch_type": "relaunch", + } + MOCK_JOBS.append(new_job) + + return { + "job": new_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{new_id}/", + } + + +# --------------------------------------------------------------------------- +# MCP Tools: Inventory Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def inventories_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available inventories in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter inventories. + """ + results = MOCK_INVENTORIES + if search: + s = search.lower() + results = [inv for inv in results if s in inv["name"].lower() or s in inv.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def hosts_list( + inventory_id: Optional[int] = None, + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List hosts in an inventory. + + Args: + inventory_id: Filter by inventory ID. If not provided, lists hosts from all inventories. + page_size: Number of results per page. + search: Optional search string to filter hosts by name. 
+ """ + inv_id = inventory_id or 1 + hosts = _generate_hosts(inv_id) + if search: + s = search.lower() + hosts = [h for h in hosts if s in h["name"].lower()] + return { + "count": len(hosts), + "next": None if len(hosts) <= page_size else f"/api/controller/v2/hosts/?page=2", + "previous": None, + "results": hosts[:page_size], + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers (sids 15-20, as assigned by generate_mock_systems) + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers (sids 38-40) + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently."
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed.
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__playbook-executor/instruction.md b/evaluation/with_skills/rh-sre__playbook-executor/instruction.md new file mode 100644 index 00000000..5cced969 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/instruction.md @@ -0,0 +1,18 @@ +# Playbook Execution Task + +You are a Red Hat SRE. A remediation playbook needs to be executed against production systems through Ansible Automation Platform. You are responsible for the safe execution and monitoring of this job. + +## Scenario +A CVE remediation playbook has been prepared and a job template exists in AAP. 
You need to execute it safely: validate the template first, consider running a dry-run, launch the production job, monitor its progress, and report the results. + +## Requirements +- Locate and validate the job template in AAP (check it has the right inventory, project, credentials, and privilege escalation) +- Document a pre-flight checklist: template readiness, target hosts, and any prerequisites +- Plan the execution: whether to run a dry-run (check mode) first, how to monitor job progress, and what to do if it fails +- Launch the job (or document the launch procedure) and monitor its status +- Report per-host results: which hosts succeeded, which failed, and any error details +- Include guidance for handling failures (retry, rollback, escalation) + +Document your methodology, execution plan, and results in `/root/report.md`. + +Use MCP tools to interact with AAP. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__playbook-executor/solution/solve.sh b/evaluation/with_skills/rh-sre__playbook-executor/solution/solve.sh new file mode 100644 index 00000000..090c2294 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Playbook Execution Report + +## Execution Steps +1. Dry-run: job_type='check' (Ansible check mode) +2. Review results +3. Execute: job_type='run' + +## Git Flow +Playbook stored at playbooks/remediation/cve-2024-12345.yml. Commit, push, wait for sync complete before launch. No override at launch—AAP runs from synced project. + +## Job Template Validation +Invoke job-template-remediation-validator for each candidate template. 
+ +## Execution Report +- Status: Success +- Systems patched: 4/4 +- Validate job log (jobs_stdout_retrieve) for CVE handling +- Suggest remediation-verifier after success +REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__playbook-executor/task.toml b/evaluation/with_skills/rh-sre__playbook-executor/task.toml new file mode 100644 index 00000000..eaa9b790 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__playbook-executor" +name = "rh-sre Playbook Execution Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "playbook-executor", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__playbook-executor/tests/llm_judge.py b/evaluation/with_skills/rh-sre__playbook-executor/tests/llm_judge.py new file mode 100644 index 00000000..15da24ed --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "launch_config_and_git_flow", "file": "/root/report.md", "question": "Does the report configure launch-time prompts for flexibility (variables, host limits, job type) and require Git synchronization before execution?", "reference": "A skilled report configures launch-time prompts and requires Git sync. 
An unskilled report hardcodes execution settings and skips synchronization requirements."}, + {"id": "relaunch_failed_hosts", "file": "/root/report.md", "question": "Does the report mention relaunching with hosts: failed to retry only failed hosts?", "reference": "A skilled report uses jobs_relaunch_retrieve with hosts: failed. An unskilled report suggests full re-execution."}, + {"id": "dry_run_and_monitoring", "file": "/root/report.md", "question": "Does the report recommend dry-run first and include per-host execution monitoring?", "reference": "A skilled report follows check mode before run and monitors per-host. An unskilled report runs directly without dry-run."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score},
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__playbook-executor/tests/test.sh b/evaluation/with_skills/rh-sre__playbook-executor/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__playbook-executor/tests/test_outputs.py b/evaluation/with_skills/rh-sre__playbook-executor/tests/test_outputs.py new file mode 100644 index 00000000..dab37078 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-executor/tests/test_outputs.py @@ -0,0 +1,89 @@ +""" +Tests for rh-sre__playbook-executor per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['playbook', 'execut', 'job']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_git_flow_mandatory(self): + """Skill: When template playbook path differs from generated playbook, Git Flow (commit, push, sync) is MANDATORY before launch.""" + c = read_report().lower() + has_git = any(t in c for t in ["git", "commit", "push", "sync"]) + has_block = any(t in c for t in ["before launch", "mandatory", "must", "block", "sync complete"]) + assert has_git or has_block, ( + "should require Git Flow when path differs (skill: no override at launch)" + ) + + def test_launch_configuration(self): + """Skill teaches configuring launch-time prompts for execution flexibility + (job type, variables, host limiting). 
Without skill, agents run playbooks + with hardcoded settings.""" + c = read_report().lower() + has_launch = any(t in c for t in ["launch", "prompt", "on launch"]) + has_config = any(t in c for t in [ + "variable", "limit", "job type", "configur", + ]) + assert has_launch and has_config, ( + "should configure launch-time prompts for execution flexibility" + ) + + def test_relaunch_failed_hosts(self): + """Skill: jobs_relaunch_retrieve with hosts: 'failed' to retry only failed hosts.""" + c = read_report().lower() + assert any(t in c for t in ["relaunch", "failed hosts", "retry failed"]), ( + "should mention relaunch for failed hosts (skill: jobs_relaunch_retrieve)" + ) + + def test_dry_run_first(self): + """Skill: Recommend dry-run (check mode) before production execution.""" + c = read_report().lower() + assert any(t in c for t in ["dry", "check mode", "check_mode", "preview", "before launch"]), ( + "should recommend dry-run first (skill: Phase 3)" + ) + + def test_per_host_results(self): + """Skill: Report per-host results (succeeded, failed, error details).""" + c = read_report().lower() + has_per_host = any(t in c for t in ["per host", "each host", "host result", "stdout", "host summary"]) + has_ansible_outcome = any(t in c for t in ["succeeded", "failed", "unreachable", "skipped", "changed"]) + assert has_per_host or has_ansible_outcome, ( + "should report per-host execution results (skill: host summaries)" + ) + + def test_error_taxonomy(self): + """Docs teach error taxonomy: connection/permissions/package/service/disk + failure categories with specific recovery paths. 
+ Without docs, agents treat all errors generically.""" + c = read_report().lower() + categories = ["connection", "permission", "package", "service", "disk"] + mentioned = sum(1 for cat in categories if cat in c) + assert mentioned >= 2, ( + "should categorize errors by type (connection/permissions/package/service/disk)" + ) diff --git a/evaluation/with_skills/rh-sre__playbook-generator/environment/Dockerfile b/evaluation/with_skills/rh-sre__playbook-generator/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-generator/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__playbook-generator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__playbook-generator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..2269a235 --- 
/dev/null +++ b/evaluation/with_skills/rh-sre__playbook-generator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,722 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags, no explicit environment) + +CVE data (6 CVEs: the five below plus CVE-2026-1234, a Critical 9.8 kernel RCE): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", 
"pci-compliant", "high-availability"]) + if i <= 2: +
tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": 
f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": 
["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + 
sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2026-1234": { + "cve_id": "CVE-2026-1234", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Critical kernel vulnerability: remote code execution in kernel network stack allows unauthenticated attackers to execute arbitrary code via crafted packets", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2026-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": 
"web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 8, + "total_remediated": 2, + "total_vulnerable": 6, + }, + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": 
"web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently."
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. 
Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed.
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Substring to match against the FQDN (any * is stripped, not expanded). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact.
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__playbook-generator/instruction.md b/evaluation/with_skills/rh-sre__playbook-generator/instruction.md new file mode 100644 index 00000000..585c6f73 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-generator/instruction.md @@ -0,0 +1,17 @@ +# Playbook Generation Task + +You are a Red Hat SRE. A critical CVE has been identified affecting systems in your fleet. You need to generate a remediation playbook that can be used to patch the affected hosts. + +## Scenario +CVE-2026-1234 has been confirmed as a critical kernel vulnerability affecting multiple production RHEL systems. You need to generate an Ansible playbook that remediates this CVE on the affected hosts. 
+ +## Requirements +- Use available tools to generate a remediation playbook for the CVE +- Review the generated playbook content: what packages it updates, whether it requires a reboot, and any risk factors +- Document the playbook metadata: target CVE, affected systems, reboot requirements, and delegation safety notes +- If playbook generation fails, document the failure and describe alternative approaches +- The playbook should be ready to hand off for execution (do not execute it yourself) + +Document the generated playbook and your analysis in `/root/report.md`. You MUST write the report file — do not just display the content. + +Use MCP tools to interact with the environment. If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously — do not stop after preliminary steps. Proceed through playbook generation, review, and report writing without waiting for user input. diff --git a/evaluation/with_skills/rh-sre__playbook-generator/solution/solve.sh b/evaluation/with_skills/rh-sre__playbook-generator/solution/solve.sh new file mode 100644 index 00000000..2543cf93 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-generator/solution/solve.sh @@ -0,0 +1,38 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Playbook Generation Report + +## Methodology +Used `create_vulnerability_playbook` MCP tool (remediations endpoint via Lightspeed) to generate the remediation playbook for CVE-2026-1234. The playbook is returned AS IS — unmodified from the generation tool output. No pre-flight, backup, or restart steps were added. 
+ +## Generated Playbook (returned AS IS, unmodified) +```yaml +- hosts: affected_systems + tasks: + - block: + - name: Create boom snapshot + command: boom create --title "pre-cve-{{ cve_id }}" + - name: Apply patch + dnf: + name: '*' + state: latest + rescue: + - name: Rollback + command: boom rollback + always: + - name: Check reboot needed + command: needs-restarting -r + register: needs_restarting +``` + +## Key Patterns +- block/rescue/always for error handling +- needs-restarting -r for reboot detection (RHEL 8/9) +- boom create for kernel/snapshot before remediation + +## Failure Handling +If the create_vulnerability_playbook tool fails, do not auto-generate a playbook from general knowledge. Present the user with options: (A) Retry the tool, (B) Generate from knowledge with explicit user approval, or (C) Exit and escalate. + +## Execution +Do NOT run ansible-playbook directly. Delegate execution to the playbook-executor skill/workflow. +REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__playbook-generator/task.toml b/evaluation/with_skills/rh-sre__playbook-generator/task.toml new file mode 100644 index 00000000..1ea4f1ac --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-generator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__playbook-generator" +name = "rh-sre Playbook Generation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "playbook-generator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__playbook-generator/tests/llm_judge.py b/evaluation/with_skills/rh-sre__playbook-generator/tests/llm_judge.py new file mode 100644 index 00000000..05cd660f --- 
/dev/null +++ b/evaluation/with_skills/rh-sre__playbook-generator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "mcp_tool_for_generation", "file": "/root/report.md", "question": "Does the report describe using an MCP tool (such as create_vulnerability_playbook or a remediations/Lightspeed endpoint) to generate the playbook, rather than writing it manually from scratch?", "reference": "A skilled report uses the Lightspeed MCP create_vulnerability_playbook tool. An unskilled report writes the playbook manually from general Ansible knowledge without using an MCP generation tool."}, + {"id": "return_as_is", "file": "/root/report.md", "question": "Does the report explicitly state that the generated playbook should be returned AS IS or unmodified, without adding extra steps like pre-flight checks, backup tasks, or restart handlers?", "reference": "A skilled report emphasizes returning the tool output unmodified. An unskilled report adds pre-flight checks, backup steps, restart handlers, or other enhancements to the generated playbook."}, + {"id": "delegation_not_execution", "file": "/root/report.md", "question": "Does the report explicitly state that playbook execution should be delegated to a separate execution workflow and NOT run directly via ansible-playbook?", "reference": "A skilled report delegates execution to a dedicated execution workflow rather than running ansible-playbook directly. An unskilled report runs ansible-playbook directly or doesn't address the execution boundary."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__playbook-generator/tests/test.sh b/evaluation/with_skills/rh-sre__playbook-generator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-generator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__playbook-generator/tests/test_outputs.py b/evaluation/with_skills/rh-sre__playbook-generator/tests/test_outputs.py new file mode 100644 index 00000000..00518d36 --- /dev/null +++ b/evaluation/with_skills/rh-sre__playbook-generator/tests/test_outputs.py @@ -0,0 +1,74 @@ +""" +Tests for rh-sre__playbook-generator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['playbook', 'generat', 'cve']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_mcp_tool_for_generation(self): + """Skill: Use create_vulnerability_playbook MCP tool, not manual playbook writing.""" + c = read_report().lower() + assert any(t in c for t in [ + "create_vulnerability_playbook", "create_vuln_playbook", + "remediations", "lightspeed", + ]) and any(t in c for t in ["tool", "mcp", "generat"]), ( + "should reference MCP tool usage for playbook generation (not manual writing)" + ) + + def test_no_modifications_to_playbook(self): + """Skill: Return playbook AS IS, no modifications—never add pre-flight, backup, restart.""" + c = read_report().lower() + assert any(t in c for t in [ + "as is", "as-is", "unmodified", "do not modify", "no modification", + "unchanged", "without modification", "returned unchanged", + "original output", "generated output", + ]), "should return 
playbook unmodified (skill: no enhancements without user approval)" + + def test_no_auto_generate_on_failure(self): + """Skill: Never auto-generate playbooks from general knowledge without approval.""" + c = read_report().lower() + has_constraint = any(t in c for t in [ + "do not auto", "never auto", "not auto-generat", + "without approval", "explicit approval", "user approval", + "do not generat", "never generat", + ]) + has_options = any(t in c for t in ["retry", "option", "escalat"]) + assert has_constraint or has_options, ( + "should state not to auto-generate playbooks without user approval" + ) + + def test_delegation_to_executor(self): + """Skill: This skill ONLY generates; execution delegated to playbook-executor.""" + c = read_report().lower() + assert any(t in c for t in [ + 'delegat', 'executor', 'playbook-executor', 'hand off', + 'not execute', 'do not run', 'do not execute', + 'not run ansible-playbook', 'not ansible-playbook', + ]), "should delegate execution (not run ansible-playbook directly)" diff --git a/evaluation/with_skills/rh-sre__remediation-verifier/environment/Dockerfile b/evaluation/with_skills/rh-sre__remediation-verifier/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation-verifier/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs 
/root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__remediation-verifier/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__remediation-verifier/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..e826c96e --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation-verifier/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,759 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags, no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def _system_profile_for_host(host_type: str, rhel_version: str, sid: int) -> dict: + """Generate system_profile fields for a host based on type and RHEL version.""" + el = "el9" if rhel_version.startswith("9") else "el8" + kernel = f"5.14.0-362.24.1.{el}_3.x86_64" if rhel_version.startswith("9") else f"4.18.0-477.27.1.{el}.x86_64" + base_pkgs = [ + {"name": "kernel-core", "version": f"5.14.0-362.24.1.{el}.x86_64"}, + {"name": "httpd", "version": f"2.4.57-5.{el}"}, + {"name": "sshd", "version": f"8.9p1-23.{el}"}, + {"name": "firewalld", "version": f"1.2.5-4.{el}"}, + {"name": "systemd", "version": f"250-19.{el}"}, + ] + if "web" in host_type or "lb" in host_type: + base_pkgs.extend([ + {"name": "nginx", "version": f"1.24.1-3.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + elif "db" in host_type: + base_pkgs.extend([ + {"name": "postgresql", "version": f"15.4-1.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + elif 
"mon" in host_type: + base_pkgs.extend([ + {"name": "prometheus", "version": f"2.45.0-1.{el}"}, + {"name": "node_exporter", "version": f"1.6.1-2.{el}"}, + ]) + else: + base_pkgs.extend([ + {"name": "java-17-openjdk", "version": f"17.0.8-4.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + services = ["sshd.service", "firewalld.service", "chronyd.service"] + if "web" in host_type or "lb" in host_type: + services.append("httpd.service") + elif "db" in host_type: + services.extend(["postgresql.service", "postgresql-15.service"]) + elif "mon" in host_type: + services.extend(["prometheus.service", "node_exporter.service"]) + else: + services.append("httpd.service") + ip_octet = 10 + (sid % 245) + mac_hex = f"{(sid % 256):02x}" + return { + "installed_packages": base_pkgs[:8], + "running_services": services, + "network_interfaces": [ + {"name": "eth0", "ipv4": [f"10.0.1.{ip_octet}"], "mac": f"52:54:00:a1:b2:{mac_hex}"}, + {"name": "lo", "ipv4": ["127.0.0.1"], "mac": "00:00:00:00:00:00"}, + ], + "kernel_version": kernel, + } + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 
else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + # Add system_profile to each host + for idx, s in enumerate(systems): + host_type = "app" # default + for ht in ["web", "db", "app", "lb", "mon", "cache"]: + if ht in s["id"]: + host_type = ht + break + s["system_profile"] = _system_profile_for_host( + host_type, s["rhel_version"], idx + 1 + ) + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": 
"sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{40+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-045-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + parts = [p for p in hostname_pattern.split("*") if p]  # literal fragments between wildcards + filtered = [s for s in filtered if all(p in s["fqdn"] for p in parts)] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__remediation-verifier/instruction.md b/evaluation/with_skills/rh-sre__remediation-verifier/instruction.md new file mode 100644 index 00000000..845397fa --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation-verifier/instruction.md @@ -0,0 +1,18 @@ +# Post-Remediation Verification Task + +You are a Red Hat SRE. A CVE remediation playbook was recently executed against production systems. You need to verify that the fix was successfully applied and that the systems are healthy. + +## Scenario +The operations team ran a remediation playbook for CVE-2026-1234 against affected RHEL systems earlier today. You need to confirm the remediation actually worked and that no systems were left in a broken state. 
+ +## Requirements +- Check whether the affected systems are still listed as vulnerable to the CVE +- Verify that the relevant packages have been updated to the fixed versions +- Confirm that critical services on each system are running and healthy after the patch +- Identify any systems where remediation failed or is incomplete +- For any failures, provide troubleshooting guidance (package conflicts, services not restarting, partial patches) +- Produce a verification summary: how many systems passed, how many failed, and any remaining action items + +Document your methodology, verification results, and assessment in `/root/report.md`. + +Use MCP tools to query system data. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__remediation-verifier/solution/solve.sh b/evaluation/with_skills/rh-sre__remediation-verifier/solution/solve.sh new file mode 100644 index 00000000..49714723 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation-verifier/solution/solve.sh @@ -0,0 +1,20 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Remediation Verification + +## CVE-2026-1234 Status +| System | RPM Status | CVE Status | Service Health | +|--------|-----------|------------|----------------| +| web-01 | installed >= fixed | Patched | Healthy | + +## Checks Performed +- get_cve_systems: System removed from affected list or status=Patched +- get_host_details: system_profile.installed_packages >= expected fixed version +- systemd_failed_units: No service disruptions +- enabled_services, running_processes: verified + +## Notes +- Lightspeed inventory lag: up to 24 hours +- Recommend: insights-client --check-results to update inventory +- RPM comparison: installed version >= expected fixed version +REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__remediation-verifier/task.toml b/evaluation/with_skills/rh-sre__remediation-verifier/task.toml new file mode 100644 
index 00000000..23f81673 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation-verifier/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__remediation-verifier" +name = "rh-sre Remediation Verification Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "remediation-verifier", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__remediation-verifier/tests/llm_judge.py b/evaluation/with_skills/rh-sre__remediation-verifier/tests/llm_judge.py new file mode 100644 index 00000000..15b8919b --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation-verifier/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "inventory_24h_lag", "file": "/root/report.md", "question": "Does the report note that Lightspeed inventory can take up to 24 hours to update and recommend insights-client --check-results for re-sync?", "reference": "A skilled report warns about inventory lag. An unskilled report expects immediate updates."}, + {"id": "system_profile_checks", "file": "/root/report.md", "question": "Does the report use get_host_details with include_system_profile for installed packages and service health verification?", "reference": "A skilled report uses system profile data. 
An unskilled report only checks CVE status."}, + {"id": "three_verification_layers", "file": "/root/report.md", "question": "Does the report verify at least 2 of: CVE status, package version, service health?", "reference": "A skilled report performs defense-in-depth verification. An unskilled report only checks one layer."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__remediation-verifier/tests/test.sh b/evaluation/with_skills/rh-sre__remediation-verifier/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation-verifier/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__remediation-verifier/tests/test_outputs.py b/evaluation/with_skills/rh-sre__remediation-verifier/tests/test_outputs.py new file mode 100644 index 00000000..00ddada6 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation-verifier/tests/test_outputs.py @@ -0,0 +1,75 @@ +""" +Tests for rh-sre__remediation-verifier per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['verif', 'remediation', 'confirm']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_three_checks(self): + """Skill: Verify CVE status + package version + service health (defense in depth).""" + c = read_report().lower() + has_cve = any(t in c for t in ["cve", "vulnerab", "patched", "affected"]) + has_pkg = any(t in c for t in ["package", "version", "installed", "fixed"]) + has_svc = any(t in c for t in ["service", "running", "health", "enabled"]) + assert (has_cve and has_pkg) or (has_cve and has_svc) or (has_pkg and has_svc), ( + "should perform at least 2 of 3 checks (skill: CVE status, package, service)" + ) + + def test_package_version_comparison(self): + """Skill: Compare installed package version to expected fixed version (RPM-style).""" + c = read_report().lower() + has_compare = any(t in c for t in ["compare", "version", "expected", "installed"]) + has_fixed = 
any(t in c for t in ["fixed", "updated", "el8", "el9"]) + assert has_compare or has_fixed, ( + "should compare package versions (skill: verify_package_version)" + ) + + def test_inventory_24h_lag(self): + """Skill: Lightspeed inventory can take up to 24 hours to reflect updated package versions.""" + c = read_report().lower() + has_24 = "24" in c + has_timing = any(t in c for t in ["hour", "propagat", "delay"]) + assert has_24 and has_timing, ( + "should note inventory 24h lag (skill: Best Practices)" + ) + + def test_include_system_profile(self): + """Skill: get_host_details with include_system_profile: true returns installed_packages, enabled_services.""" + c = read_report().lower() + assert any(t in c for t in ["include_system_profile", "system_profile", "installed_packages"]), ( + "should reference include_system_profile for packages/services (skill)" + ) + + def test_insights_client_resync(self): + """Skill: insights-client --check-results triggers inventory re-sync.""" + c = read_report().lower() + assert any(t in c for t in ["insights-client", "check-results"]), ( + "should mention insights-client for inventory resync (skill)" + ) diff --git a/evaluation/with_skills/rh-sre__remediation/environment/Dockerfile b/evaluation/with_skills/rh-sre__remediation/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills 
/root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__remediation/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__remediation/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..2269a235 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,722 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2026-1234": { + "cve_id": "CVE-2026-1234", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Critical kernel vulnerability: remote code execution in kernel network stack allows unauthenticated attackers to execute arbitrary code via crafted packets", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2026-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": 
"Patched", "remediation_available": True}, + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 8, + "total_remediated": 2, + "total_vulnerable": 6, + }, + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": 
"Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers (IDs sys-038..sys-040, matching generate_mock_systems) + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__remediation/instruction.md b/evaluation/with_skills/rh-sre__remediation/instruction.md new file mode 100644 index 00000000..ffd80028 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation/instruction.md @@ -0,0 +1,19 @@ +# CVE Remediation Workflow Task + +You are a Red Hat SRE. A critical CVE has been reported and you need to plan and document a complete end-to-end remediation workflow, from initial validation through execution and verification. + +## Scenario +CVE-2026-1234 (Critical, CVSS 9.8) has been identified as affecting production RHEL systems in your fleet. Management wants a comprehensive remediation plan that covers every phase of the response. 
+ +## Requirements +- Validate the CVE: confirm it is real, assess its severity, and determine if a remediation is available +- Assess the impact: identify which systems are affected and their criticality +- Gather system context: understand each affected system's role, dependencies, and constraints before patching +- Plan playbook generation: how the remediation playbook will be created +- Plan execution: how the playbook will be run (dry-run first, then production), including approval gates and rollback strategy +- Plan verification: how you will confirm remediation was successful after execution +- Present a phased workflow with clear decision points and user confirmation steps at each gate + +Document the complete workflow plan in `/root/report.md`. + +Use MCP tools to query data. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__remediation/solution/solve.sh b/evaluation/with_skills/rh-sre__remediation/solution/solve.sh new file mode 100644 index 00000000..2721e5ff --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Remediation Plan + +## Orchestration Order +1. Validate MCP connectivity +2. CVE impact analysis +3. Validate CVE remediation availability +4. Gather system context +5. Generate playbook +6. Execute playbook +7. Verify remediation + +## CVE-2024-12345 +- Remediatable: Yes +- Systems: 4 production +- Template: Kernel update with boom snapshot + +## Execution +Wait for user confirmation (yes/proceed) before Step 5 (Execute playbook). Dry-run first, then production run. 
+REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__remediation/task.toml b/evaluation/with_skills/rh-sre__remediation/task.toml new file mode 100644 index 00000000..1922d4d5 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__remediation" +name = "rh-sre CVE Remediation Planning Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "remediation", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__remediation/tests/llm_judge.py b/evaluation/with_skills/rh-sre__remediation/tests/llm_judge.py new file mode 100644 index 00000000..c5278840 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "remediation_gate", "file": "/root/report.md", "question": "Does the report gate on remediation availability (checking whether automated remediation is possible for a CVE) before proceeding with playbook generation?", "reference": "A skilled report checks whether automated remediation is available as a prerequisite gate before attempting playbook generation. 
An unskilled report proceeds to generate playbooks without first verifying that remediation is available for the target CVEs."}, + {"id": "plan_before_execution", "file": "/root/report.md", "question": "Does the report present a Remediation Plan with summary/table/checklist for user confirmation before execution?", "reference": "A skilled report requires plan validation before execution. An unskilled report executes without plan review."}, + {"id": "two_part_confirmation", "file": "/root/report.md", "question": "Does the report describe two distinct confirmation checkpoints: one BEFORE starting (upfront planned tasks / Part A) and one AFTER playbook generation but BEFORE execution (execution plan / Part B)?", "reference": "A skilled report has Part A (upfront planned tasks before any remediation step) and Part B (execution plan confirmation after playbook is generated but before running it). An unskilled report has at most one confirmation checkpoint or no structured confirmation phases."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__remediation/tests/test.sh b/evaluation/with_skills/rh-sre__remediation/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__remediation/tests/test_outputs.py b/evaluation/with_skills/rh-sre__remediation/tests/test_outputs.py new file mode 100644 index 00000000..bad4f7c8 --- /dev/null +++ b/evaluation/with_skills/rh-sre__remediation/tests/test_outputs.py @@ -0,0 +1,78 @@ +""" +Tests for rh-sre__remediation per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['remediation', 'orchestrat', 'workflow']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_seven_step_sequence(self): + """Skill: Orchestrate in order: validate MCP → impact → validate CVE → context → playbook → execute → verify.""" + c = read_report().lower() + has_sequence = any(t in c for t in ["validate", "impact", "context", "playbook", "execute", "verify"]) + has_order = any(t in c for t in ["step", "phase", "before", "workflow order", "sequence"]) + assert has_sequence and has_order, ( + "should define 7-step orchestration sequence (skill: workflow order)" + ) + + def test_remediatable_gate(self): + """Skill: Gate on cve-validation: if not remediatable, stop or warn before playbook generation.""" + c = read_report().lower() + has_gate = any(t in c for t in ["remediat", "gate", "remediation_available", "advisory"]) + has_stop = any(t in c for t in ["stop", "cannot proceed", "no automated", "manual"]) + 
assert has_gate or has_stop, ( + "should gate on remediation availability (skill: Remediatable Gate)" + ) + + def test_plan_validation_before_execute(self): + """Skill: Present Remediation Plan (summary, table, checklist); wait for user yes/proceed before Step 5.""" + c = read_report().lower() + has_plan = any(t in c for t in ["plan", "checklist", "summary", "table"]) + has_confirm = any(t in c for t in ["confirm", "proceed", "approval", "yes", "abort"]) + assert has_plan and has_confirm, ( + "should require plan validation before execution (skill: Remediation Plan)" + ) + + def test_dry_run_recommendation(self): + """Skill: Recommend dry-run first; wait for explicit approval before actual execution.""" + c = read_report().lower() + assert any(t in c for t in ["dry-run", "dry run", "check mode", "preview"]), ( + "should recommend dry-run first (skill: before Step 5)" + ) + + def test_two_part_confirmation(self): + """Docs teach Part A (pre-Step-0) and Part B (post-Step-4) confirmations + with ordered step completion marking. 
Without docs, agents use single confirmation.""" + c = read_report().lower() + assert any(t in c for t in [ + "part a", "part b", "pre-step", "post-step", "two-part", + "before step 0", "after step 4", + ]) or ("confirm" in c and "step" in c), ( + "should use two-part confirmation (Part A pre-Step-0, Part B post-Step-4)" + ) diff --git a/evaluation/with_skills/rh-sre__system-context/environment/Dockerfile b/evaluation/with_skills/rh-sre__system-context/environment/Dockerfile new file mode 100644 index 00000000..484ebb33 --- /dev/null +++ b/evaluation/with_skills/rh-sre__system-context/environment/Dockerfile @@ -0,0 +1,52 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + + + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-sre__system-context/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/with_skills/rh-sre__system-context/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..e826c96e --- /dev/null +++ 
b/evaluation/with_skills/rh-sre__system-context/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,759 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def _system_profile_for_host(host_type: str, rhel_version: str, sid: int) -> dict: + """Generate system_profile fields for a host based on type and RHEL version.""" + el = "el9" if rhel_version.startswith("9") else "el8" + # Match on the major version only; a substring test would treat "8.9" as RHEL 9. 
+ kernel = f"5.14.0-362.24.1.{el}_3.x86_64" if rhel_version.startswith("9") else f"4.18.0-477.27.1.{el}.x86_64" + base_pkgs = [ + {"name": "kernel-core", "version": f"5.14.0-362.24.1.{el}.x86_64"}, + {"name": "httpd", "version": f"2.4.57-5.{el}"}, + {"name": "sshd", "version": f"8.9p1-23.{el}"}, + 
{"name": "firewalld", "version": f"1.2.5-4.{el}"}, + {"name": "systemd", "version": f"250-19.{el}"}, + ] + if "web" in host_type or "lb" in host_type: + base_pkgs.extend([ + {"name": "nginx", "version": f"1.24.1-3.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + elif "db" in host_type: + base_pkgs.extend([ + {"name": "postgresql", "version": f"15.4-1.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + elif "mon" in host_type: + base_pkgs.extend([ + {"name": "prometheus", "version": f"2.45.0-1.{el}"}, + {"name": "node_exporter", "version": f"1.6.1-2.{el}"}, + ]) + else: + base_pkgs.extend([ + {"name": "java-17-openjdk", "version": f"17.0.8-4.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + services = ["sshd.service", "firewalld.service", "chronyd.service"] + if "web" in host_type or "lb" in host_type: + services.append("httpd.service") + elif "db" in host_type: + services.extend(["postgresql.service", "postgresql-15.service"]) + elif "mon" in host_type: + services.extend(["prometheus.service", "node_exporter.service"]) + else: + services.append("httpd.service") + ip_octet = 10 + (sid % 245) + mac_hex = f"{(sid % 256):02x}" + return { + "installed_packages": base_pkgs[:8], + "running_services": services, + "network_interfaces": [ + {"name": "eth0", "ipv4": [f"10.0.1.{ip_octet}"], "mac": f"52:54:00:a1:b2:{mac_hex}"}, + {"name": "lo", "ipv4": ["127.0.0.1"], "mac": "00:00:00:00:00:00"}, + ], + "kernel_version": kernel, + } + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + 
tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": 
f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": 
["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + 
sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + # Add system_profile to each host + for idx, s in enumerate(systems): + host_type = "app" # default + for ht in ["web", "db", "app", "lb", "mon", "cache"]: + if ht in s["id"]: + host_type = ht + break + s["system_profile"] = _system_profile_for_host( + host_type, s["rhel_version"], idx + 1 + ) + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 
6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{40+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-045-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-sre__system-context/instruction.md b/evaluation/with_skills/rh-sre__system-context/instruction.md new file mode 100644 index 00000000..95d0540e --- /dev/null +++ b/evaluation/with_skills/rh-sre__system-context/instruction.md @@ -0,0 +1,16 @@ +# System Context Task + +You are a Red Hat SRE. Before rolling out a remediation for a critical vulnerability, you need to gather comprehensive context about the affected systems to make safe remediation decisions. + +## Scenario +A high-severity advisory has been identified that affects multiple systems in your fleet. 
Before applying any patches, you need to understand each affected system's role, current health, installed packages, running services, and any special constraints (maintenance windows, compliance requirements, dependencies). + +## Requirements +- Use MCP tools to query systems in the fleet and identify those affected by the advisory +- For each affected system, gather: system role and criticality, current health and uptime, installed package versions relevant to the advisory, running services that may be impacted, and any compliance or scheduling constraints +- Assess which systems can be patched immediately vs. which need coordination +- Identify dependencies between systems that affect remediation ordering + +Document your system context analysis and remediation readiness assessment in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-sre__system-context/solution/solve.sh b/evaluation/with_skills/rh-sre__system-context/solution/solve.sh new file mode 100644 index 00000000..94c4eb6d --- /dev/null +++ b/evaluation/with_skills/rh-sre__system-context/solution/solve.sh @@ -0,0 +1,19 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# System Context Report + +## Affected Systems +| System | RHEL | Environment | Infrastructure | Tags | +|--------|------|-------------|----------------|------| +| web-01 | 9.3 | Production | bare_metal | pci-compliant | +| db-01 | 8.9 | Staging | virtualized | - | + +## Data Source +get_cve_systems + get_host_details with include_system_profile=true. system_profile: rhel_version, infrastructure_type, installed_packages. 
+ +## Remediation Strategy (Decision Matrix) +- Deployment type: Batch (multiple systems) +- Infrastructure: Bare metal, virtualized +- Maintenance window: Required for production +- Kubernetes: Rolling update with pod eviction if K8s nodes +REPORT_EOF diff --git a/evaluation/with_skills/rh-sre__system-context/task.toml b/evaluation/with_skills/rh-sre__system-context/task.toml new file mode 100644 index 00000000..d060c445 --- /dev/null +++ b/evaluation/with_skills/rh-sre__system-context/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__system-context" +name = "rh-sre System Context Gathering Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "system-context", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-sre__system-context/tests/llm_judge.py b/evaluation/with_skills/rh-sre__system-context/tests/llm_judge.py new file mode 100644 index 00000000..c2970b3d --- /dev/null +++ b/evaluation/with_skills/rh-sre__system-context/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "infrastructure_classification", "file": "/root/report.md", "question": "Does the report classify systems by infrastructure_type (bare_metal/virtualized/container) and infrastructure_vendor?", "reference": "A skilled report uses infrastructure classification fields. 
An unskilled report doesn't distinguish infrastructure types."}, + {"id": "kubernetes_safety_context", "file": "/root/report.md", "question": "Does the report consider Kubernetes context (PDBs, daemonsets) for safe remediation planning?", "reference": "A skilled report checks hasPdbs and daemonsets for safety. An unskilled report ignores K8s workload context."}, + {"id": "staged_rollout", "file": "/root/report.md", "question": "Does the report recommend staged rollout (staging first, then production batches) based on environment criticality?", "reference": "A skilled report follows staged rollout pattern. An unskilled report patches all systems simultaneously."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-sre__system-context/tests/test.sh b/evaluation/with_skills/rh-sre__system-context/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-sre__system-context/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-sre__system-context/tests/test_outputs.py b/evaluation/with_skills/rh-sre__system-context/tests/test_outputs.py new file mode 100644 index 00000000..ff39869d --- /dev/null +++ b/evaluation/with_skills/rh-sre__system-context/tests/test_outputs.py @@ -0,0 +1,84 @@ +""" +Tests for rh-sre__system-context per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['system', 'context', 'environment']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_remediation_strategy_by_context(self): + """Skill: Determine strategy from context: batch vs rolling, maintenance window, pod eviction for K8s.""" + c = read_report().lower() + has_strategy = any(t in c for t in ["strategy", "approach", "rolling", "batch"]) + has_context = any(t in c for t in ["maintenance", "pod eviction", "kubernetes", "staging first"]) + assert has_strategy and has_context, ( + "should derive strategy from context (skill: Decision Matrix)" + ) + + def test_rhel_version_distribution(self): + """Skill: Report RHEL version distribution (playbook must support multiple versions).""" + c = read_report().lower() + assert any(t in c for t in ['rhel', 'version', 'distribution', 'el7', 'el8', 'el9']), ( + "Should report RHEL version distribution (skill: conditional dnf/yum)" + ) + + def 
test_environment_and_criticality(self): + """Skill: Classify by environment (prod/staging/dev) and criticality for rollout order.""" + c = read_report().lower() + has_env = any(t in c for t in ["staging", "development", "rollout_order", "rollout order"]) + has_crit = any(t in c for t in ["critical", "criticality", "priority", "high", "rollout"]) + assert has_env and has_crit, ( + "should classify by environment and criticality (skill: rollout_order)" + ) + + def test_infrastructure_classification(self): + """Skill: infrastructure_type (bare_metal/virtualized/container) and infrastructure_vendor (kvm) fields.""" + c = read_report().lower() + has_type = any(t in c for t in ["infrastructure_type", "infrastructure_vendor", "virtualized"]) + has_bare = "bare_metal" in c or "bare metal" in c + assert has_type or has_bare, ( + "should reference infrastructure classification (skill: bare_metal/virtualized/container)" + ) + + def test_kubernetes_context_fields(self): + """Skill: hasPdbs and daemonsets_present for safety planning in K8s context.""" + c = read_report().lower() + has_k8s = any(t in c for t in ["pdb", "daemonset"]) + has_safety = any(t in c for t in ["safety", "eviction"]) + assert has_k8s and has_safety, ( + "should reference PDB/daemonset for K8s safety (skill)" + ) + + def test_needs_restarting_check(self): + """Docs teach needs-restarting -r (exit code 0=no reboot, 1=reboot needed) + and -s for services needing restart. 
Without docs, agents skip this check.""" + c = read_report().lower() + assert any(t in c for t in [ + "needs-restarting", "needs_restarting", "reboot", "restart service", + ]), "should use needs-restarting for reboot/service restart assessment" diff --git a/evaluation/with_skills/rh-virt__vm-clone/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-clone/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-clone/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": 
["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-clone/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-clone/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..70ce07d7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-clone/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1465 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default, prod-vms, test-env + Nodes: 8 workers (hypervisor-class) + VMs: 33 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("prod-vms", {"env": "production"}), + 
("test-env", {"env": 
"testing"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + 
_vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": 
"true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── prod-vms (instruction-specific) ────────────────────────────────── + _vm("production-db", "prod-vms", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true"}, + 8, 16, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", 
"staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", 
"rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access 
controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 
7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return 
f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": 
"rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + 
labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": 
adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + 
"capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": 
"Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": 
"snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# 
FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.replace("==", "=", 1).split("=", 1)  # accept both "key=value" and "key==value" + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == 
"kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" 
and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", 
"AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: 
+ resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource 
(YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." 
+ headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." + + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + 
}) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." 
+ headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' + headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is 
running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or "run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." 
+ + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-clone/instruction.md b/evaluation/with_skills/rh-virt__vm-clone/instruction.md new file mode 100644 index 00000000..922cf2fb --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-clone/instruction.md @@ -0,0 +1,13 @@ +# VM Cloning Task + +You are an OpenShift Virtualization administrator. The QA team needs an exact copy of the production database VM to test a schema migration. Plan the clone so it can run safely in the test environment without affecting production. 
+ +## Requirements +- Inspect the source VM (`production-db` in `prod-vms`) to understand its current state, storage configuration, and whether it needs to be stopped for cloning +- Determine the right cloning approach: whether the storage backend supports efficient cloning, and whether the VM needs to be offline +- Plan the clone target (`test-db-clone` in `test-env`) ensuring it has no network or storage conflicts with the source +- Verify the clone will be fully independent: separate disks, no shared PVCs, and safe to modify without affecting production + +Document your cloning plan in `/root/report.md`. + +Use MCP tools to examine the cluster. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-virt__vm-clone/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-clone/solution/solve.sh new file mode 100644 index 00000000..17534b7f --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-clone/solution/solve.sh @@ -0,0 +1,34 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Cloning Plan + +## Source: production-db (prod-vms) → Target: test-db-clone (test-env) + +### Cloning Strategy +Using **Clone Storage** (full copy) strategy. Three strategies available: +1. **Clone Storage** (selected) — Full copy of all DataVolumes/PVCs. Independent clone. +2. **Reference Existing** — Shared disk. NOT safe for database workloads. +3. **New Empty Storage** — Fresh disk. Loses data. + +Full copy ensures test-db-clone is completely independent from production-db. 
+ +### Spec Modifications for Clone +- Set `runStrategy: Halted` (don't auto-start the clone) +- Regenerate `domain.firmware.uuid` and `domain.firmware.serial` to avoid conflicts +- Update metadata.name to `test-db-clone` +- Update metadata.namespace to `test-env` +- Update DataVolume names to avoid collision + +### Storage Cloning +- Discover source DataVolumes via label: `vm.kubevirt.io/name=production-db` +- CSI volume cloning support required on the StorageClass +- Create new DataVolume with `source.pvc` referencing the original +- **Reference Existing** = shared disk — data corruption risk if both VMs run +- Use `resources_create_or_update` to create cloned VM and DataVolume + +### Verification +- Check target name `test-db-clone` doesn't exist in `test-env` +- Verify CSI driver supports volume cloning +- Monitor DataVolume clone progress + +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-clone/task.toml b/evaluation/with_skills/rh-virt__vm-clone/task.toml new file mode 100644 index 00000000..46cd3410 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-clone/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-clone" +name = "rh-virt VM Cloning Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-clone", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-clone/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-clone/tests/llm_judge.py new file mode 100644 index 00000000..749897f2 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-clone/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + 
from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "firmware_uuid_regeneration", "file": "/root/report.md", "question": "Does the report address firmware UUID/serial regeneration to avoid identity conflicts between source and clone?", "reference": "A skilled report regenerates domain.firmware.uuid and serial in the clone spec. An unskilled report clones without addressing firmware identity."}, + {"id": "storage_clone_strategy", "file": "/root/report.md", "question": "Does the report discuss DataVolume clone strategy using source.pvc and StorageClass considerations?", "reference": "A skilled report uses DataVolume with source.pvc and considers CSI clone support. An unskilled report copies data manually."}, + {"id": "halted_run_strategy", "file": "/root/report.md", "question": "Does the report set runStrategy: Halted for the cloned VM to start in Stopped state?", "reference": "A skilled report ensures the clone starts halted. An unskilled report starts the clone immediately, risking conflicts."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-clone/tests/test.sh b/evaluation/with_skills/rh-virt__vm-clone/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-clone/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-clone/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-clone/tests/test_outputs.py new file mode 100644 index 00000000..1638de54 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-clone/tests/test_outputs.py @@ -0,0 +1,90 @@ +""" +Tests for rh-virt__vm-clone per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_source_and_target(self): + content = read_report().lower() + has_source = any(t in content for t in ["source", "original", "production"]) + has_target = any(t in content for t in ["clone", "target", "copy", "destination"]) + assert has_source and has_target, "report should identify both a source VM and a clone target" + + +class TestSkillDependent: + def test_storage_class_cloning(self): + """Skill: StorageClass/CSI for PVC cloning strategy.""" + c = read_report().lower() + assert any(t in c for t in ["storageclass", "storage class", "csi", "volume cloning", "pvc clone", "clone support"]), ( + "should mention StorageClass or CSI cloning for clone strategy" + ) + + def test_identity_conflict(self): + """Skill: hostname, cloud-init, SSH key, firmware UUID conflicts between source and clone.""" + c = read_report().lower() + assert any(t in c for t in ["hostname", "cloud-init", "cloud init", "ssh key", "firmware", "uuid", "mac address", "identity conflict"]), ( + "should address identity conflicts (hostname, cloud-init, UUID) between source and clone" + ) + + def test_cross_namespace_rbac(self): + """Skill: RBAC/permissions for cross-namespace cloning.""" + c = 
read_report().lower() + assert any(t in c for t in ["rbac", "permission", "cross-namespace", "cross namespace", "target namespace", "create virtualmachine"]), ( + "should address RBAC or permissions for cross-namespace cloning" + ) + + def test_data_volume_cloning(self): + """Skill: DataVolume with source PVC for clone provisioning.""" + c = read_report().lower() + assert any(t in c for t in ["datavolume", "data volume", "source.pvc", "source pvc", "pvc datasource", "clone storage"]), ( + "should discuss DataVolume or PVC cloning for clone storage" + ) + + def test_datavolume_progress(self): + """Skill: Monitor DataVolume phase (Pending/Succeeded) during clone.""" + c = read_report().lower() + assert any(t in c for t in ["datavolume", "phase", "pending", "succeeded", "cloning progress", "status.phase"]), ( + "should mention monitoring DataVolume phase during clone" + ) + + def test_firmware_uuid_regeneration(self): + """Skill teaches domain.firmware.uuid and domain.firmware.serial must be + regenerated in clone spec to avoid identity conflicts. Without skill, + agents clone without regenerating firmware identifiers.""" + c = read_report().lower() + assert "firmware" in c and ("uuid" in c or "serial" in c), ( + "should address firmware UUID/serial regeneration for clone" + ) + + def test_run_strategy_halted_for_clone(self): + """Skill teaches runStrategy: Halted ensures cloned VM starts in Stopped state. + Without skill, agents start clone immediately.""" + c = read_report().lower() + assert any(t in c for t in ["halted", "runstrategy", "run strategy"]) and ( + "clone" in c or "stop" in c + ), "should set runStrategy: Halted for cloned VM" + + def test_source_pvc_bound(self): + """Docs teach CSI clone prerequisite: source PVC must be in Bound state. 
+ Without docs, agents attempt cloning from unbound PVCs.""" + c = read_report().lower() + assert any(t in c for t in [ + "bound", "pvc status", "source pvc", "prerequisite", + ]) and ("pvc" in c or "storage" in c), ( + "should verify source PVC is Bound before cloning" + ) diff --git a/evaluation/with_skills/rh-virt__vm-create/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-create/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-create/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": 
"python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-create/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-create/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..7b17408d --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-create/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1518 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("vm-testing", {"env": "testing"}), +] + 
+ +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", 
"virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + 
"compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + 
_vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", 
"virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", 
"Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", 
+ "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource 
dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + 
res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": 
"MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": 
adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in 
_RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + 
"vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + +STORAGE_CLASSES = [ + { + "name": "ocs-storagecluster-ceph-rbd", + "provisioner": "openshift-storage.rbd.csi.ceph.com", + "reclaimPolicy": "Delete", + "volumeBindingMode": "Immediate", + "allowVolumeExpansion": True, + }, + { + "name": "ocs-storagecluster-cephfs", + "provisioner": "openshift-storage.cephfs.csi.ceph.com", + "reclaimPolicy": "Delete", + "volumeBindingMode": "Immediate", + "allowVolumeExpansion": False, + }, +] + +VOLUME_SNAPSHOT_CLASSES = [ + { + "name": "ocs-storagecluster-rbdplugin-snapclass", + "driver": "openshift-storage.rbd.csi.ceph.com", + "deletionPolicy": "Delete", + }, +] + + +def _build_storage_class(sc): + """Build a storage.k8s.io/v1 StorageClass resource.""" + res = { + "apiVersion": "storage.k8s.io/v1", + "kind": "StorageClass", + "metadata": { + "name": sc["name"], + "uid": _uid(sc["name"]), + "creationTimestamp": CREATED, + }, + "provisioner": sc["provisioner"], + "reclaimPolicy": sc["reclaimPolicy"], + "volumeBindingMode": sc["volumeBindingMode"], + } + if sc.get("allowVolumeExpansion"): + res["allowVolumeExpansion"] = True + return res + + +def _build_volume_snapshot_class(vsc): + """Build a snapshot.storage.k8s.io/v1 VolumeSnapshotClass resource.""" + return { + "apiVersion": 
"snapshot.storage.k8s.io/v1", + "kind": "VolumeSnapshotClass", + "metadata": { + "name": vsc["name"], + "uid": _uid(vsc["name"]), + "creationTimestamp": CREATED, + }, + "driver": vsc["driver"], + "deletionPolicy": vsc["deletionPolicy"], + } + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": 
_uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in 
selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") 
else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m 
= r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", 
"namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if 
not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not 
support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-create/instruction.md b/evaluation/with_skills/rh-virt__vm-create/instruction.md new file mode 100644 index 00000000..f35ed63f --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-create/instruction.md @@ -0,0 +1,14 @@ +# VM Creation Task + +You are an OpenShift Virtualization administrator. The development team needs a new RHEL 9 VM for testing. Provision `test-vm` in the `vm-testing` namespace with appropriate resources. 
+ +## Requirements +- Examine the cluster to determine available node capacity, storage classes, and existing VM templates +- Define the VM specification: 2 CPUs, 4Gi memory, 30Gi root disk, RHEL 9 operating system +- Choose the storage provisioning strategy (which storage class, access mode, volume mode) based on what the cluster offers +- Document what could go wrong during provisioning (e.g., insufficient capacity, storage class not available, image pull failure) and how to handle each case +- Provide the complete VM resource definition + +Document your provisioning plan and VM specification in `/root/report.md`. + +Use MCP tools to examine the cluster. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-virt__vm-create/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-create/solution/solve.sh new file mode 100644 index 00000000..311af1b5 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-create/solution/solve.sh @@ -0,0 +1,71 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Creation Plan + +## Target: test-vm in vm-testing + +### VirtualMachine Specification + +```yaml +apiVersion: kubevirt.io/v1 +kind: VirtualMachine +metadata: + name: test-vm + namespace: vm-testing +spec: + runStrategy: Always + template: + spec: + domain: + cpu: + cores: 2 + resources: + requests: + memory: 4Gi + devices: + disks: + - name: rootdisk + disk: + bus: virtio + volumes: + - name: rootdisk + dataVolume: + name: test-vm-rootdisk + dataVolumeTemplates: + - metadata: + name: test-vm-rootdisk + spec: + source: + registry: + url: docker://registry.redhat.io/rhel9/rhel-guest-image:latest + storage: + resources: + requests: + storage: 30Gi +``` + +### Storage Configuration +- Using DataVolume with registry source for RHEL 9 guest image +- DataVolume automatically provisions PVC via CDI +- Default StorageClass used (annotated with 
storageclass.kubernetes.io/is-default-class) + +### VM Lifecycle +- `runStrategy: Always` ensures VM starts automatically and restarts on failure +- Alternative: `running: true` for simple start, but runStrategy provides more control +- Instance type/size: small (2 vCPU, 4Gi) for testing purposes + +### Default Credentials +- RHEL 9 guest image: requires cloud-init or SSH key for access + +### Prerequisite Checks +- Verify namespace vm-testing exists +- Check default StorageClass is configured (annotation storageclass.kubernetes.io/is-default-class) +- Verify KubeVirt operator is running +- Ensure sufficient node resources (2 CPU, 4Gi memory) + +### Error Handling (from vm-create skill) +- **ErrorUnschedulable**: Consult scheduling-errors.md; add tolerations via oc patch if node taints block scheduling +- **ErrorDataVolumeNotReady**: Storage provisioning; verify StorageClass, check CDI/DataVolume status +- Access VM: `virtctl console test-vm -n vm-testing` or VNC via OpenShift Console + +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-create/task.toml b/evaluation/with_skills/rh-virt__vm-create/task.toml new file mode 100644 index 00000000..d6ab031e --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-create/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-create" +name = "rh-virt VM Creation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-create", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-create/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-create/tests/llm_judge.py new file mode 100644 index 00000000..8fb930ee --- /dev/null +++ 
b/evaluation/with_skills/rh-virt__vm-create/tests/llm_judge.py @@ -0,0 +1,92 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "unschedulable_handling", "file": "/root/report.md", "question": "Does the report address ErrorUnschedulable and taint/toleration handling for VM placement?", "reference": "A skilled report handles scheduling errors with tolerations. An unskilled report doesn't address placement failures."}, + {"id": "datavolume_provisioning", "file": "/root/report.md", "question": "Does the report describe using DataVolume resources (with CDI) for VM disk provisioning, specifying a source (registry, blank, or PVC)?", "reference": "A skilled report uses DataVolume with a source specification for disk provisioning. An unskilled report creates PVCs manually without CDI integration."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-create/tests/test.sh b/evaluation/with_skills/rh-virt__vm-create/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-create/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
 +exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-create/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-create/tests/test_outputs.py new file mode 100644 index 00000000..5cf84d0d --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-create/tests/test_outputs.py @@ -0,0 +1,71 @@ +""" +Tests for rh-virt__vm-create per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_vm(self): + content = read_report().lower() + assert any(t in content for t in ["vm", "virtual machine", "virtualmachine"]), ( + "report should reference the target VM" + ) + + def test_mentions_namespace(self): + content = read_report().lower() + assert "namespace" in content, "report should mention the target namespace" + + +class TestSkillDependent: + def test_data_volume_provisioning(self): + """Skill: DataVolume for disk provisioning with image/blank source.""" + c = read_report().lower() + assert any(t in c for t in ["datavolume", "data volume", "cdi.kubevirt.io", "source.registry", "source.blank"]), ( + "should discuss DataVolume for disk provisioning" + ) + + def test_storage_class_provisioning(self): + """Skill: StorageClass for DataVolume/PVC provisioning.""" + c = read_report().lower() + assert any(t in c for t in ["storageclass", "storage class", "volumebindingmode", "provisioner"]) and ( + "storage" in c or "pvc" in c or "datavolume" in c + ), ( + "should mention StorageClass for disk provisioning" + ) + + def test_instance_type_or_workload(self): + """Skill: Instance type (u1.medium) or workload (fedora, rhel) resolution.""" + c =
read_report().lower() + assert any(t in c for t in ["instancetype", "instance type", "u1.", "u1.medium", "workload", "fedora", "rhel", "ubuntu", "centos"]), ( + "should reference instance types or workload/OS selection" + ) + + def test_unschedulable_toleration(self): + """Skill: ErrorUnschedulable and toleration workaround.""" + c = read_report().lower() + assert any(t in c for t in ["errorunschedulable", "unschedulable", "taint", "toleration", "scheduling"]) and ( + "taint" in c or "toleration" in c or "unschedulable" in c + ), ( + "should address ErrorUnschedulable and taint/toleration handling" + ) + + def test_yaml_or_manifest(self): + """Should include a YAML manifest or structured spec.""" + content = read_report() + assert "apiVersion" in content or "kind:" in content or "spec:" in content or "```yaml" in content or "```yml" in content, ( + "should include a YAML manifest or structured specification" + ) diff --git a/evaluation/with_skills/rh-virt__vm-delete/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-delete/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-delete/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p 
/logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-delete/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-delete/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2aaace7d --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-delete/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1464 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. 
 + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default, + decommission + Nodes: 8 workers (hypervisor-class) + VMs: 33 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("decommission", {"env": "decommission"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03",
"dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", 
"virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", 
"production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── decommission (instruction-specific) ────────────────────────────── + _vm("legacy-app", "decommission", "hv-prod-dc1-01", "rhel-8.6", None, + {"app": "legacy-app", "criticality": "low", "legacy": "true"}, + 2, 4, "Running", True, 30), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", 
"rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + 
_adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under 
specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + 
"VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + 
annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): 
+ return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + 
"uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + 
for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": 
vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": 
"restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build 
a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + 
return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, _, v = sel.partition("==" if "==" in sel else "=") # accept both "k=v" and "k==v" + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": +
resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, 
headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + 
"apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources 
+ if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def 
resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." 
+ headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." + + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + 
}) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." 
+ headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' + headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is 
running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or "run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." 
+ + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-delete/instruction.md b/evaluation/with_skills/rh-virt__vm-delete/instruction.md new file mode 100644 index 00000000..5769196b --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-delete/instruction.md @@ -0,0 +1,12 @@ +# VM Deletion Task + +You are an OpenShift Virtualization administrator. Plan the safe deletion of VM `legacy-app` in namespace `decommission`. + +## Requirements +- Perform pre-deletion safety checks +- Define the deletion scope (VM only vs VM + storage) +- Include safeguards against accidental deletion + +Use MCP tools to examine the cluster. Document your methodology, findings, and deletion plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-virt__vm-delete/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-delete/solution/solve.sh new file mode 100644 index 00000000..6d87b29d --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-delete/solution/solve.sh @@ -0,0 +1,31 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Deletion Plan + +## Target: legacy-app in decommission + +### Pre-Deletion Safety Checks +1. **Protection label**: Check `metadata.labels.protected` — if `"true"`, deletion is blocked. Remove with `oc label vm legacy-app -n decommission protected-` +2. **Running state**: If VM is running, stop it first via `vm_lifecycle` action=stop +3. **Storage discovery**: List DataVolumes with label `vm.kubevirt.io/name=legacy-app` + +### Deletion Scope Options +- **VM Only** — Keep associated storage (DataVolumes/PVCs) for data recovery +- **VM + Storage** (selected) — Full cleanup of VM and all associated DataVolumes/PVCs + +### Deletion Procedure +1. Verify VM exists and is stopped (use vm_lifecycle action=stop if running) +2. List all associated DataVolumes (apiVersion: cdi.kubevirt.io/v1beta1, labelSelector: vm.kubevirt.io/name=legacy-app) +3. Present deletion scope and storage list +4. **Typed confirmation required**: User must type exact VM name `legacy-app` to proceed +5. Delete VM via resources_delete +6. Delete associated DataVolumes and PVCs via resources_delete +7. Verify deletion completed (resource no longer exists) +8. 
If VM stuck Terminating: consult lifecycle-errors.md, check finalizers + +### Post-Deletion Verification +- Confirm VM resource is gone +- Confirm DataVolumes and PVCs are cleaned up; if a PV has reclaim policy Retain, patch it to Delete first so cleanup is not blocked +- Check for any orphaned resources (finalizers) + +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-delete/task.toml b/evaluation/with_skills/rh-virt__vm-delete/task.toml new file mode 100644 index 00000000..063c79fd --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-delete/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-delete" +name = "rh-virt VM Deletion Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-delete", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-delete/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-delete/tests/llm_judge.py new file mode 100644 index 00000000..e1bed079 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-delete/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "protected_label", "file": "/root/report.md", "question": "Does the report check for protected: true label that blocks deletion?", "reference": "A skilled report checks protection labels. An unskilled report attempts deletion without safety checks."}, + {"id": "storage_scope", "file": "/root/report.md", "question": "Does the report distinguish VM-only vs VM+storage deletion and warn about orphaned PVCs?", "reference": "A skilled report offers storage scope choice. 
An unskilled report deletes everything without distinction."}, + {"id": "typed_confirmation", "file": "/root/report.md", "question": "Does the report require typed VM name confirmation (exact, case-sensitive) before deletion?", "reference": "A skilled report requires exact name match confirmation. An unskilled report uses yes/no confirmation."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-delete/tests/test.sh b/evaluation/with_skills/rh-virt__vm-delete/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-delete/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-delete/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-delete/tests/test_outputs.py new file mode 100644 index 00000000..a1c73806 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-delete/tests/test_outputs.py @@ -0,0 +1,82 @@ +""" +Tests for rh-virt__vm-delete per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_vm(self): + content = read_report().lower() + assert any(t in content for t in ["vm", "virtual machine", "virtualmachine"]), ( + "report should reference the target VM" + ) + + def test_mentions_deletion(self): + content = read_report().lower() + assert "delet" in content, "report should discuss deletion" + + +class TestSkillDependent: + def test_stop_before_delete(self): + """Skill: Must stop VM before deletion; vm_lifecycle stop.""" + c = read_report().lower() + assert any(t in c for t in ["stop before delet", "stop and delete", "vm_lifecycle", "halt", "must stop", "running"]) and ( + "stop" in c or "halt" in c + ), ( + "should require stopping VM before deletion" + ) + + def test_orphan_storage(self): + """Skill: VM-only vs VM+storage; orphan PVCs; delete DataVolume/PVC.""" + c = read_report().lower() + assert any(t in c for t in ["vm only", "vm+storage", "datavolume", "orphan", "preserve storage", "delete storage", "pvc"]) and ( + "storage" in c or "pvc" in c or "datavolume" in c + ), ( + "should address storage scope (VM-only vs VM+storage, orphan PVCs)" + ) + + def test_finalizer_handling(self): + """Skill: Finalizer blocking deletion; stuck Terminating.""" + 
c = read_report().lower() + assert any(t in c for t in ["finalizer", "terminating", "stuck", "resources_create_or_update", "remove finalizer"]), ( + "should address finalizer handling for stuck deletion" + ) + + def test_typed_confirmation(self): + """Skill: Typed VM name confirmation (exact match) before delete.""" + c = read_report().lower() + assert any(t in c for t in ["type", "typed", "exact name", "confirm", "to confirm"]) and ( + "name" in c or "vm" in c + ), ( + "should require typed VM name confirmation" + ) + + def test_protected_label(self): + """Skill: protected: true label blocks deletion.""" + c = read_report().lower() + assert any(t in c for t in ["protected", "protected label", "metadata.labels", "refuse delet"]), ( + "should address protected label blocking deletion" + ) + + def test_reclaim_policy_retain(self): + """Docs teach PV reclaim policy Retain blocks PVC deletion; must patch PV + to Delete first. Without docs, agents don't handle stuck PVC cleanup.""" + c = read_report().lower() + assert any(t in c for t in [ + "retain", "reclaim", "reclaimpolicy", "reclaim policy", + "patch pv", "delete policy", + ]), "should address PV reclaim policy Retain blocking cleanup" diff --git a/evaluation/with_skills/rh-virt__vm-inventory/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-inventory/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-inventory/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: 
https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-inventory/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-inventory/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2e083d72 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-inventory/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1458 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. 
+ +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 
1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", 
"production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", 
"compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", 
"rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", 
"virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM 
advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + 
"VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == 
"AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in 
vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + 
"devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + 
"metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": 
"ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + 
"vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + 
"vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) 
+ if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, 
headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], 
s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], 
+ } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No 
resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# 
═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-inventory/instruction.md b/evaluation/with_skills/rh-virt__vm-inventory/instruction.md new file mode 100644 index 00000000..28107e57 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-inventory/instruction.md @@ -0,0 +1,14 @@ +# VM Inventory Task + +You are an OpenShift Virtualization administrator. Your team needs a complete picture of every VM in the cluster for capacity planning and compliance reporting. 
+ +## Requirements +- List every VM across all namespaces, grouped by namespace +- For each VM report: name, status (Running/Stopped/Paused), CPU and memory allocation, operating system, and IP address if running +- Identify any VMs with issues: stopped unexpectedly, guest agent not responding, degraded conditions, or running end-of-life operating systems +- Summarize totals: how many VMs per namespace, how many running vs stopped, total resource allocation +- Sort results by namespace, then by VM name + +Write the inventory report in `/root/report.md`. + +Use MCP tools to gather VM data. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-virt__vm-inventory/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-inventory/solution/solve.sh new file mode 100644 index 00000000..3473c6d5 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-inventory/solution/solve.sh @@ -0,0 +1,32 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Inventory Report + +## Cluster VM Summary + +| Namespace | VM Name | Status | Resources | Storage | OS | IP | Node | +|-----------|---------|--------|-----------|---------|----|----|------| +| prod-vms | production-db | Running | 4 vCPU, 16Gi | 100Gi | RHEL 9.3 | 10.128.2.15 | worker-01 | +| prod-vms | web-frontend | Running | 2 vCPU, 4Gi | 50Gi | Fedora 39 | 10.128.2.16 | worker-02 | +| dev-vms | dev-test | Stopped | 2 vCPU, 8Gi | 50Gi | Ubuntu 22.04 | — | — | + +### Status Summary +- Running: 2 +- Stopped: 1 +- Total: 3 + +### Data Sources +- VM status: `status.printableStatus` from VirtualMachine resource +- Resource details: Extracted from VirtualMachineInstance (VMI) when running via resources_list (apiVersion kubevirt.io/v1, allNamespaces=true for cluster-wide) +- CPU: `.spec.domain.cpu.sockets` (displayed as vCPU) +- Memory: `.spec.domain.memory.guest` +- Storage: 
`.status.volumeStatus[].persistentVolumeClaimInfo.capacity.storage` +- OS: `.status.guestOSInfo.prettyName` +- IP: `.status.interfaces[0].ipAddress` +- Node: `.status.nodeName` +- Conditions: Ready, AgentConnected, LiveMigratable + +### Sort Order +Sorted by: Namespace → Status (Running → Pending → Stopped → Failed) → VM Name + +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-inventory/task.toml b/evaluation/with_skills/rh-virt__vm-inventory/task.toml new file mode 100644 index 00000000..6a756f27 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-inventory/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-inventory" +name = "rh-virt VM Inventory Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-inventory", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-inventory/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-inventory/tests/llm_judge.py new file mode 100644 index 00000000..aabb1dab --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-inventory/tests/llm_judge.py @@ -0,0 +1,92 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "vmi_vs_vm_distinction", "file": "/root/report.md", "question": "Does the report distinguish between VirtualMachine (spec/desired state) and VirtualMachineInstance (runtime state) as separate resources to query?", "reference": "A skilled report queries both VM and VMI, understanding VM defines the spec while VMI reflects the running state. 
An unskilled report only queries VirtualMachine without VMI runtime data."}, + {"id": "status_ordering", "file": "/root/report.md", "question": "Does the report organize or sort VMs by operational status (e.g., Running first, then Pending, Stopped, Failed) rather than just listing alphabetically?", "reference": "A skilled report groups or sorts VMs by status priority. An unskilled report lists VMs in arbitrary order without status-based organization."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-inventory/tests/test.sh b/evaluation/with_skills/rh-virt__vm-inventory/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-inventory/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-inventory/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-inventory/tests/test_outputs.py new file mode 100644 index 00000000..16ded70a --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-inventory/tests/test_outputs.py @@ -0,0 +1,67 @@ +""" +Tests for rh-virt__vm-inventory per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_has_structured_data(self): + content = read_report() + has_table = "|" in content and content.count("|") >= 4 + has_list = content.count("- ") >= 5 + assert has_table or has_list, "report should present VM inventory in a structured format (table or list)" + + def test_mentions_namespace(self): + content = read_report().lower() + assert "namespace" in content, "report should organize by namespace" + + +class TestSkillDependent: + def test_vmi_runtime_data(self): + """Skill: Query VirtualMachineInstance (VMI) for running VM runtime data.""" + c = read_report().lower() + assert any(t in c for t in ["virtualmachineinstance", "vmi", "virtual machine instance"]), ( + "should reference VMI for runtime data, not just VirtualMachine" + ) + + def test_resource_format(self): + """Skill: Resources as 'X vCPU, YGi' format, not instance type names like u1.medium.""" + c = read_report().lower() + assert any(t in c for t in ["vcpu", "vcpus"]) and any(t in c for t in ["gi", "gib"]), ( + "should use vCPU/Gi resource format, not instance type names" + ) + + def test_status_based_grouping(self): + """Skill: Sort by namespace, then status (Running > Pending > Stopped > Failed), then 
name.""" + c = read_report().lower() + status_terms = sum(1 for t in ["running", "stopped", "pending", "failed"] if t in c) + has_organization = any(t in c for t in [ + "group", "sort", "order", "organiz", "by namespace", + "by status", "running first", "namespace", + ]) + assert status_terms >= 2 and has_organization, ( + "should organize VMs with status awareness (Running/Stopped/etc) by namespace" + ) + + def test_conditions_awareness(self): + """Skill: KubeVirt-specific conditions — AgentConnected, LiveMigratable.""" + c = read_report().lower() + assert any(t in c for t in [ + "agentconnected", "agent connected", "agent_connected", + "livemigratable", "live migratable", "live_migratable", + "guest agent", + ]), "should mention KubeVirt-specific conditions (AgentConnected, LiveMigratable)" diff --git a/evaluation/with_skills/rh-virt__vm-lifecycle-manager/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills 
/root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-lifecycle-manager/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..31b95dd3 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1467 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. 
+ +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default, prod-vms + Nodes: 8 workers (hypervisor-class) + VMs: 34 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("prod-vms", {"env": "production"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", 
"Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", 
"virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", 
"production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── prod-vms (instruction-specific) ────────────────────────────────── + _vm("web-frontend", "prod-vms", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "customer-facing": "true", "criticality": "high"}, + 4, 8, "Running", True, 1), + _vm("production-db", "prod-vms", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true"}, + 8, 16, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), 
+ _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", 
"Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd 
information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + 
"VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = 
vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def 
_build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = 
str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + 
"affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + 
"metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", 
+ "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], 
+ }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" 
".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + # Accept both "key=value" and "key==value": drop the extra "=" left in v. + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.lstrip("=").strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), 
str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") 
+ for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def 
configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + 
resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + 
gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." 
+ headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." + + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + 
}) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." 
+ headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' + headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is 
running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or "run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." 
+ + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-lifecycle-manager/instruction.md b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/instruction.md new file mode 100644 index 00000000..622a3d38 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/instruction.md @@ -0,0 +1,12 @@ +# VM Lifecycle Operations Task + +You are an OpenShift Virtualization administrator. Plan lifecycle operations for VMs in the cluster: stop `web-frontend` and restart `production-db`, both in namespace `prod-vms`. + +## Requirements +- Define the procedure for each operation +- Address the correct sequencing for restart (not a single atomic operation) +- Include verification steps + +Use MCP tools to examine the cluster. Document your methodology, procedures, and verification steps in `/root/report.md`. 
+ +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-virt__vm-lifecycle-manager/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/solution/solve.sh new file mode 100644 index 00000000..37e96d65 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/solution/solve.sh @@ -0,0 +1,31 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Lifecycle Operations Plan + +## Operation 1: Stop web-frontend +- Tool: `vm_lifecycle(namespace="prod-vms", name="web-frontend", action="stop")` +- Effect: Sets runStrategy to Halted +- Verify: `status.printableStatus` changes to "Stopped" + +## Operation 2: Restart production-db +Restart requires TWO separate calls to avoid resourceVersion conflicts: +1. `vm_lifecycle(namespace="prod-vms", name="production-db", action="stop")` +2. Wait for `status.printableStatus == "Stopped"` (poll every 5 seconds) +3. `vm_lifecycle(namespace="prod-vms", name="production-db", action="start")` + +### RunStrategy Mapping +| Action | RunStrategy Set | +|--------|----------------| +| start | Always | +| stop | Halted | +| restart | Always (after stop completes) | + +### Caveats +- Restart is NOT a single atomic operation — it's stop + wait + start +- Avoid resourceVersion conflicts: use resources_get to verify printableStatus before start +- Graceful shutdown: VM guest agent handles ACPI shutdown signal +- If VM doesn't stop within timeout, force stop may be needed +- Always verify stopped status before issuing start to avoid conflicts +- Consult lifecycle-errors.md for start/stop failures + +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-lifecycle-manager/task.toml b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/task.toml new file mode 100644 index 00000000..29808afd --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + 
+[metadata] +id = "rh-virt__vm-lifecycle-manager" +name = "rh-virt VM Lifecycle Management Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-lifecycle-manager", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/llm_judge.py new file mode 100644 index 00000000..1e8ef2e1 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "two_step_restart", "file": "/root/report.md", "question": "Does the report implement restart as stop→verify stopped→start rather than a single atomic operation?", "reference": "A skilled report separates stop and start to avoid resourceVersion conflicts. An unskilled report uses a single restart command."}, + {"id": "run_strategy_mapping", "file": "/root/report.md", "question": "Does the report map start to RunStrategy: Always and stop to RunStrategy: Halted?", "reference": "A skilled report uses RunStrategy for lifecycle control. An unskilled report uses power state concepts."}, + {"id": "state_verification", "file": "/root/report.md", "question": "Does the report verify that the VM reached the expected state (Stopped/Running) before proceeding to the next operation?", "reference": "A skilled report verifies printableStatus between operations. 
An unskilled report assumes instant state changes."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
 (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/test.sh b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/test_outputs.py new file mode 100644 index 00000000..98907dad --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-lifecycle-manager/tests/test_outputs.py @@ -0,0 +1,75 @@ +""" +Tests for rh-virt__vm-lifecycle-manager per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_operations(self): + c = read_report().lower() + assert ("stop" in c or "halt" in c) and ("restart" in c or "start" in c), ( + "report should discuss stop and restart operations" + ) + + def test_mentions_vms(self): + c = read_report().lower() + assert any(t in c for t in ["vm", "virtual machine", "virtualmachine"]), ( + "report should reference the target VMs" + ) + + +class TestSkillDependent: + def test_two_step_restart(self): + """Skill: Restart = stop then start (not single atomic); resourceVersion conflict.""" + c = read_report().lower() + assert ("stop" in c and "start" in c) and any(t in c for t in ["two", "separate", "sequence", "then", "first", "resourceversion", "conflict"]), ( + "should explain restart as stop-then-start, not single operation" + ) + + def test_run_strategy_control(self): + """Skill: RunStrategy Always/Halted for start/stop; not generic power state.""" + c = read_report().lower() + assert any(t in c for t in ["runstrategy", "run strategy", "always", "halted"]) and ( + "start" in c or "stop" in c + ), ( + "should map start/stop to RunStrategy (Always/Halted)" + ) + + def 
test_ready_verification(self): + """Skill: Verify status.printableStatus Stopped/Running after each step.""" + c = read_report().lower() + assert any(t in c for t in ["printablestatus", "printable status", "status", "stopped", "running"]) and ( + any(t in c for t in ["verify", "check", "poll", "wait", "before start"]) + ), ( + "should verify VM reached expected state before proceeding" + ) + + def test_vm_lifecycle_tool(self): + """Skill: vm_lifecycle MCP tool for start/stop/restart.""" + c = read_report().lower() + assert any(t in c for t in ["vm_lifecycle", "vm lifecycle", "lifecycle tool", "mcp"]), ( + "should reference vm_lifecycle or MCP lifecycle tool" + ) + + def test_restart_composite(self): + """Skill: Restart implemented as stop → verify stopped → wait → start.""" + c = read_report().lower() + has_stop_start = "stop" in c and "start" in c + has_wait = any(t in c for t in ["wait", "5 second", "poll", "verify stopped"]) + assert has_stop_start and has_wait, ( + "should include wait/verify between stop and start for restart" + ) diff --git a/evaluation/with_skills/rh-virt__vm-rebalance/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-rebalance/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-rebalance/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ 
+users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-rebalance/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-rebalance/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2e083d72 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-rebalance/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1458 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. 
+ +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 
1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", 
"production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", 
"compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", 
"rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", 
"virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM 
advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + 
"VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == 
"AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in 
vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + 
"devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + 
"metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": 
"ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + 
"vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + 
"vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) 
+ if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, 
headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], 
s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], 
+ } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No 
resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# 
═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-rebalance/instruction.md b/evaluation/with_skills/rh-virt__vm-rebalance/instruction.md new file mode 100644 index 00000000..b4e5c640 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-rebalance/instruction.md @@ -0,0 +1,13 @@ +# VM Rebalancing Task + +You are an OpenShift Virtualization administrator. Node `hv-prod-dc1-02` is critically overloaded (88% CPU, 82% memory). Plan how to rebalance its workloads by migrating one or more VMs to less utilized nodes. 
+ +## Requirements +- Examine current node utilization and identify which VMs on `hv-prod-dc1-02` are candidates for migration +- Evaluate migration feasibility for each candidate and determine the appropriate migration method +- Select appropriate target nodes based on available capacity and schedulability +- Identify risks and safety considerations that could affect the migration + +Use MCP tools to examine the cluster. Document your methodology, findings, and migration plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/with_skills/rh-virt__vm-rebalance/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-rebalance/solution/solve.sh new file mode 100644 index 00000000..1f48a04e --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-rebalance/solution/solve.sh @@ -0,0 +1,41 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Rebalancing Plan + +## Current State +Node hv-prod-dc1-02 is critically overloaded: 88% CPU (14080m/16000m), 82% memory (53739Mi/65536Mi). +VMs on this node: vm-web-prod-03, vm-api-prod-01, vm-cache-prod-01, vm-etl-prod-01. + +## Migration Candidates +- vm-web-prod-03 (4 CPU, 8Gi): good candidate, RWX storage supports live migration +- vm-cache-prod-01 (2 CPU, 4Gi): good candidate, small footprint +- vm-etl-prod-01 (4 CPU, 8Gi): degraded (high I/O latency), could benefit from migration but risky during active I/O + +## Live Migration Prerequisites +1. **Storage access mode**: Must be ReadWriteMany (RWX) for live migration. ReadWriteOnce (RWO) requires cold migration (VM must be stopped first). +2. **Node schedulability**: Target node must be schedulable (not cordoned or in maintenance). +3. **CPU model compatibility**: Source and target nodes must support the same CPU model. +4. **Available capacity**: Use allocated vCPU/memory from VM spec, not runtime usage metrics. 
+ +## Target Node Selection +- hv-prod-dc1-01: 74% CPU, 68% memory — can accept one small VM +- hv-prod-dc1-03: cordoned for maintenance — NOT schedulable +- hv-prod-dc2-01/02: different datacenter zone, only for cross-zone rebalancing + +Recommendation: Migrate vm-cache-prod-01 (2 CPU, 4Gi) to hv-prod-dc1-01. + +## Anti-Patterns to Avoid +- **No ping-pong**: Don't migrate VMs back and forth between nodes repeatedly +- **Avoid resource overcommit**: Calculate post-migration allocated resources to ensure target stays below 85% +- **Don't migrate during peak hours**: Schedule during maintenance windows +- **Cold migration caution**: Re-read VM before updating nodeAffinity to avoid resourceVersion conflict +- **Overcommit warning**: If any node exceeds 85% after rebalance, escalate + +## Migration Procedure +1. Verify vm-cache-prod-01 storage is RWX (live migration supported) +2. Verify hv-prod-dc1-01 has capacity for 2 CPU + 4Gi after migration +3. Create VirtualMachineInstanceMigration resource +4. Monitor migration progress for convergence +5. 
Verify VM is healthy on target node post-migration + +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-rebalance/task.toml b/evaluation/with_skills/rh-virt__vm-rebalance/task.toml new file mode 100644 index 00000000..d79dfbba --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-rebalance/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-rebalance" +name = "rh-virt VM Rebalancing Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-rebalance", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-rebalance/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-rebalance/tests/llm_judge.py new file mode 100644 index 00000000..76052f1f --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-rebalance/tests/llm_judge.py @@ -0,0 +1,92 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "cpu_compatibility_check", "file": "/root/report.md", "question": "Does the report check CPU model or feature compatibility between source and target nodes before recommending migration?", "reference": "A skilled report verifies CPU compatibility (model, features) to ensure live migration success. 
An unskilled report migrates VMs without CPU compatibility checks."}, + {"id": "overcommit_awareness", "file": "/root/report.md", "question": "Does the report assess overcommit risk (whether the target node will exceed capacity after receiving migrated VMs)?", "reference": "A skilled report calculates whether the target node can handle the additional load without overcommitting. An unskilled report moves VMs without capacity verification."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-rebalance/tests/test.sh b/evaluation/with_skills/rh-virt__vm-rebalance/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-rebalance/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-rebalance/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-rebalance/tests/test_outputs.py new file mode 100644 index 00000000..ea445584 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-rebalance/tests/test_outputs.py @@ -0,0 +1,57 @@ +""" +Tests for rh-virt__vm-rebalance per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_migration(self): + content = read_report().lower() + assert "migrat" in content, "report should discuss migration" + + def test_mentions_node(self): + content = read_report().lower() + assert any(t in content for t in ["node", "overload", "imbalance", "utilization"]), ( + "report should reference cluster nodes or load imbalance" + ) + + +class TestSkillDependent: + def test_cpu_compatibility(self): + """Skill: CPU model/feature compatibility between source and target nodes.""" + c = read_report().lower() + assert any(t in c for t in ["cpu model", "cpu compatible", "cpu feature", "cpu architecture", "migration compatibility"]) or ( + "cpu" in c and ("compatib" in c or "model" in c) + ), ( + "should address CPU compatibility for migration" + ) + + def test_virtualmachineinstancemigration(self): + """Skill: VirtualMachineInstanceMigration for live migration.""" + c = read_report().lower() + assert any(t in c for t in ["virtualmachineinstancemigration", "vmi migration", "migration cr", "migration resource"]), ( + "should reference VirtualMachineInstanceMigration API" + ) + + def test_overcommit_warning(self): + """Skill: Overcommit detection; warn if node 
exceeds 100% after rebalance.""" + c = read_report().lower() + assert any(t in c for t in ["overcommit", "over commit", "exceed 100", "capacity"]) and ( + "overcommit" in c or "100" in c or "exceed" in c + ), ( + "should address overcommit risk when rebalancing" + ) diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-create/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-snapshot-create/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-create/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + 
"command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-create/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-snapshot-create/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..912fb2d6 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-create/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1539 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + 
("prod-vms", {"env": "production"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, 
"Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", 
"compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── prod-vms (instruction-specific) ────────────────────────────────── + _vm("production-db", "prod-vms", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true"}, + 8, 16, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", 
"hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = 
[ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows 
attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + 
"VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = 
hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": 
_firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": 
n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": 
_uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": 
"Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": 
"virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + +STORAGE_CLASSES = [ + { + "name": "ocs-storagecluster-ceph-rbd", + "provisioner": "openshift-storage.rbd.csi.ceph.com", + "reclaimPolicy": "Delete", + "volumeBindingMode": "Immediate", + "allowVolumeExpansion": True, + }, + { + "name": "ocs-storagecluster-cephfs", + "provisioner": "openshift-storage.cephfs.csi.ceph.com", + "reclaimPolicy": "Delete", + "volumeBindingMode": "Immediate", + "allowVolumeExpansion": False, + }, +] + +VOLUME_SNAPSHOT_CLASSES = [ + { + "name": "ocs-storagecluster-rbdplugin-snapclass", + "driver": "openshift-storage.rbd.csi.ceph.com", + "deletionPolicy": "Delete", + }, +] + + +def _build_storage_class(sc): + """Build a storage.k8s.io/v1 StorageClass resource.""" + res = { + "apiVersion": "storage.k8s.io/v1", + "kind": "StorageClass", + "metadata": { + "name": sc["name"], + "uid": _uid(sc["name"]), + "creationTimestamp": CREATED, + }, + "provisioner": sc["provisioner"], + 
"reclaimPolicy": sc["reclaimPolicy"], + "volumeBindingMode": sc["volumeBindingMode"], + } + if sc.get("allowVolumeExpansion"): + res["allowVolumeExpansion"] = True + return res + + +def _build_volume_snapshot_class(vsc): + """Build a snapshot.storage.k8s.io/v1 VolumeSnapshotClass resource.""" + return { + "apiVersion": "snapshot.storage.k8s.io/v1", + "kind": "VolumeSnapshotClass", + "metadata": { + "name": vsc["name"], + "uid": _uid(vsc["name"]), + "creationTimestamp": CREATED, + }, + "driver": vsc["driver"], + "deletionPolicy": vsc["deletionPolicy"], + } + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + 
"restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for 
r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + 
return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return 
[m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "storage.k8s.io/v1" and kind == "StorageClass": + resources = [_build_storage_class(sc) for sc in STORAGE_CLASSES] + headers = ["NAME", "PROVISIONER", "RECLAIMPOLICY", "VOLUMEBINDINGMODE", "ALLOWVOLUMEEXPANSION", "AGE"] + def row(r): + return [r["metadata"]["name"], r["provisioner"], + r["reclaimPolicy"], r["volumeBindingMode"], + str(r.get("allowVolumeExpansion", False)), "90d"] + return resources, headers, row, False + + if api_version == "snapshot.storage.k8s.io/v1" and kind == "VolumeSnapshotClass": + resources = [_build_volume_snapshot_class(vsc) for vsc in VOLUME_SNAPSHOT_CLASSES] + headers = ["NAME", "DRIVER", "DELETIONPOLICY", "AGE"] + def row(r): + return [r["metadata"]["name"], r["driver"], r["deletionPolicy"], "90d"] + return resources, headers, row, False + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], 
m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, 
optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." 
+ + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + # Pluralize the kind (kinds ending in "s", such as StorageClass, take "es") and + # append the API group only for non-core resources: apiVersion "v1" has no group. + plural = kind.lower() + ("es" if kind.lower().endswith("s") else "s") + resource = f'{plural}.{apiVersion.split("/")[0]}' if "/" in apiVersion else plural + return f'Error from server (NotFound): {resource} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: 
NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + if fieldSelector: + # Honor the equality selectors a mock pod can answer; other keys are ignored. + for sel in fieldSelector.split(","): + k, _, v = sel.partition("=") + k, v = k.strip(), v.lstrip("=").strip() + if k == "metadata.name": + pods = [p for p in pods if p["metadata"]["name"] == v] + elif k == "spec.nodeName": + pods = [p for p in pods if p["spec"]["nodeName"] == v] + elif k == "status.phase": + pods = [p for p in pods if p["status"]["phase"] == v] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-create/instruction.md b/evaluation/with_skills/rh-virt__vm-snapshot-create/instruction.md new file mode 100644 index 00000000..34f38f23 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-create/instruction.md @@ -0,0 +1,12 @@ +# VM Snapshot Creation Task + +You are an OpenShift Virtualization administrator. Create a snapshot of VM `production-db` in namespace `prod-vms`. + +## Requirements +- Verify snapshot prerequisites (storage support, guest agent) +- Define the snapshot specification +- Address snapshot consistency levels and monitoring + +Use MCP tools to examine the cluster. Work autonomously — do not wait for user confirmation at any step. Document your methodology, findings, and snapshot plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-create/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-snapshot-create/solution/solve.sh new file mode 100644 index 00000000..22659dde --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-create/solution/solve.sh @@ -0,0 +1,39 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Snapshot Plan + +## Target: production-db in prod-vms + +### Storage Snapshot Support Checks +1. Check VM `status.volumeSnapshotStatuses` for snapshot support +2. Verify no hot-plugged volumes (block snapshots - must stop VM and persist or remove) +3. Check StorageClass has a VolumeSnapshotClass +4. Verify CSI driver supports snapshots +5. Check for guest agent (determines consistency level) +6. Create via resources_create_or_update; poll status.phase (InProgress/Succeeded/Failed) and status.readyToUse + +### Snapshot Type +- **With guest agent**: Application-consistent (freeze/thaw of filesystem) + - `status.indications` will show `GuestAgent` +- **Without guest agent**: Crash-consistent (point-in-time disk state) + - `status.indications` will show `Online` only + +### VirtualMachineSnapshot YAML +```yaml +apiVersion: snapshot.kubevirt.io/v1beta1 +kind: VirtualMachineSnapshot +metadata: + name: production-db-backup-20240301 + namespace: prod-vms +spec: + source: + apiGroup: kubevirt.io + kind: VirtualMachine + name: production-db +``` + +### Monitoring +- Poll `status.phase`: InProgress → Succeeded or Failed +- Check `status.readyToUse: true` before relying on snapshot + +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-create/task.toml b/evaluation/with_skills/rh-virt__vm-snapshot-create/task.toml new file mode 100644 index 00000000..c563a3ed --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-create/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-snapshot-create" +name = "rh-virt VM Snapshot Creation Skill Evaluation" +difficulty = "medium" 
+category = "per-skill-eval" +tags = ["rh-virt", "vm-snapshot-create", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/llm_judge.py new file mode 100644 index 00000000..cf067a9c --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "volume_snapshot_class", "file": "/root/report.md", "question": "Does the report check for VolumeSnapshotClass as a prerequisite for CSI snapshot support?", "reference": "A skilled report verifies VolumeSnapshotClass exists. An unskilled report attempts snapshots without checking prerequisites."}, + {"id": "hot_plugged_blocker", "file": "/root/report.md", "question": "Does the report note that hot-plugged volumes block snapshot creation entirely?", "reference": "A skilled report checks for hot-plugged volumes. An unskilled report doesn't know about this blocker."}, + {"id": "consistency_levels", "file": "/root/report.md", "question": "Does the report distinguish application-consistent (GuestAgent) from crash-consistent (Online only) snapshots?", "reference": "A skilled report checks status.indications for GuestAgent presence. An unskilled report doesn't distinguish consistency levels."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/test.sh b/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/test_outputs.py new file mode 100644 index 00000000..c4189fb6 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-create/tests/test_outputs.py @@ -0,0 +1,77 @@ +""" +Tests for rh-virt__vm-snapshot-create per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_snapshot(self): + content = read_report().lower() + assert "snapshot" in content, "report should mention snapshots" + + def test_mentions_vm(self): + content = read_report().lower() + assert any(t in content for t in ["vm", "virtual machine", "virtualmachine"]), ( + "report should reference the target VM" + ) + + +class TestSkillDependent: + def test_volume_snapshot_class(self): + """Skill: VolumeSnapshotClass prerequisite for CSI snapshot support.""" + c = read_report().lower() + assert any(t in c for t in ["volumesnapshotclass", "volume snapshot class", "snapshot class", "csi driver"]), ( + "should mention VolumeSnapshotClass for snapshot prerequisites" + ) + + def test_quiesce_consistency(self): + """Skill: Quiesce/freeze for application-consistent snapshots; guest agent.""" + c = read_report().lower() + assert any(t in c for t in ["quiesce", "freeze", "thaw", "guest agent", "application-consistent", "qemu-guest-agent"]), ( + "should discuss quiesce/freeze for consistency" + ) + + def test_snapshot_cr_structure(self): + """Skill: VirtualMachineSnapshot CR with spec.source.""" + c = read_report().lower() + 
assert "virtualmachinesnapshot" in c and any(t in c for t in ["spec", "source", "snapshot.kubevirt", "apiversion"]), ( + "should define VirtualMachineSnapshot resource structure" + ) + + def test_hot_plugged_blocker(self): + """Skill: Hot-plugged volumes block snapshot creation.""" + c = read_report().lower() + assert any(t in c for t in ["hot-plug", "hotplug", "hot plug", "block snapshot", "cannot snapshot"]), ( + "should address hot-plugged volumes blocking snapshots" + ) + + def test_status_indications(self): + """Skill: status.indications (GuestAgent, Online) for consistency level.""" + c = read_report().lower() + assert any(t in c for t in ["indications", "guestagent", "online", "status.phase", "inprogress", "succeeded"]), ( + "should reference snapshot status/indications" + ) + + def test_guest_agent_connected_check(self): + """Docs teach checking AgentConnected condition to determine if + application-consistent (vs crash-consistent) snapshots are possible. + Without docs, agents don't check guest agent status before snapshot.""" + c = read_report().lower() + assert any(t in c for t in [ + "agentconnected", "agent connected", "guest agent", + "application-consistent", "crash-consistent", + ]), "should check AgentConnected for snapshot consistency level" diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-delete/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-snapshot-delete/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-delete/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ 
+current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-delete/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-snapshot-delete/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2e083d72 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-delete/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1458 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. 
+ +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 
1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", 
"production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", 
"compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", 
"rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", 
"virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM 
advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + 
"VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == 
"AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in 
vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + 
"devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + 
"metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": 
"ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + 
"vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + 
"vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) 
+ if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, 
headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], 
s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], 
+ } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No 
resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# 
═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-delete/instruction.md b/evaluation/with_skills/rh-virt__vm-snapshot-delete/instruction.md new file mode 100644 index 00000000..3058c144 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-delete/instruction.md @@ -0,0 +1,12 @@ +# VM Snapshot Deletion Task + +You are an OpenShift Virtualization administrator. Delete snapshot `production-db-backup-20240215` for VM `production-db` in namespace `prod-vms`. + +## Requirements +- Verify the snapshot is safe to delete (no active restores, not the last snapshot) +- Include user confirmation safeguards +- Verify deletion completed + +Use MCP tools to examine the cluster. Document your methodology, findings, and deletion plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-delete/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-snapshot-delete/solution/solve.sh new file mode 100644 index 00000000..11098bb3 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-delete/solution/solve.sh @@ -0,0 +1,26 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Snapshot Deletion Plan + +## Target: production-db-backup-20240215 + +### Safety Checks +1. **Restore conflict check**: Verify no active VirtualMachineRestore references this snapshot + - If snapshot is in use by a restore operation, deletion will fail +2. **Last snapshot warning**: List all snapshots for production-db + - Other snapshots exist (production-db-backup-20240301) — NOT the last snapshot + - If this were the only remaining snapshot, show explicit warning + +### Deletion Procedure +1. Verify snapshot exists (apiVersion: snapshot.kubevirt.io/v1beta1, kind: VirtualMachineSnapshot) +2. Check for active VirtualMachineRestore resources (snapshot in use blocks deletion) +3. List other snapshots for production-db via labelSelector vm.kubevirt.io/name +4. Request user confirmation (proceed yes/no) +5. Delete snapshot via resources_delete +6. Verify deletion completed +7. Impact: Storage freed, recovery point removed + +### Note +This is NOT the last snapshot — production-db-backup-20240301 remains available for restore. 
+ +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-delete/task.toml b/evaluation/with_skills/rh-virt__vm-snapshot-delete/task.toml new file mode 100644 index 00000000..7d13e981 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-delete/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-snapshot-delete" +name = "rh-virt VM Snapshot Deletion Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-snapshot-delete", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/llm_judge.py new file mode 100644 index 00000000..92546360 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "restore_conflict", "file": "/root/report.md", "question": "Does the report check for active VirtualMachineRestore before deleting a snapshot?", "reference": "A skilled report checks for active restores. An unskilled report deletes without checking conflicts."}, + {"id": "last_snapshot_warning", "file": "/root/report.md", "question": "Does the report warn when deleting the only remaining snapshot for a VM?", "reference": "A skilled report warns about loss of last recovery point. 
An unskilled report deletes without warning."}, + {"id": "label_selector_filter", "file": "/root/report.md", "question": "Does the report use spec.source.name or vm.kubevirt.io/name label to list other snapshots for the same VM?", "reference": "A skilled report uses proper filtering to find related snapshots. An unskilled report lists all snapshots without VM filtering."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/test.sh b/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/test_outputs.py new file mode 100644 index 00000000..f7220d55 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-delete/tests/test_outputs.py @@ -0,0 +1,71 @@ +""" +Tests for rh-virt__vm-snapshot-delete per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_snapshot(self): + content = read_report().lower() + assert "snapshot" in content, "report should mention snapshots" + + def test_mentions_deletion(self): + content = read_report().lower() + assert "delet" in content, "report should discuss deletion" + + +class TestSkillDependent: + def test_restore_conflict_check(self): + """Skill: Active VirtualMachineRestore blocks snapshot deletion.""" + c = read_report().lower() + assert any(t in c for t in ["virtualmachinerestore", "restore", "in use", "active restore", "block delet"]) and ( + "restore" in c or "conflict" in c + ), ( + "should check for active restore blocking deletion" + ) + + def test_last_snapshot_warning(self): + """Skill: Warn when deleting the only snapshot for a VM.""" + c = read_report().lower() + assert any(t in c for t in ["last snapshot", "only snapshot", "no recovery", "only remaining", "no other snapshot"]) or ( + "last" in c and "snapshot" in c and ("warn" in c or "only" in c) + ), ( + "should warn when deleting the last snapshot for a VM" + ) + + def test_storage_reclaim(self): + """Skill: Storage freed by deletion; recovery point lost.""" + c = 
read_report().lower() + assert any(t in c for t in ["storage freed", "storage reclaim", "freed", "recovery point"]), ( + "should mention storage reclamation or recovery point loss" + ) + + def test_virtualmachinesnapshot_delete(self): + """Skill: Delete VirtualMachineSnapshot resource.""" + c = read_report().lower() + assert any(t in c for t in ["virtualmachinesnapshot", "resources_delete", "delete snapshot"]) and ( + "snapshot" in c + ), ( + "should reference VirtualMachineSnapshot deletion" + ) + + def test_list_other_snapshots(self): + """Skill: List other snapshots for same VM before delete.""" + c = read_report().lower() + assert any(t in c for t in ["spec.source.name", "label selector", "vm.kubevirt.io/name", "other snapshot", "list snapshot", "same vm"]), ( + "should list other snapshots for the VM" + ) diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-list/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-snapshot-list/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-list/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills 
/root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-list/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-snapshot-list/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..1d1132df --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-list/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1500 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. 
+ +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default, prod-vms + Nodes: 8 workers (hypervisor-class) + VMs: 33 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("prod-vms", {"env": "production"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + 
_n("hv-prod-dc1-03", "dc1", 
"Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", 
"virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", 
"production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── prod-vms (instruction-specific) ────────────────────────────────── + _vm("production-db", "prod-vms", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true"}, + 8, 16, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", 
"hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", 
"Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other 
user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + 
"VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + 
annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): 
+ return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + 
"uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + 
for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": 
vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, + # ── prod-vms / production-db 
(instruction-specific) ─────────────────── + { + "name": "production-db-backup-20260210", + "namespace": "prod-vms", + "vm_name": "production-db", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-10T08:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-proddb-root-20260210"}, + ], + }, + { + "name": "production-db-snap-20260218", + "namespace": "prod-vms", + "vm_name": "production-db", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-18T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-proddb-root-20260218"}, + ], + }, + { + "name": "production-db-snap-failed", + "namespace": "prod-vms", + "vm_name": "production-db", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-22T11:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": 
"VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": 
vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.lstrip("=").strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + 
+def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], 
str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + 
headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, 
+ labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." 
+ + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: 
NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-list/instruction.md b/evaluation/with_skills/rh-virt__vm-snapshot-list/instruction.md new file mode 100644 index 00000000..2c6ed187 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-list/instruction.md @@ -0,0 +1,12 @@ +# VM Snapshot Listing Task + +You are an OpenShift Virtualization administrator. List and inspect all snapshots for VM `production-db` in namespace `prod-vms`. + +## Requirements +- List all snapshots with their status and readiness +- Show creation timestamps +- Identify any failed or incomplete snapshots + +Use MCP tools to query snapshot data. Document your methodology and write the snapshot inventory in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-list/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-snapshot-list/solution/solve.sh new file mode 100644 index 00000000..2e33f350 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-list/solution/solve.sh @@ -0,0 +1,30 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Snapshot Inventory + +## Snapshots for production-db in prod-vms + +### Query Method +- API: `resources_list(apiVersion="snapshot.kubevirt.io/v1beta1", kind="VirtualMachineSnapshot", namespace="prod-vms")` +- Filter: `labelSelector: vm.kubevirt.io/name=production-db` +- Fallback: If label missing, filter by `spec.source.name == "production-db"` + +### Snapshot List +| Name | Status | Ready | Created | +|------|--------|-------|---------| +| production-db-backup-20240301 | Succeeded | true | 2024-03-01T10:00:00Z | +| production-db-backup-20240215 | Succeeded | true | 2024-02-15T08:30:00Z | + +### Status Fields +- `status.phase`: InProgress, Succeeded, Failed +- `status.readyToUse`: true/false — snapshot can be used for restore +- `spec.source.name`: Source VM name +- `metadata.creationTimestamp`: Creation time + +### Actions +- Restore: "Restore VM production-db from snapshot " +- Delete: "Delete snapshot " + +### No failed or incomplete snapshots found. 
+ +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-list/task.toml b/evaluation/with_skills/rh-virt__vm-snapshot-list/task.toml new file mode 100644 index 00000000..3e9cc1cd --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-list/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-snapshot-list" +name = "rh-virt VM Snapshot Listing Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-snapshot-list", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/llm_judge.py new file mode 100644 index 00000000..aa42d89d --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "ready_to_use_status", "file": "/root/report.md", "question": "Does the report show readyToUse status indicating which snapshots are safe to restore?", "reference": "A skilled report includes readyToUse for each snapshot. An unskilled report only shows names and dates."}, + {"id": "phase_and_creation", "file": "/root/report.md", "question": "Does the report show status.phase (Succeeded/Failed/InProgress) and creation timestamp for each snapshot?", "reference": "A skilled report includes phase and timestamp. 
An unskilled report shows minimal snapshot metadata."}, + {"id": "label_selector_filtering", "file": "/root/report.md", "question": "Does the report mention using the vm.kubevirt.io/name label or label selector to filter or identify snapshots belonging to a specific VM?", "reference": "A skilled report references the vm.kubevirt.io/name label for filtering snapshots by source VM, or shows label selector parameters in API calls. An unskilled report lists snapshots without mentioning the KubeVirt label-based filtering mechanism."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/test.sh b/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/test_outputs.py new file mode 100644 index 00000000..06ac48d3 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-list/tests/test_outputs.py @@ -0,0 +1,62 @@ +""" +Tests for rh-virt__vm-snapshot-list per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_snapshots(self): + content = read_report().lower() + assert "snapshot" in content, "report should mention snapshots" + + def test_has_structured_output(self): + content = read_report() + assert "|" in content or "- " in content, "report should have structured output (table or list)" + + +class TestSkillDependent: + def test_ready_to_use_status(self): + """Skill: readyToUse status for restore readiness.""" + c = read_report().lower() + assert any(t in c for t in ["readytouse", "ready to use", "ready for restore"]), ( + "should reference readyToUse status for snapshot readiness" + ) + + def test_creation_timestamp(self): + """Skill: metadata.creationTimestamp or creation time.""" + c = read_report().lower() + assert any(t in c for t in ["creationtimestamp", "creation timestamp", "created", "when"]), ( + "should show creation timestamp for each snapshot" + ) + + def test_phase_status(self): + """Skill: status.phase (Succeeded, Failed, InProgress).""" + c = read_report().lower() + assert any(t in c for t in ["succeeded", "failed", "inprogress", "status.phase", "phase"]) and ( + "succeeded" in c or "failed" in c or "phase" in c + ), ( + 
"should show phase (Succeeded/Failed/InProgress)" + ) + + def test_label_selector_for_vm_filtering(self): + """Skill teaches using vm.kubevirt.io/name label selector to + filter snapshots by source VM. Without skill, agents list all + snapshots without label-based filtering.""" + c = read_report() + assert "vm.kubevirt.io" in c or "labelSelector" in c or "label selector" in c.lower(), ( + "should reference vm.kubevirt.io/name label for snapshot filtering" + ) diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-restore/environment/Dockerfile b/evaluation/with_skills/rh-virt__vm-snapshot-restore/environment/Dockerfile new file mode 100644 index 00000000..ae625e01 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-restore/environment/Dockerfile @@ -0,0 +1,70 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY skills /root/.claude/skills +COPY docs /root/.claude/docs +COPY skills /root/.codex/skills +COPY docs /root/.codex/docs +COPY skills /root/.opencode/skill +COPY docs /root/.opencode/docs +COPY skills /root/.goose/skills +COPY docs /root/.goose/docs +COPY skills /root/.factory/skills +COPY docs /root/.factory/docs +COPY skills /root/.agents/skills +COPY 
docs /root/.agents/docs +COPY skills /root/.gemini/skills +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-restore/environment/mcp-servers/mock-virt-mcp.py b/evaluation/with_skills/rh-virt__vm-snapshot-restore/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2e083d72 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-restore/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1458 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + 
("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 
8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", 
"compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + 
_vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", 
"Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + 
[("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further 
security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return 
f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": 
"rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + 
labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": 
adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + 
"capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": 
"Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": 
"snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# 
FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + # Accept both "=" and "==" as equality operators. + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.lstrip("=").strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == 
"kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" 
and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", 
"AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: 
+ resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource 
(YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." 
+ headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." + + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + 
}) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." 
+ headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' + headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is 
running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or "run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." 
+ + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-restore/instruction.md b/evaluation/with_skills/rh-virt__vm-snapshot-restore/instruction.md new file mode 100644 index 00000000..d28e79fd --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-restore/instruction.md @@ -0,0 +1,12 @@ +# VM Snapshot Restore Task + +You are an OpenShift Virtualization administrator. Restore VM `production-db` from snapshot `production-db-backup-20240301` in namespace `prod-vms`. + +## Requirements +- Verify snapshot is ready and valid +- Address VM state requirements for restore +- Include safeguards (this is a destructive operation) + +Use MCP tools to examine the cluster. Document your methodology, findings, and restore plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
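The three requirements above can be read as gating checks that run before any restore is attempted. A minimal sketch, assuming the snapshot is available as a parsed `VirtualMachineSnapshot` dict and the VM state as a printable-status string; `restore_blockers` and its arguments are hypothetical names for illustration and are not part of the task or the mock MCP server.

```python
def restore_blockers(snapshot: dict, vm_status: str, typed_name: str) -> list:
    """Return the reasons a restore should not proceed; an empty list means go.

    Hypothetical helper: the dict shape mirrors a VirtualMachineSnapshot object,
    and vm_status is assumed to be the VM's printable status string.
    """
    blockers = []
    status = snapshot.get("status", {})
    # Requirement 1: the snapshot must be ready and valid.
    if status.get("readyToUse") is not True:
        blockers.append("snapshot is not ready to use")
    # Requirement 2: the VM must be stopped before the restore.
    if vm_status != "Stopped":
        blockers.append(f"VM is {vm_status}; stop it before restoring")
    # Requirement 3: destructive-operation safeguard via typed confirmation.
    if typed_name != snapshot.get("metadata", {}).get("name"):
        blockers.append("typed confirmation does not match the snapshot name")
    return blockers


snapshot = {
    "metadata": {"name": "production-db-backup-20240301", "namespace": "prod-vms"},
    "status": {"readyToUse": True},
}
print(restore_blockers(snapshot, "Running", "production-db-backup-20240301"))
```

Returning a list of reasons, rather than a single boolean, lets a report state every unmet precondition at once.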
diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-restore/solution/solve.sh b/evaluation/with_skills/rh-virt__vm-snapshot-restore/solution/solve.sh new file mode 100644 index 00000000..d4698552 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-restore/solution/solve.sh @@ -0,0 +1,42 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Snapshot Restore Plan + +## Restore production-db from production-db-backup-20240301 + +### Prerequisites +1. Verify snapshot exists and `status.phase == "Succeeded"` and `status.readyToUse == true` +2. **VM must be stopped** before restore — use `vm_lifecycle` action=stop +3. Verify no active VirtualMachineRestore in progress + +### VirtualMachineRestore YAML +```yaml +apiVersion: snapshot.kubevirt.io/v1beta1 +kind: VirtualMachineRestore +metadata: + name: restore-production-db-20240301 + namespace: prod-vms +spec: + target: + apiGroup: kubevirt.io + kind: VirtualMachine + name: production-db + virtualMachineSnapshotName: production-db-backup-20240301 +``` + +### Procedure +1. Stop VM production-db +2. Verify snapshot is ready (readyToUse: true) +3. **Typed confirmation**: Type snapshot name for safety +4. Create VirtualMachineRestore resource +5. Monitor restore progress (poll status.phase) +6. 
Start VM after restore completes + +### Warning +- Restore **overwrites** current VM state with snapshot state +- All changes since snapshot will be lost +- **Typed confirmation**: User must type exact snapshot name +- Monitor VirtualMachineRestore status.complete +- Create via resources_create_or_update + +REPORT_EOF diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-restore/task.toml b/evaluation/with_skills/rh-virt__vm-snapshot-restore/task.toml new file mode 100644 index 00000000..bf15ebed --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-restore/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-snapshot-restore" +name = "rh-virt VM Snapshot Restore Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-snapshot-restore", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/llm_judge.py b/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/llm_judge.py new file mode 100644 index 00000000..0a348593 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "vm_stopped_prerequisite", "file": "/root/report.md", "question": "Does the report require the VM to be stopped before restore and explain this is enforced by the platform?", "reference": "A skilled report enforces stop-before-restore. 
An unskilled report doesn't mention this prerequisite."}, + {"id": "restore_cr_definition", "file": "/root/report.md", "question": "Does the report define a VirtualMachineRestore CR with virtualMachineSnapshotName reference?", "reference": "A skilled report creates proper VirtualMachineRestore resource. An unskilled report doesn't know the restore API."}, + {"id": "destructive_warning_and_verification", "file": "/root/report.md", "question": "Does the report warn about data loss (changes since snapshot) and verify restore completion via status.complete?", "reference": "A skilled report warns about destructive nature and verifies completion. An unskilled report restores without warnings."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
 (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/test.sh b/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/test_outputs.py b/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/test_outputs.py new file mode 100644 index 00000000..e02b5cf9 --- /dev/null +++ b/evaluation/with_skills/rh-virt__vm-snapshot-restore/tests/test_outputs.py @@ -0,0 +1,71 @@ +""" +Tests for rh-virt__vm-snapshot-restore per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_restore(self): + content = read_report().lower() + assert "restor" in content, "report should discuss restore operation" + + def test_mentions_snapshot(self): + content = read_report().lower() + assert "snapshot" in content or "backup" in content, "report should mention the snapshot" + + +class TestSkillDependent: + def test_vm_stopped_prerequisite(self): + """Skill: VM must be stopped before restore; stop-and-restore option.""" + c = read_report().lower() + assert any(t in c for t in ["stop before restor", "must be stopped", "stop-and-restore", "vm must be stopped", "halt"]) and ( + "stop" in c and "restor" in c + ), ( + "should require VM stopped before restore" + ) + + def test_destructive_warning(self): + """Skill: Data loss warning; changes since snapshot will be lost.""" + c = read_report().lower() + assert any(t in c for t in ["data loss", "changes since", "will be lost", "overwrite", "destructive", "replace current", "cannot recover"]), ( + "should warn about data loss from restore" + ) + + def test_restore_cr(self): + """Skill: VirtualMachineRestore CR with target and snapshot reference.""" + c = 
read_report().lower() + assert "virtualmachinerestore" in c and any(t in c for t in ["target", "virtualmachinesnapshotname", "spec"]), ( + "should define VirtualMachineRestore resource" + ) + + def test_post_restore_verification(self): + """Skill: Verify restore complete; status.complete; start VM after.""" + c = read_report().lower() + assert any(t in c for t in ["status.complete", "restore complete", "post-restore", "after restore", "start vm", "start the vm"]) and ( + "restor" in c or "complete" in c or "start" in c + ), ( + "should include post-restore verification or start step" + ) + + def test_typed_confirmation(self): + """Skill: Typed snapshot name confirmation before restore.""" + c = read_report().lower() + assert any(t in c for t in ["type", "typed", "exact name", "to confirm", "snapshot name"]) and ( + "confirm" in c or "type" in c + ), ( + "should require typed snapshot name confirmation" + ) diff --git a/evaluation/without_skills/ocp-admin__cluster-report/environment/Dockerfile b/evaluation/without_skills/ocp-admin__cluster-report/environment/Dockerfile new file mode 100644 index 00000000..5fe00bae --- /dev/null +++ b/evaluation/without_skills/ocp-admin__cluster-report/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs 
/root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-ocp-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/ocp-admin__cluster-report/environment/mcp-servers/mock-ocp-mcp.py b/evaluation/without_skills/ocp-admin__cluster-report/environment/mcp-servers/mock-ocp-mcp.py new file mode 100644 index 00000000..65e0b6b5 --- /dev/null +++ b/evaluation/without_skills/ocp-admin__cluster-report/environment/mcp-servers/mock-ocp-mcp.py @@ -0,0 +1,304 @@ +#!/usr/bin/env python3 + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +CONTEXTS = [ + ("prod-us-east", "https://api.prod-us-east.example.com:6443", "OpenShift 4.16.3", 6, "high"), + ("prod-eu-west", "https://api.prod-eu-west.example.com:6443", "OpenShift 4.15.12", 4, "moderate"), + ("staging-central", "https://api.staging-central.example.com:6443", "OpenShift 4.16.1", 3, "low"), + ("dev-k8s", "https://dev-k8s.internal.example.com:6443", "Kubernetes", 2, "low"), + ("legacy-dc", "https://legacy-dc.example.com:6443", "OpenShift 4.14", 5, "unknown"), +] + +UNREACHABLE = {"legacy-dc"} +OPENSHIFT_CONTEXTS = {"prod-us-east", "prod-eu-west", "staging-central", "legacy-dc"} +NON_OPENSHIFT = {"dev-k8s"} + + +def _check_context(context): + ctx = (context or "prod-us-east").strip() + if ctx in UNREACHABLE: + raise ConnectionError(f"Connection refused to {ctx}") + valid = {c[0] for c in CONTEXTS} + if ctx not in valid: + raise ValueError(f"Unknown context: {ctx}") + return ctx + + +def _format_tabular(headers, rows): + if not headers or not rows: + return 
"" + widths = [len(h) for h in headers] + for row in rows: + for i, h in enumerate(headers): + val = str(row.get(h, "")) + widths[i] = max(widths[i], len(val)) + lines = [] + header_line = "".join(h.ljust(w + 2) for h, w in zip(headers, widths)) + lines.append(header_line.rstrip()) + for row in rows: + line = "".join(str(row.get(h, "")).ljust(w + 2) for h, w in zip(headers, widths)) + lines.append(line.rstrip()) + return "\n".join(lines) + + +# Node data for resources_get (Node kind) +NODE_DATA = { + "prod-us-east": { + "node-us-master-1": { + "metadata": {"name": "node-us-master-1", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-us-master-2": { + "metadata": {"name": "node-us-master-2", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-us-master-3": { + "metadata": {"name": "node-us-master-3", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-us-worker-1": { + "metadata": {"name": "node-us-worker-1", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": { + "allocatable": {"cpu": "32", "memory": "128Gi", "pods": "250", "nvidia.com/gpu": "4"}, + "conditions": [], + }, + }, + "node-us-worker-2": { + "metadata": {"name": "node-us-worker-2", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, "conditions": []}, + }, + "node-us-worker-3": { + "metadata": {"name": "node-us-worker-3", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250", "nvidia.com/gpu": "4"}, + "conditions": [], + }, + }, + }, + "prod-eu-west": { + "node-eu-master-1": { + "metadata": {"name": "node-eu-master-1", "labels": 
{"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-eu-worker-1": { + "metadata": {"name": "node-eu-worker-1", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, "conditions": []}, + }, + "node-eu-worker-2": { + "metadata": {"name": "node-eu-worker-2", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, "conditions": []}, + }, + "node-eu-worker-3": { + "metadata": {"name": "node-eu-worker-3", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, "conditions": []}, + }, + }, + "staging-central": { + "node-staging-master-1": { + "metadata": {"name": "node-staging-master-1", "labels": {"node-role.kubernetes.io/master": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "16Gi", "pods": "250"}, "conditions": []}, + }, + "node-staging-worker-1": { + "metadata": {"name": "node-staging-worker-1", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, "conditions": []}, + }, + "node-staging-worker-2": { + "metadata": {"name": "node-staging-worker-2", "labels": {"node-role.kubernetes.io/worker": ""}}, + "status": {"allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, "conditions": []}, + }, + }, + "dev-k8s": { + "node-dev-1": { + "metadata": {"name": "node-dev-1", "labels": {"node-role.kubernetes.io/control-plane": ""}}, + "status": {"allocatable": {"cpu": "4", "memory": "8Gi", "pods": "110"}, "conditions": []}, + }, + "node-dev-2": { + "metadata": {"name": "node-dev-2", "labels": {}}, + "status": {"allocatable": {"cpu": "4", "memory": "8Gi", "pods": "110"}, "conditions": []}, + }, + }, +} + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all kubeconfig 
contexts with server URLs and cluster info.""" + headers = ["CONTEXT", "SERVER", "VERSION", "NODES", "UTILIZATION"] + rows = [{"CONTEXT": c[0], "SERVER": c[1], "VERSION": c[2], "NODES": str(c[3]), "UTILIZATION": c[4]} for c in CONTEXTS] + return _format_tabular(headers, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str | None = None, + context: str | None = None, +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + ctx = _check_context(context) + + if apiVersion == "config.openshift.io/v1" and kind == "ClusterVersion": + if ctx in NON_OPENSHIFT: + raise ValueError("ClusterVersion not found (non-OpenShift cluster)") + versions = { + "prod-us-east": "4.16.3", + "prod-eu-west": "4.15.12", + "staging-central": "4.16.1", + "legacy-dc": "4.14", + } + ver = versions.get(ctx, "4.16.0") + return f'{{"apiVersion":"config.openshift.io/v1","kind":"ClusterVersion","metadata":{{"name":"version"}},"status":{{"desired":{{"version":"{ver}"}}}}}}' + + if apiVersion == "v1" and kind == "Node": + nodes = NODE_DATA.get(ctx, {}) + if name not in nodes: + raise ValueError(f"Node {name} not found") + return json.dumps(nodes[name]) + + raise ValueError(f"Unsupported resource: {apiVersion}/{kind}") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str | None = None, + context: str | None = None, +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + ctx = _check_context(context) + + if apiVersion == "v1" and kind == "Node": + nodes = NODE_DATA.get(ctx, {}) + return json.dumps(list(nodes.values())) + + if apiVersion == "v1" and kind == "Namespace": + return namespaces_list(context=ctx) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def nodes_top(context: str | None = None) -> str: + """Return node CPU and memory usage from Metrics Server.""" + ctx = _check_context(context) + + # prod-us-east: node-us-worker-1 (28.4/32=89%, 
112.6/128=88%), node-us-worker-3 (14.2/16=89%, 56.8/64=89%) + if ctx == "prod-us-east": + rows = [ + {"NAME": "node-us-master-1", "CPU(cores)": "1.2", "MEMORY(bytes)": "4Gi"}, + {"NAME": "node-us-master-2", "CPU(cores)": "1.1", "MEMORY(bytes)": "3.8Gi"}, + {"NAME": "node-us-master-3", "CPU(cores)": "1.0", "MEMORY(bytes)": "3.6Gi"}, + {"NAME": "node-us-worker-1", "CPU(cores)": "28.4", "MEMORY(bytes)": "112.6Gi"}, + {"NAME": "node-us-worker-2", "CPU(cores)": "8.2", "MEMORY(bytes)": "32Gi"}, + {"NAME": "node-us-worker-3", "CPU(cores)": "14.2", "MEMORY(bytes)": "56.8Gi"}, + ] + elif ctx == "prod-eu-west": + rows = [ + {"NAME": "node-eu-master-1", "CPU(cores)": "0.8", "MEMORY(bytes)": "3Gi"}, + {"NAME": "node-eu-worker-1", "CPU(cores)": "6.2", "MEMORY(bytes)": "24Gi"}, + {"NAME": "node-eu-worker-2", "CPU(cores)": "5.8", "MEMORY(bytes)": "22Gi"}, + {"NAME": "node-eu-worker-3", "CPU(cores)": "7.1", "MEMORY(bytes)": "28Gi"}, + ] + elif ctx == "staging-central": + rows = [ + {"NAME": "node-staging-master-1", "CPU(cores)": "0.5", "MEMORY(bytes)": "2Gi"}, + {"NAME": "node-staging-worker-1", "CPU(cores)": "2.1", "MEMORY(bytes)": "8Gi"}, + {"NAME": "node-staging-worker-2", "CPU(cores)": "1.8", "MEMORY(bytes)": "7Gi"}, + ] + elif ctx == "dev-k8s": + rows = [ + {"NAME": "node-dev-1", "CPU(cores)": "1.2", "MEMORY(bytes)": "3Gi"}, + {"NAME": "node-dev-2", "CPU(cores)": "2.0", "MEMORY(bytes)": "5Gi"}, + ] + else: + rows = [] + + headers = ["NAME", "CPU(cores)", "MEMORY(bytes)"] + return _format_tabular(headers, rows) + + +@mcp.tool() +def pods_list(namespace: str | None = None, context: str | None = None) -> str: + """List pods across namespaces.""" + ctx = _check_context(context) + + if ctx == "prod-us-east": + rows = [ + {"NAMESPACE": "batch-jobs", "NAME": "data-pipeline-batch-abc", "STATUS": "Failed"}, + {"NAMESPACE": "batch-jobs", "NAME": "data-pipeline-batch-def", "STATUS": "Failed"}, + {"NAMESPACE": "ci-cd", "NAME": "image-builder", "STATUS": "CrashLoopBackOff"}, + 
{"NAMESPACE": "app-platform", "NAME": "deploy-canary", "STATUS": "Pending"}, + {"NAMESPACE": "default", "NAME": "api-server", "STATUS": "Running"}, + {"NAMESPACE": "default", "NAME": "web-frontend", "STATUS": "Running"}, + {"NAMESPACE": "openshift-monitoring", "NAME": "prometheus-0", "STATUS": "Running"}, + ] + elif ctx == "prod-eu-west": + rows = [ + {"NAMESPACE": "security", "NAME": "compliance-scanner-failed", "STATUS": "Failed"}, + {"NAMESPACE": "default", "NAME": "api-eu", "STATUS": "Running"}, + ] + elif ctx == "staging-central": + rows = [ + {"NAMESPACE": "staging-apps", "NAME": "image-pull-broken-pod", "STATUS": "ImagePullBackOff"}, + {"NAMESPACE": "default", "NAME": "staging-api", "STATUS": "Running"}, + ] + elif ctx == "dev-k8s": + rows = [ + {"NAMESPACE": "default", "NAME": "dev-pod-1", "STATUS": "Running"}, + {"NAMESPACE": "kube-system", "NAME": "coredns-xyz", "STATUS": "Running"}, + ] + else: + rows = [] + + headers = ["NAMESPACE", "NAME", "STATUS"] + return _format_tabular(headers, rows) + + +@mcp.tool() +def projects_list(context: str | None = None) -> str: + """List OpenShift projects.""" + ctx = _check_context(context) + if ctx in NON_OPENSHIFT: + raise ValueError("projects_list is OpenShift-only; use namespaces_list for vanilla Kubernetes") + + counts = {"prod-us-east": 21, "prod-eu-west": 16, "staging-central": 12, "legacy-dc": 8} + n = counts.get(ctx, 5) + rows = [{"NAME": f"project-{i}"} for i in range(1, n + 1)] + headers = ["NAME"] + return _format_tabular(headers, rows) + + +@mcp.tool() +def namespaces_list(context: str | None = None) -> str: + """List all namespaces in a cluster.""" + ctx = _check_context(context) + + if ctx == "dev-k8s": + # 6 namespaces for vanilla Kubernetes + rows = [ + {"NAME": "default"}, + {"NAME": "kube-system"}, + {"NAME": "kube-public"}, + {"NAME": "kube-node-lease"}, + {"NAME": "app-dev"}, + {"NAME": "monitoring"}, + ] + else: + # OpenShift: projects map to namespaces + counts = {"prod-us-east": 21, 
"prod-eu-west": 16, "staging-central": 12} + n = counts.get(ctx, 5) + rows = [{"NAME": f"project-{i}"} for i in range(1, n + 1)] + + headers = ["NAME"] + return _format_tabular(headers, rows) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/ocp-admin__cluster-report/instruction.md b/evaluation/without_skills/ocp-admin__cluster-report/instruction.md new file mode 100644 index 00000000..b13ffc9a --- /dev/null +++ b/evaluation/without_skills/ocp-admin__cluster-report/instruction.md @@ -0,0 +1,17 @@ +# Cluster Report Task + +You are an OpenShift cluster administrator. Your operations lead has requested a comprehensive infrastructure health snapshot for the weekly review. Your environment has multiple cluster contexts configured. + +## Requirements +- Discover all available cluster contexts in your environment +- For each accessible OpenShift cluster, report: + - Cluster version and API server URL + - All nodes with their status (Ready/NotReady), roles, and resource utilization (CPU and memory usage vs capacity) + - All projects/namespaces with their status + - Workload counts: total pods, running vs failing, and any pods in error states +- Explicitly note any contexts that are not OpenShift clusters or could not be reached, and explain why +- Highlight any issues that need attention (unhealthy nodes, resource pressure, failing workloads) + +Use MCP tools to examine the clusters. Write the complete cluster report in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/ocp-admin__cluster-report/solution/solve.sh b/evaluation/without_skills/ocp-admin__cluster-report/solution/solve.sh new file mode 100644 index 00000000..62bd7e47 --- /dev/null +++ b/evaluation/without_skills/ocp-admin__cluster-report/solution/solve.sh @@ -0,0 +1,30 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Multi-Cluster Health Report + +## Cluster Discovery +Use configuration_contexts_list for kubeconfig contexts. Verify each with resources_get(apiVersion="config.openshift.io/v1", kind="ClusterVersion", name="version"). + +## Cluster Contexts +| Context | Type | Server | +|---------|------|--------| +| ocp-prod | OpenShift (ClusterVersion detected) | https://api.ocp-prod.example.com:6443 | + +### OpenShift Detection +Check for ClusterVersion resource: config.openshift.io/v1. Non-OpenShift contexts excluded by default. + +## Node Resources +| Node | CPU | Memory | GPUs | +|------|-----|--------|------| +| worker-01 | 16 cores (45% used) | 64Gi (60% used) | 2 | +| worker-02 | 16 cores (30% used) | 64Gi (40% used) | 0 | + +## Pod Status +| Namespace | Running | Pending | Failed | +|-----------|---------|---------|--------| +| default | 5 | 0 | 0 | +| openshift-operators | 12 | 0 | 1 | + +### Generated using assemble.py and aggregate.py helper scripts +Persist MCP output to /tmp/cluster-report/. Manifest with $file refs. 
projects_list (fallback namespaces_list for non-OpenShift) +REPORT_EOF diff --git a/evaluation/without_skills/ocp-admin__cluster-report/task.toml b/evaluation/without_skills/ocp-admin__cluster-report/task.toml new file mode 100644 index 00000000..51a06299 --- /dev/null +++ b/evaluation/without_skills/ocp-admin__cluster-report/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "ocp-admin__cluster-report" +name = "ocp-admin Multi-Cluster Health Report Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["ocp-admin", "cluster-report", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/ocp-admin__cluster-report/tests/llm_judge.py b/evaluation/without_skills/ocp-admin__cluster-report/tests/llm_judge.py new file mode 100644 index 00000000..6c379f29 --- /dev/null +++ b/evaluation/without_skills/ocp-admin__cluster-report/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "clusterversion_verification", + "file": "/root/report.md", + "question": "Does the report explicitly mention probing or checking the ClusterVersion resource as the method used to determine whether each context is an OpenShift cluster?", + "reference": "A skilled report should mention 'ClusterVersion' as the API resource used to verify OpenShift status. It should explain that dev-k8s was classified as non-OpenShift because no ClusterVersion resource was found. 
Simply saying 'vanilla Kubernetes' or 'not OpenShift' without mentioning the ClusterVersion verification mechanism is insufficient." + }, + { + "id": "exclusion_methodology", + "file": "/root/report.md", + "question": "Does the report treat non-OpenShift clusters (like dev-k8s) as EXCLUDED from the detailed report — listing them briefly in an exclusion section — rather than including them as full sections with node/pod details?", + "reference": "A skilled report should have a separate 'Excluded Clusters' or 'Non-OpenShift' section where dev-k8s is listed briefly with the reason for exclusion. A report that includes dev-k8s as a full section with node details, namespaces, and pod data is NOT demonstrating the skill's exclusion methodology." + }, + { + "id": "aggregated_totals", + "file": "/root/report.md", + "question": "Does the report include aggregated totals across all OpenShift clusters — total nodes, total CPU, total memory, total GPUs — in a comparison or summary table?", + "reference": "A skilled report should have a comparison table with a 'Total' row showing aggregate counts (e.g., 13 nodes total, 148 CPU cores, 592 GiB memory, 8 GPUs). Reports that list each cluster's data without cross-cluster aggregation are insufficient." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/ocp-admin__cluster-report/tests/test.sh b/evaluation/without_skills/ocp-admin__cluster-report/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/ocp-admin__cluster-report/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + 
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/ocp-admin__cluster-report/tests/test_outputs.py b/evaluation/without_skills/ocp-admin__cluster-report/tests/test_outputs.py new file mode 100644 index 00000000..5c65747c --- /dev/null +++ b/evaluation/without_skills/ocp-admin__cluster-report/tests/test_outputs.py @@ -0,0 +1,105 @@ +""" +Tests for ocp-admin__cluster-report per-skill evaluation. +Baseline tests: any competent agent should pass. +Skill-dependent tests: based on empirical gaps between skilled and unskilled agent outputs. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_cluster(self): + content = read_report().lower() + assert any(t in content for t in ["cluster", "openshift", "node"]), ( + "report should mention cluster" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_clusterversion_resource(self): + """Skill teaches to probe the ClusterVersion resource to verify OpenShift. + Without skill, agents say 'vanilla Kubernetes' without mentioning the mechanism.""" + c = read_report().lower() + assert "clusterversion" in c or "cluster version resource" in c, ( + "should mention ClusterVersion resource as the OpenShift verification method" + ) + + def test_aggregated_cross_cluster_totals(self): + """Skill teaches a comparison table with aggregated totals across clusters. 
+ Without skill, agents report each cluster separately without totals.""" + c = read_report().lower() + has_total_label = "total" in c or "aggregate" in c or "combined" in c + has_aggregate_context = any(t in c for t in [ + "total node", "total cpu", "total memory", "total gpu", + "across cluster", "combined resource", "aggregate", + ]) or (has_total_label and any(t in c for t in ["node", "cpu", "core", "memory", "gi"])) + assert has_total_label and has_aggregate_context, ( + "should include aggregated cross-cluster totals (total nodes, CPU, memory)" + ) + + def test_non_openshift_exclusion(self): + """Skill teaches to EXCLUDE non-OpenShift clusters from detailed reporting. + Without skill, agents include dev-k8s as a full section with nodes/pods/namespaces.""" + c = read_report().lower() + has_exclusion = any(t in c for t in [ + "excluded", "exclude", "excluded by default", "not included", + "omitted", "non-openshift", + ]) + assert has_exclusion and "dev-k8s" in c, ( + "should explicitly exclude non-OpenShift clusters from detailed data" + ) + + def test_unreachable_reporting(self): + """Both agents should mention unreachable clusters, but skill teaches categorization.""" + c = read_report().lower() + assert "legacy-dc" in c and any(t in c for t in [ + "unreachable", "connection refused", "offline", + ]), "should report legacy-dc as unreachable" + + def test_gpu_inventory(self): + """Skill template includes GPU column — moderate discriminator.""" + c = read_report().lower() + assert "gpu" in c, "should include GPU information" + + def test_version_numbers(self): + """Both agents get versions from MCP, but skill ensures all clusters are covered.""" + c = read_report() + versions = sum(1 for v in ["4.16.3", "4.15.12", "4.16.1"] if v in c) + assert versions >= 2, "should report exact version numbers for multiple clusters" + + def test_multi_cluster_tooling(self): + """Docs teach multi-cluster tooling/automation for consistent reporting. 
+ Without docs, agents rely on manual kubectl context switching.""" + c = read_report().lower() + assert any(t in c for t in [ + "build-kubeconfig", "kubeconfig.py", "cluster-reporter", + "multi-cluster", "multiple context", "all contexts", + "setup script", "automation", + ]), "should reference multi-cluster tooling or automation approach" + + def test_rbac_for_reporting(self): + """Docs teach read-only RBAC (ClusterRole/ServiceAccount) for cluster reporting + instead of admin credentials.""" + c = read_report().lower() + assert any(t in c for t in [ + "cluster-reporter-readonly", "cluster-reporter-system", + "readonly", "read-only", "clusterrole", + "service account", "serviceaccount", "rbac", + "least privilege", "non-admin", + ]), "should reference read-only RBAC for cluster reporting" diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/Dockerfile b/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/Dockerfile new file mode 100644 index 00000000..93448fa3 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/Dockerfile @@ -0,0 +1,71 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + 
+COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + }, \ + "observability": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-observability-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-observability-mcp.py b/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-observability-mcp.py new file mode 100644 index 00000000..f150dcff --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-observability-mcp.py @@ -0,0 +1,260 @@ +#!/usr/bin/env python3 +"""Mock Observability MCP server for SkillsBench rh-ai-engineer__ai-observability task. + +Simulates Prometheus/Grafana-style metrics for inference services: latency, +throughput, error rates, GPU utilization, resource usage, and alerts. 
+ +Scenario (aligned with rhoai/openshift mocks): +- ml-production namespace: + - text-gen-legacy (Mistral 7B on vLLM): OOMKilled; before crash: 22GB/24GB GPU, + p99=2800ms, throughput=3 req/s, error rate=15% + - nim-llama-prod (Llama 3.1 8B on NIM): not running, no metrics (empty/error) + - sentiment-classifier: running well, 4GB/24GB GPU, p99=45ms, throughput=150 req/s, + error rate=0.1% +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("observability") + +# ── Mock metrics data ────────────────────────────────────────────────────── + +# text-gen-legacy: OOMKilled, metrics from before crash +MODEL_METRICS = { + "ml-production": { + "text-gen-legacy": { + "status": "OOMKilled", + "latency_ms": {"p50": 1200, "p95": 2100, "p99": 2800}, + "throughput_req_per_sec": 3.0, + "error_rate_percent": 15.0, + "input_tokens_per_sec": 45, + "output_tokens_per_sec": 12, + "total_requests_24h": 259200, # 3 * 86400 + }, + "nim-llama-prod": None, # not running, no metrics + "sentiment-classifier": { + "status": "Running", + "latency_ms": {"p50": 18, "p95": 38, "p99": 45}, + "throughput_req_per_sec": 150.0, + "error_rate_percent": 0.1, + "input_tokens_per_sec": 1200, + "output_tokens_per_sec": 50, + "total_requests_24h": 12960000, + }, + }, +} + +GPU_UTILIZATION = { + "ml-production": [ + { + "pod": "text-gen-legacy-predictor-00001-abc12", + "model": "text-gen-legacy", + "gpu_memory_used_gb": 22.0, + "gpu_memory_total_gb": 24.0, + "gpu_memory_utilization_percent": 91.7, + "gpu_compute_utilization_percent": 35.0, + "status": "OOMKilled", + }, + { + "pod": "sentiment-classifier-predictor-00001-xyz99", + "model": "sentiment-classifier", + "gpu_memory_used_gb": 4.0, + "gpu_memory_total_gb": 24.0, + "gpu_memory_utilization_percent": 16.7, + "gpu_compute_utilization_percent": 42.0, + "status": "Running", + }, + # nim-llama-prod: no pod + ], +} + +RESOURCE_USAGE = { + "ml-production": [ + { + "pod": "text-gen-legacy-predictor-00001-abc12", + "model": "text-gen-legacy", + 
"cpu_request": "4", + "cpu_limit": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "cpu_actual_usage": "3.2", + "memory_actual_usage_mib": 16384, + "status": "CrashLoopBackOff", + }, + { + "pod": "sentiment-classifier-predictor-00001-xyz99", + "model": "sentiment-classifier", + "cpu_request": "2", + "cpu_limit": "4", + "memory_request": "8Gi", + "memory_limit": "16Gi", + "cpu_actual_usage": "1.1", + "memory_actual_usage_mib": 4096, + "status": "Running", + }, + ], +} + +PROMETHEUS_ALERTS = { + "ml-production": [ + { + "name": "InferenceServiceOOMKilled", + "severity": "critical", + "state": "firing", + "summary": "text-gen-legacy predictor pod OOMKilled", + "description": "Container kserve-container was OOMKilled (exit code 137). " + "GPU memory exhausted during KV cache allocation.", + "labels": { + "inference_service": "text-gen-legacy", + "namespace": "ml-production", + }, + }, + { + "name": "HighInferenceLatency", + "severity": "warning", + "state": "firing", + "summary": "text-gen-legacy p99 latency > 2000ms", + "description": "Inference latency p99 is 2800ms, exceeding threshold of 2000ms.", + "labels": { + "inference_service": "text-gen-legacy", + "namespace": "ml-production", + }, + }, + { + "name": "HighErrorRate", + "severity": "warning", + "state": "firing", + "summary": "text-gen-legacy error rate 15%", + "description": "Inference error rate is 15%, exceeding threshold of 5%.", + "labels": { + "inference_service": "text-gen-legacy", + "namespace": "ml-production", + }, + }, + ], +} + + +# ── Tools ────────────────────────────────────────────────────────────────── + + +@mcp.tool() +def query_model_metrics( + model_name: str, + namespace: str, + metric_type: str = "all", +) -> str: + """Query inference metrics for a model. Returns latency (p50/p95/p99), throughput + (requests/sec), error rates, and token counts. 
+ + metric_type: 'all', 'latency', 'throughput', 'errors', or 'tokens' + """ + ns_data = MODEL_METRICS.get(namespace) + if not ns_data: + return json.dumps({"error": f"Namespace '{namespace}' not found"}, indent=2) + + metrics = ns_data.get(model_name) + if metrics is None: + return json.dumps({ + "error": f"No metrics for model '{model_name}' in namespace '{namespace}'. " + "Model may not be running (e.g., nim-llama-prod has no pods).", + "model_name": model_name, + "namespace": namespace, + }, indent=2) + + result = { + "model_name": model_name, + "namespace": namespace, + "status": metrics["status"], + } + + if metric_type in ("all", "latency"): + result["latency_ms"] = metrics["latency_ms"] + if metric_type in ("all", "throughput"): + result["throughput_req_per_sec"] = metrics["throughput_req_per_sec"] + result["total_requests_24h"] = metrics.get("total_requests_24h") + if metric_type in ("all", "errors"): + result["error_rate_percent"] = metrics["error_rate_percent"] + if metric_type in ("all", "tokens"): + result["input_tokens_per_sec"] = metrics["input_tokens_per_sec"] + result["output_tokens_per_sec"] = metrics["output_tokens_per_sec"] + + return json.dumps(result, indent=2) + + +@mcp.tool() +def query_gpu_utilization(namespace: str) -> str: + """Query GPU memory used/total and compute utilization per inference pod.""" + pods = GPU_UTILIZATION.get(namespace, []) + if not pods: + return json.dumps({ + "namespace": namespace, + "pods": [], + "message": "No GPU-backed inference pods found in namespace.", + }, indent=2) + return json.dumps({ + "namespace": namespace, + "pods": pods, + }, indent=2) + + +@mcp.tool() +def query_resource_usage(namespace: str) -> str: + """Query actual CPU/memory usage vs requests/limits for inference pods.""" + pods = RESOURCE_USAGE.get(namespace, []) + if not pods: + return json.dumps({ + "namespace": namespace, + "pods": [], + "message": "No inference pods found in namespace.", + }, indent=2) + return json.dumps({ + "namespace": 
namespace, + "pods": pods, + }, indent=2) + + +@mcp.tool() +def list_prometheus_alerts(namespace: str) -> str: + """List firing Prometheus alerts related to inference services in the namespace.""" + alerts = PROMETHEUS_ALERTS.get(namespace, []) + return json.dumps({ + "namespace": namespace, + "alerts": alerts, + "firing_count": len(alerts), + }, indent=2) + + +@mcp.tool() +def get_model_performance_summary(namespace: str) -> str: + """Get aggregated performance data across all models in the namespace.""" + ns_data = MODEL_METRICS.get(namespace) + if not ns_data: + return json.dumps({"error": f"Namespace '{namespace}' not found"}, indent=2) + + models = [] + for name, metrics in ns_data.items(): + if metrics is None: + models.append({ + "model_name": name, + "status": "NotRunning", + "error": "No metrics available (pod not created or not running)", + }) + else: + models.append({ + "model_name": name, + "status": metrics["status"], + "latency_p99_ms": metrics["latency_ms"]["p99"], + "throughput_req_per_sec": metrics["throughput_req_per_sec"], + "error_rate_percent": metrics["error_rate_percent"], + }) + + return json.dumps({ + "namespace": namespace, + "models": models, + }, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. 
+ +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + 
"conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "cpu_limit": cpu_limit, + "memory_request": memory_request, + # Store the limit so get_inference_service reports the requested value, not its default. + "memory_limit": memory_limit, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/instruction.md b/evaluation/without_skills/rh-ai-engineer__ai-observability/instruction.md new file mode 100644 index 00000000..f76c1829 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/instruction.md @@ -0,0 +1,13 @@ +# AI Observability Task + +You are an AI engineer on Red Hat OpenShift AI. Your team has deployed several inference services, but has no visibility into how they are performing or whether resources are sized correctly. + +## Requirements +- Assess the current state of deployed inference services and their resource consumption +- Define a metrics strategy covering: inference latency, throughput, error rates, and GPU memory utilization +- Identify any models that appear over-provisioned or under-provisioned based on current usage +- Recommend specific resource adjustments (CPU, memory, GPU, replicas) with justification + +Document your observability strategy and resource recommendations in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/solution/solve.sh b/evaluation/without_skills/rh-ai-engineer__ai-observability/solution/solve.sh new file mode 100644 index 00000000..d319c204 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/solution/solve.sh @@ -0,0 +1,23 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# AI Observability Report + +## Model: fraud-detection + +### GPU Utilization (last 15m) +- GPU memory: 65% utilized +- GPU compute: 45% utilized +- Recommendation: GPU is underutilized, consider right-sizing + +### Model Latency +- P50: 120ms +- P99: 450ms + +### Right-Sizing +- Current: 1x A100 80GB +- Recommended: 1x A100 40GB (sufficient for workload) + +### Advanced Observability +- execute_promql for custom metrics (e.g., vllm:request_success:ratio) +- query_tempo_tool for trace latency on slow requests +REPORT_EOF diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/task.toml b/evaluation/without_skills/rh-ai-engineer__ai-observability/task.toml new file mode 100644 index 00000000..130dece1 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__ai-observability" +name = "rh-ai-engineer AI Observability Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "ai-observability", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/llm_judge.py b/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/llm_judge.py new file mode 100644 index 00000000..28b8d6a0 --- 
/dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "observability_stack_awareness", + "file": "/root/report.md", + "question": "Does the report recommend specific observability tools from the OpenShift AI ecosystem — specifically Tempo for distributed tracing AND Korrel8r for cross-domain correlation — as part of the observability strategy?", + "reference": "A skilled report should mention BOTH Tempo (for distributed tracing of inference requests) and Korrel8r (for correlating signals across logs, metrics, traces, and alerts). These are specific tools the skill teaches. A report that only recommends generic Prometheus/Grafana monitoring without Tempo or Korrel8r is insufficient." + }, + { + "id": "gpu_metric_specificity", + "file": "/root/report.md", + "question": "Does the report reference DCGM (Data Center GPU Manager) metric names (like DCGM_FI_DEV_FB_USED or DCGM_FI_DEV_GPU_UTIL) for GPU monitoring, rather than generic nvidia_gpu_memory metric names?", + "reference": "A skilled report uses DCGM-specific metric names (DCGM_FI_DEV_*) which are the actual metrics exposed by the GPU operator on OpenShift. Using generic names like nvidia_gpu_memory_used_bytes suggests the agent doesn't know the specific metric naming convention." 
+ }, + { + "id": "vllm_tuning_specificity", + "file": "/root/report.md", + "question": "Does the report recommend specific vLLM configuration parameters (like --max-model-len, --gpu-memory-utilization, or tensor parallelism) for resolving GPU memory issues, rather than only recommending generic resource increases?", + "reference": "A skilled report should mention vLLM-specific tuning args like --max-model-len to limit KV cache size, --gpu-memory-utilization to control memory allocation, or tensor parallelism for multi-GPU distribution. Only recommending 'increase memory to 32Gi' without vLLM-specific configuration is insufficient." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/test.sh b/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/test_outputs.py b/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/test_outputs.py new file mode 100644 index 00000000..eb3755b2 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ai-observability/tests/test_outputs.py @@ -0,0 +1,91 @@ +""" +Tests for rh-ai-engineer__ai-observability per-skill evaluation. +Baseline tests: any competent agent should pass. +Skill-dependent tests: based on empirical gaps between skilled and unskilled agent outputs. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["monitor", "metric", "observ", "inference"]), ( + "report should mention monitoring or observability" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_tempo_distributed_tracing(self): + """Skill teaches Tempo for distributed tracing of inference requests. + Without skill, agents don't mention Tempo at all.""" + c = read_report().lower() + assert any(t in c for t in ["tempo", "distributed trac"]), ( + "should recommend Tempo for distributed tracing" + ) + + def test_korrel8r_correlation(self): + """Skill teaches Korrel8r for cross-domain signal correlation. 
+ Without skill, agents don't know about Korrel8r.""" + c = read_report().lower() + assert any(t in c for t in ["korrel8r", "cross-domain correlation"]), ( + "should mention Korrel8r for cross-domain correlation" + ) + + def test_dcgm_gpu_metric_names(self): + """Skill teaches DCGM-specific GPU metric names (DCGM_FI_DEV_*). + Without skill, agents use generic nvidia_gpu_memory_* names.""" + c = read_report() + assert any(t in c for t in ["DCGM_FI_DEV", "dcgm_fi_dev", "DCGM"]), ( + "should reference DCGM GPU metric names (not generic nvidia_gpu_*)" + ) + + def test_opentelemetry_instrumentation(self): + """Skill teaches OpenTelemetry for trace instrumentation on inference endpoints. + Without skill, agents don't mention OpenTelemetry.""" + c = read_report().lower() + assert any(t in c for t in ["opentelemetry", "otel"]), ( + "should recommend OpenTelemetry instrumentation" + ) + + def test_vllm_tuning_args(self): + """Skill teaches vLLM CLI args for memory management. + Without skill, agents recommend generic resource increases but not vLLM-specific tuning.""" + c = read_report().lower() + assert any(t in c for t in [ + "max-model-len", "max_model_len", "gpu-memory-utilization", + "gpu_memory_utilization", "tensor parallel", "tensor_parallel", + ]), "should mention vLLM-specific configuration args for resource tuning" + + def test_latency_percentiles(self): + """Both agents should report latency percentiles (easy test).""" + c = read_report().lower() + assert any(t in c for t in ["p50", "p95", "p99"]), ( + "should report latency with percentiles" + ) + + def test_tensor_parallel_size_tuning(self): + """Docs teach reducing --tensor-parallel-size as GPU scheduling triage step, + and OOM mitigation via --max-model-len and quantized models (AWQ/GPTQ/FP8). 
+ Without docs, agents don't know these vLLM tuning parameters.""" + c = read_report().lower() + assert any(t in c for t in [ + "tensor-parallel-size", "tensor_parallel_size", "tensor parallel", + "awq", "gptq", "fp8", "quantiz", + ]), "should address tensor-parallel-size and quantization for GPU tuning" diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/Dockerfile b/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/Dockerfile new file mode 100644 index 00000000..aac4c84e --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ 
+}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. + +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + 
"spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) if any(p["name"] == name for p in PODS_BY_NAMESPACE.get(namespace, [])) else None + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, "cpu_limit": cpu_limit, + "memory_request": memory_request, "memory_limit": memory_limit, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + 
entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list | None = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": 
namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/instruction.md b/evaluation/without_skills/rh-ai-engineer__debug-inference/instruction.md new file mode 100644 index 00000000..11b9268d --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/instruction.md @@ -0,0 +1,13 @@ +# Inference Debugging Task + +You are an AI engineer on Red Hat OpenShift AI. There are failing model inference deployments in the `ml-production` namespace that need debugging. + +## Requirements +- List all InferenceServices in the `ml-production` namespace and identify which ones are not ready +- For each failing InferenceService, diagnose the root cause: check status conditions, pod state, container logs, events, and related resources (ServingRuntime, Account CRs) +- Recommend a specific fix for each failing deployment +- Document your methodology and the diagnostic steps you followed + +Use MCP tools to interact with the platform. Write your complete findings and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/solution/solve.sh b/evaluation/without_skills/rh-ai-engineer__debug-inference/solution/solve.sh new file mode 100644 index 00000000..6b94e02f --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/solution/solve.sh @@ -0,0 +1,36 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Inference Debug Report + +## Diagnosis Categories (get_inference_service verbosity full) + +### 1. ServingRuntime ✓ +ServingRuntime CR exists and is valid + +### 2. Pod Scheduling ✗ +Pod cannot be scheduled — check kserve-container logs (pods_log container=kserve-container) +Label selector: serving.kserve.io/inferenceservice + +### 3. 
Container Start +KServe sidecar containers may conflict with LimitRange + +### 4. Model Loading +Check model download and initialization + +### 5. GPU Access +Verify GPU allocation and CUDA compatibility + +### 6. Endpoint Health +Check InferenceService URL and readiness (PredictorReady, IngressReady conditions) + +## Events +events_list filtered by namespace for pod/InferenceService events + +## NIM Deployments +For NIM: Check Account CR (nim.opendatahub.io) for NGC credential errors + +## Observability (optional) +- korrel8r_get_correlated for cross-domain signals +- query_tempo_tool for trace latency +- execute_promql for custom metrics +REPORT_EOF diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/task.toml b/evaluation/without_skills/rh-ai-engineer__debug-inference/task.toml new file mode 100644 index 00000000..44c5ea2a --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__debug-inference" +name = "rh-ai-engineer Inference Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "debug-inference", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/llm_judge.py b/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/llm_judge.py new file mode 100644 index 00000000..748256ca --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/llm_judge.py @@ -0,0 +1,114 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + 
print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "kserve_conditions_awareness", + "file": "/root/report.md", + "question": "Does the report explicitly name the KServe InferenceService status conditions — specifically PredictorReady and IngressReady — and present them in a structured conditions table with Status/Reason/Message columns?", + "reference": "A skilled report should present a conditions table showing PredictorReady and IngressReady as distinct conditions with their status (True/False), reason, and message. Simply reporting 'CrashLoopBackOff' or 'pod failing' without naming the specific KServe conditions is insufficient." + }, + { + "id": "kserve_container_specificity", + "file": "/root/report.md", + "question": "Does the report mention 'kserve-container' by name as the specific container to inspect for logs, and reference the serving.kserve.io/inferenceservice label selector as the method for discovering predictor pods?", + "reference": "A skilled report should mention 'kserve-container' as the container name for log inspection and reference the serving.kserve.io/inferenceservice label selector for pod discovery. Generically saying 'check pod logs' or 'look at the container' without these specific KServe identifiers is insufficient." + }, + { + "id": "nim_account_cr_pattern", + "file": "/root/report.md", + "question": "Does the report prescribe creating a NIM Account custom resource (kind: Account) as the credential management mechanism for NVIDIA NIM, rather than only manually creating docker-registry secrets and patching service accounts?", + "reference": "A skilled report creates a NIM Account CR (kind: Account, apiVersion: nvidia.com/v1alpha1) with ngcSecret reference and imagePullSecret auto-creation. An unskilled report manually creates docker-registry secrets and patches service accounts without using the Account CR pattern." 
+ }, + { + "id": "ngc_credential_expiry", + "file": "/root/report.md", + "question": "Does the report identify NGC API key or pull-secret expiry as a possible root cause for image pull failures in NIM deployments, and recommend checking the secret's expiration date?", + "reference": "A skilled report checks whether the NGC pull-secret has expired as a diagnosis step for ImagePullBackOff. An unskilled report treats image pull failures generically without considering credential expiry." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/test.sh b/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/test_outputs.py b/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/test_outputs.py new file mode 100644 index 00000000..60f73901 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__debug-inference/tests/test_outputs.py @@ -0,0 +1,98 @@ +""" +Tests for rh-ai-engineer__debug-inference per-skill evaluation. +Baseline tests: any competent agent should pass. +Skill-dependent tests: based on empirical gaps between skilled and unskilled agent outputs. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["inference", "model", "serving", "deploy"]), ( + "report should mention inference" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_kserve_status_conditions(self): + """Skill teaches presenting PredictorReady and IngressReady as distinct KServe conditions. + Without skill, agents report generic pod status (CrashLoopBackOff) without naming these conditions.""" + c = read_report().lower() + assert any(t in c for t in [ + "predictorready", "predictor ready", "predictor_ready", + "ingressready", "ingress ready", "ingress_ready", + ]), "should name KServe status conditions (PredictorReady, IngressReady)" + + def test_kserve_container_name(self): + """Skill teaches 'kserve-container' as the specific container for log inspection. 
+ Without skill, agents check logs generically without naming this container.""" + c = read_report().lower() + assert "kserve-container" in c or "kserve container" in c, ( + "should mention kserve-container by name as the container to inspect" + ) + + def test_label_selector_methodology(self): + """Skill teaches using serving.kserve.io/inferenceservice label to find predictor pods. + Without skill, agents discover pods through generic namespace listing.""" + c = read_report().lower() + assert any(t in c for t in [ + "serving.kserve.io", "kserve.io/inferenceservice", + ]), "should reference the KServe label selector for predictor pod discovery" + + def test_account_cr_awareness(self): + """Skill teaches NIM Account CR as the credential management mechanism. + Without skill, agents manually create docker-registry secrets and + patch service accounts instead of using the Account custom resource.""" + c = read_report() + assert any(t in c for t in [ + "Account CR", "kind: Account", "Account resource", + "Account custom resource", + ]) or "account cr" in c.lower(), ( + "should reference NIM Account CR as credential management mechanism" + ) + + def test_nim_api_version(self): + """Skill teaches the nvidia.com API group for NIM Account and ngcSecret + field for NGC credential binding. 
Without skill, agents create + generic secrets without the Account CR pattern.""" + c = read_report().lower() + assert any(t in c for t in [ + "nvidia.com/v1alpha1", "ngcsecret", "ngc_api_key", + ]) or ("account" in c and "api" in c and "nvidia" in c), ( + "should reference NIM Account API version or NGC secret binding" + ) + + def test_root_cause_with_remediation(self): + """Both agents should link diagnosis to fix — easy test.""" + c = read_report().lower() + has_diagnosis = any(t in c for t in ["oom", "memory", "crash", "fail"]) + has_fix = any(t in c for t in ["fix", "recommend", "solution", "increase", "reduce"]) + assert has_diagnosis and has_fix, "should link diagnosis to recommended fix" + + def test_ngc_pull_secret_expiry(self): + """Docs teach NGC pull-secret expiry as a common issue, and + 'Insufficient nvidia.com/gpu' as GPU scheduling error signature. + Without docs, agents miss these specific failure patterns.""" + c = read_report().lower() + assert any(t in c for t in [ + "ngc", "pull-secret", "pull secret", "expir", + "insufficient nvidia.com/gpu", "nvidia.com/gpu", + ]), "should address NGC pull-secret expiry or GPU scheduling errors" diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/Dockerfile b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/Dockerfile new file mode 100644 index 00000000..aac4c84e --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + 
server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. 
+ +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + 
"conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..9b072b37 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,796 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"],
+            "requires_instantiation": False,
+            "source": "existing",
+            "api_protocol": "REST",
+        },
+    ],
+}
+
+INFERENCE_SERVICES = {
+    "ml-production": {
+        "text-gen-legacy": {
+            "name": "text-gen-legacy",
+            "namespace": "ml-production",
+            "runtime": "vllm-runtime",
+            "model_format": "vLLM",
+            "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3",
+            "display_name": "Mistral 7B Legacy",
+            "gpu_count": 1,
+            "cpu_request": "4",
+            "memory_request": "16Gi",
+            "memory_limit": "16Gi",
+            "min_replicas": 1,
+            "max_replicas": 1,
+            "ready": False,
+            "url": "",
+            "conditions": [
+                {
+                    "type": "Ready",
+                    "status": "False",
+                    "reason": "PredictorFailed",
+                    "message": "Predictor pod is not ready",
+                },
+                {
+                    "type": "PredictorReady",
+                    "status": "False",
+                    "reason": "ContainerCrashLoop",
+                    "message": "Container kserve-container terminated: "
+                               "OOMKilled (exit code 137). 5 restarts.",
+                },
+                {
+                    "type": "IngressReady",
+                    "status": "True",
+                    "reason": "IngressReady",
+                    "message": "Ingress is ready",
+                },
+            ],
+            "age": "3d",
+        },
+        "nim-llama-prod": {
+            "name": "nim-llama-prod",
+            "namespace": "ml-production",
+            "runtime": "nim-serving-runtime",
+            "model_format": "NIM",
+            "storage_uri": "nim://meta/llama-3.1-8b-instruct",
+            "display_name": "Llama 3.1 8B (NIM)",
+            "gpu_count": 1,
+            "cpu_request": "4",
+            "memory_request": "16Gi",
+            "memory_limit": "32Gi",
+            "min_replicas": 1,
+            "max_replicas": 1,
+            "ready": False,
+            "url": "",
+            "conditions": [
+                {
+                    "type": "Ready",
+                    "status": "False",
+                    "reason": "RuntimeNotReady",
+                    "message": "ServingRuntime 'nim-serving-runtime' "
+                               "is not in ready state",
+                },
+                {
+                    "type": "PredictorReady",
+                    "status": "Unknown",
+                    "reason": "PodNotCreated",
+                    "message": "Predictor pod has not been created. "
+                               "Waiting for ServingRuntime to become ready.",
+                },
+                {
+                    "type": "IngressReady",
+                    "status": "Unknown",
+                    "reason": "PredictorNotReady",
+                    "message": "Waiting for predictor to become ready",
+                },
+            ],
+            "age": "1d",
+        },
+    },
+}
+
+DEPLOYED_MODELS = {}
+
+WORKBENCHES = {
+    "ml-production": [
+        {
+            "name": "data-exploration-nb",
+            "display_name": "Data Exploration",
+            "image": "jupyter-pytorch-ubi9-python-3.9-2024.1",
+            "status": "Running",
+            "cpu_request": "1",
+            "memory_request": "8Gi",
+            "gpu_count": 0,
+            "pvc_name": "data-exploration-nb-pvc",
+            "pvc_size": "20Gi",
+            "pvc_access_mode": "ReadWriteOnce",
+            "creation": "2026-02-10T09:00:00Z",
+        },
+        {
+            "name": "model-training-nb",
+            "display_name": "Model Training",
+            "image": "jupyter-pytorch-ubi9-python-3.9-2024.1",
+            "status": "Stopped",
+            "cpu_request": "4",
+            "memory_request": "16Gi",
+            "gpu_count": 1,
+            "pvc_name": "model-training-nb-pvc",
+            "pvc_size": "50Gi",
+            "pvc_access_mode": "ReadWriteOnce",
+            "creation": "2026-02-15T14:00:00Z",
+        },
+    ],
+}
+
+PIPELINE_SERVERS = {
+    "ml-production": {
+        "configured": True,
+        "data_connection": "prod-model-store",
+        "status": "Ready",
+        "database": "MariaDB",
+    },
+}
+
+NOTEBOOK_IMAGES = [
+    {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]},
+    {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]},
+    {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]},
+    {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []},
+]
+
+
+# ── Project tools ────────────────────────────────────────────────────────
+
+
+@mcp.tool()
+def list_data_science_projects() -> str:
+    """List all RHOAI Data Science Projects on the cluster."""
+    projects = []
+    for name, proj in PROJECTS.items():
+        projects.append({
+            "name": name,
+            "display_name": proj["display_name"],
+            "description": proj.get("description", ""),
+            "model_serving_mode": proj.get("model_serving_mode") or "not configured",
+        })
+    return json.dumps(projects, indent=2)
+
+
+@mcp.tool()
+def create_data_science_project(
+    name: str,
+    display_name: str,
+    description: str = "",
+) -> str:
+    """Create a new RHOAI Data Science Project (namespace with dashboard labels)."""
+    if name in PROJECTS:
+        raise ValueError(
+            f"Project '{name}' already exists. Choose a different name "
+            "or configure the existing project."
+        )
+    if not (name.isascii() and name == name.lower() and name.replace("-", "").isalnum()) or len(name) > 63:
+        raise ValueError(
+            f"Invalid project name '{name}'. Must be DNS-compatible: "
+            "lowercase alphanumeric and hyphens, max 63 chars."
+        )
+
+    PROJECTS[name] = {
+        "name": name,
+        "display_name": display_name,
+        "description": description,
+        "labels": {"opendatahub.io/dashboard": "true"},
+        "model_serving_mode": None,
+        "pipeline_server": False,
+    }
+    DATA_CONNECTIONS[name] = []
+    SERVING_RUNTIMES[name] = []
+    INFERENCE_SERVICES[name] = {}
+
+    return json.dumps({
+        "status": "created",
+        "name": name,
+        "display_name": display_name,
+        "namespace": name,
+        "labels": {"opendatahub.io/dashboard": "true"},
+    })
+
+
+@mcp.tool()
+def get_project_details(name: str) -> str:
+    """Get detailed information about an RHOAI Data Science Project."""
+    if name not in PROJECTS:
+        raise ValueError(f"Project '{name}' not found")
+    proj = PROJECTS[name]
+    dc_count = len(DATA_CONNECTIONS.get(name, []))
+    isvc_count = len(INFERENCE_SERVICES.get(name, {}))
+    return json.dumps({
+        "name": proj["name"],
+        "display_name": proj["display_name"],
+        "description": proj.get("description", ""),
+        "labels": proj["labels"],
+        "data_connections": dc_count,
+        "inference_services": isvc_count,
+        "model_serving_mode": proj.get("model_serving_mode"),
+        "pipeline_server": proj.get("pipeline_server", False),
+    })
+
+
+@mcp.tool()
+def get_project_status(namespace: str) -> str:
+    """Get comprehensive status of an RHOAI Data Science Project."""
+    if namespace not in PROJECTS:
+        raise ValueError(f"Project '{namespace}' not found")
+    proj = PROJECTS[namespace]
+    dcs = DATA_CONNECTIONS.get(namespace, [])
+    isvcs = INFERENCE_SERVICES.get(namespace, {})
+    return json.dumps({
+        "namespace": namespace,
+        "display_name": proj["display_name"],
+        "status": "Active",
+        "components": {
+            "data_connections": len(dcs),
+            "inference_services": len(isvcs),
+            "model_serving_mode": proj.get("model_serving_mode") or "not configured",
+            "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured",
+        },
+    })
+
+
+# ── Data connection tools ────────────────────────────────────────────────
+
+
+@mcp.tool()
+def create_s3_data_connection(
+    namespace: str,
+    name: str,
+    bucket: str,
+    endpoint: str,
+    access_key: str,
+    secret_key: str,
+    region: str = "",
+) -> str:
+    """Create an S3-compatible data connection in an RHOAI project."""
+    if namespace not in PROJECTS:
+        raise ValueError(f"Namespace '{namespace}' is not a Data Science Project")
+
+    existing = DATA_CONNECTIONS.get(namespace, [])
+    if any(dc["name"] == name for dc in existing):
+        raise ValueError(
+            f"Data connection '{name}' already exists in namespace '{namespace}'"
+        )
+
+    dc = {
+        "name": name,
+        "type": "S3",
+        "bucket": bucket,
+        "endpoint": endpoint,
+        "region": region,
+    }
+    DATA_CONNECTIONS.setdefault(namespace, []).append(dc)
+
+    return json.dumps({
+        "status": "created",
+        "name": name,
+        "namespace": namespace,
+        "type": "S3",
+        "bucket": bucket,
+        "endpoint": endpoint,
+    })
+
+
+@mcp.tool()
+def list_data_connections(namespace: str) -> str:
+    """List data connections in an RHOAI project namespace."""
+    if namespace not in PROJECTS:
+        raise ValueError(f"Namespace '{namespace}' is not a Data Science Project")
+    dcs = DATA_CONNECTIONS.get(namespace, [])
+    return json.dumps(dcs, indent=2)
+
+
+# ── Model serving tools ─────────────────────────────────────────────────
+
+
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +@mcp.tool() +def setup_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Alias for configure_pipeline_server. Configure a pipeline server for a Data Science Project.""" + return configure_pipeline_server(namespace, data_connection, database) + + +@mcp.tool() +def get_pipeline_status(namespace: str) -> str: + """Alias for get_pipeline_server_status. 
Get the status of the pipeline server.""" + return get_pipeline_server_status(namespace) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/instruction.md b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/instruction.md new file mode 100644 index 00000000..03edef95 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/instruction.md @@ -0,0 +1,20 @@ +# Data Science Project Setup Task + +You are an AI engineer on Red Hat OpenShift AI. Your team needs a new project to host a fraud detection model, with access to an S3 model store and model serving enabled. 
+
+## Requirements
+- Create a new data science project named `fraud-detection` with a descriptive display name and appropriate labels
+- Configure an S3-compatible data connection for model artifact storage using these details:
+  - Connection name: `model-store`
+  - Bucket: `fraud-models`
+  - Endpoint: `https://s3.us-east-1.amazonaws.com`
+  - Access key: `AKIA1234EXAMPLE`
+  - Secret key: `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
+  - Region: `us-east-1`
+- Enable model serving on the project and confirm which serving runtimes become available
+- Configure the pipeline server using the data connection you created
+- Verify the project is fully configured and ready for model deployment
+
+Document your setup steps, the final project configuration, and any issues encountered in `/root/report.md`. Redact any credentials in the report.
+
+Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work.
diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/solution/solve.sh b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/solution/solve.sh
new file mode 100644
index 00000000..91e7d226
--- /dev/null
+++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/solution/solve.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+cat > /root/report.md << 'REPORT_EOF'
+# Data Science Project Setup
+
+## Project: fraud-detection
+
+### Namespace Configuration
+- Label: `opendatahub.io/dashboard: "true"` (required for RHOAI dashboard visibility)
+- Model serving mode: `single` (single-model serving: each model gets its own model server)
+
+### S3 Data Connection
+- Name: model-store
+- Bucket: fraud-models
+- Endpoint: https://s3.us-east-1.amazonaws.com
+- Access key / Secret key configured (credentials REDACTED in display)
+- Region: us-east-1
+
+### Pipeline Server
+- Data connection: model-store (required for pipeline artifact storage)
+- Pipeline server uses data connection for artifacts
+
+### 
Project Status (get_project_status)
+| Component | Status |
+|-----------|--------|
+| Project | fraud-detection (created) |
+| Data connections | 1 configured |
+| Pipeline server | configured |
+| Model serving | single mode enabled |
+
+### Validation
+- list_data_science_projects: checked for duplicate project name before creation
+REPORT_EOF
diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/task.toml b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/task.toml
new file mode 100644
index 00000000..83c47b69
--- /dev/null
+++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/task.toml
@@ -0,0 +1,26 @@
+version = "1.0"
+
+[metadata]
+id = "rh-ai-engineer__ds-project-setup"
+name = "rh-ai-engineer Data Science Project Setup Skill Evaluation"
+difficulty = "medium"
+category = "per-skill-eval"
+tags = ["rh-ai-engineer", "ds-project-setup", "per-skill-eval"]
+
+[verifier]
+timeout_sec = 900.0
+
+[verifier.env]
+ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}"
+ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
+LLM_JUDGE_MODEL = "claude-sonnet-4-5"
+
+[agent]
+timeout_sec = 900.0
+
+[environment]
+build_timeout_sec = 600.0
+cpus = 2
+gpus = 0
+memory = "5.5G"
+storage = "10G"
diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/llm_judge.py b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/llm_judge.py
new file mode 100644
index 00000000..2d6a34c6
--- /dev/null
+++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/llm_judge.py
@@ -0,0 +1,102 @@
+import json
+import os
+import sys
+import time
+from pathlib import Path
+
+try:
+    from anthropic import Anthropic
+except ImportError:
+    print("ERROR: anthropic package not installed")
+    sys.exit(1)
+
+CRITERIA = [
+    {
+        "id": "partial_credential_redaction",
+        "file": "/root/report.md",
+        "question": "Does the report display S3 credentials using partial redaction — showing the first few characters of the access key (like AKIA****) with the 
rest masked — rather than using placeholder values or full redaction?", + "reference": "A skilled report shows credentials partially redacted: the first few characters of the access key visible followed by asterisks (e.g., 'AKIA****'), and the secret key fully masked ('********'). Using PLACEHOLDER_ACCESS_KEY or completely hiding the access key is insufficient — partial redaction allows verification without exposing the full credential." + }, + { + "id": "secret_manifest_structure", + "file": "/root/report.md", + "question": "Does the report include a Kubernetes Secret manifest (with kind: Secret, apiVersion, metadata, and data fields) showing how the S3 data connection is stored as a K8s resource, rather than just describing the connection narratively?", + "reference": "A skilled report shows the actual K8s Secret YAML structure with kind: Secret, metadata (namespace, name, labels), and data fields containing base64-encoded values. An unskilled report describes the data connection configuration narratively without showing the underlying K8s resource structure." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/test.sh b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + 
-v 2>&1 + +pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/test_outputs.py b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/test_outputs.py new file mode 100644 index 00000000..8978be1d --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__ds-project-setup/tests/test_outputs.py @@ -0,0 +1,113 @@ +""" +Tests for rh-ai-engineer__ds-project-setup per-skill evaluation. +Baseline tests: any competent agent should pass. +Skill-dependent tests: based on empirical gaps between skilled and unskilled agent outputs. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["project", "data science", "namespace"]), ( + "report should mention the project" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_data_connection_secret_keys(self): + """Skill teaches RHOAI data connections are stored as K8s Secrets with specific + key names: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_S3_BUCKET, + AWS_S3_ENDPOINT. Without skill, agents describe connections abstractly.""" + c = read_report() + aws_keys = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_S3_BUCKET", + "AWS_S3_ENDPOINT", "AWS_DEFAULT_REGION"] + mentioned = sum(1 for k in aws_keys if k in c) + assert mentioned >= 2, ( + "should reference specific RHOAI data connection secret key names (AWS_*)" + ) + + def test_credential_partial_redaction(self): + """Skill teaches showing first 4 chars + **** for credentials (e.g., AKIA****). 
+ Without skill, agents use PLACEHOLDER values or full redaction.""" + c = read_report() + has_partial = any(t in c for t in [ + "AKIA****", "AKIA*", "wJal****", "wJal*", + "1234****", "1234*", + ]) + has_stars_with_prefix = "****" in c and any(t in c for t in ["AKIA", "akia"]) + assert has_partial or has_stars_with_prefix, ( + "should use partial credential redaction (first chars visible + ****)" + ) + + def test_k8s_secret_yaml_manifest(self): + """Skill teaches showing the K8s Secret manifest structure for data connections. + Without skill, agents describe connections narratively without YAML.""" + c = read_report() + has_secret_kind = "kind: Secret" in c or "kind:Secret" in c + has_secret_ref = "Secret" in c and ("apiVersion" in c or "metadata" in c) + assert has_secret_kind or has_secret_ref, ( + "should include K8s Secret manifest structure for data connection" + ) + + def test_pipeline_server_with_data_connection(self): + """Skill teaches pipeline server requires a data connection (prerequisite chain). + Without skill, agents skip pipeline server or configure it generically.""" + c = read_report().lower() + has_pipeline = any(t in c for t in ["pipeline server", "pipeline"]) + has_linkage = any(t in c for t in [ + "data connection", "model-store", "artifact storage", + "s3 bucket", "data_connection", + ]) + pipeline_configured = "pipeline" in c and "configured" in c and "not configured" not in c + assert has_pipeline and (has_linkage or pipeline_configured), ( + "should configure pipeline server linked to a data connection" + ) + + def test_base64_secret_values(self): + """Skill teaches showing actual base64-encoded secret values in K8s + Secret YAML manifests. 
Without skill, agents show credentials in + plain text or fully redacted format.""" + c = read_report() + import re + has_base64 = bool(re.search(r'[A-Za-z0-9+/]{12,}={0,2}', c)) + has_opaque = "Opaque" in c + assert has_base64 or has_opaque, ( + "should include base64-encoded values or Opaque secret type in K8s manifest" + ) + + def test_model_serving_mode(self): + """Both agents should configure model serving — easy test.""" + c = read_report().lower() + assert any(t in c for t in [ + "single", "multi", "model serving", "serving mode", + ]), "should configure model serving mode" + + def test_runtime_selection_context(self): + """Docs teach decision context across runtimes: vLLM (PagedAttention), + NIM (TensorRT-LLM, no compilation), Caikit+TGIS (gRPC-only). + Without docs, agents don't provide runtime comparison context.""" + c = read_report().lower() + assert any(t in c for t in [ + "pagedattention", "paged attention", "tensorrt", "grpc", + "caikit", "vllm", "nim", + ]) and any(t in c for t in ["runtime", "serving", "comparison", "select"]), ( + "should compare runtimes with technical characteristics" + ) diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/Dockerfile b/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/Dockerfile new file mode 100644 index 00000000..aac4c84e --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ 
+- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. 
+ +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + 
"conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) if any(p["name"] == name for p in PODS_BY_NAMESPACE.get(namespace, [])) else None + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "cpu_limit": cpu_limit, + "memory_request": memory_request, + # Persist the limit so get_inference_service reports the requested value, not its "8Gi" fallback. + "memory_limit": memory_limit, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/instruction.md b/evaluation/without_skills/rh-ai-engineer__model-deploy/instruction.md new file mode 100644 index 00000000..44f79a58 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/instruction.md @@ -0,0 +1,15 @@ +# Model Deployment Task + +You are an AI engineer on Red Hat OpenShift AI. Your team has trained models ready to serve and needs them deployed as inference endpoints in the `ml-production` project. + +## Requirements +- Examine the existing project, available serving runtimes, and any existing deployments +- Diagnose any failing deployments: check pod conditions, container status, logs, and events to determine root causes +- For GPU memory issues, provide a VRAM budget analysis showing model weight size, KV cache requirements, and available GPU memory — distinguish GPU VRAM constraints from pod system memory limits +- Before recommending fixes, check the namespace environment for resource policies and GPU node scheduling constraints that could block redeployment +- For each failing deployment, provide a complete KServe InferenceService YAML manifest with your recommended fix +- Produce a deployment plan that addresses all identified issues and gets the models serving successfully + +Document your deployment plan, diagnosed issues, environment validation, and recommended fixes in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/solution/solve.sh b/evaluation/without_skills/rh-ai-engineer__model-deploy/solution/solve.sh new file mode 100644 index 00000000..05b7171e --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/solution/solve.sh @@ -0,0 +1,63 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Model Deployment Plan + +## Diagnosed Issues + +### GPU VRAM Budget Analysis +The vLLM OOM is a **GPU VRAM constraint**, not a pod system memory issue: +- Model weights: ~13.5 GiB loaded into GPU +- KV cache allocation: ~28.5 GiB (at default max_model_len=32768) +- Available VRAM after model load: ~10.1 GiB on A10G (24576 MiB total) +- **Root cause**: Default max_model_len=32768 causes KV cache to exhaust GPU VRAM +- **Fix**: Set MAX_MODEL_LEN=4096 or GPU_MEMORY_UTILIZATION=0.85 + +### LimitRange Conflict +- Namespace LimitRange min CPU: 100m +- KServe sidecar containers request: 10m CPU, 15Mi memory +- **CONFLICT**: Sidecar resources below LimitRange minimum +- Fix: Adjust LimitRange or use annotation to override + +### GPU Node Taints +- GPU nodes may have taint ai-app=true:NoSchedule +- Add matching tolerations to InferenceService predictor spec + +### NIMAccount Dependency +- NIM deployments require a NIMAccount CR to be ready before ServingRuntime can pull images +- Check for NIMAccountNotReady condition if ImagePullBackOff occurs + +## Recommended InferenceService YAML + +```yaml +apiVersion: serving.kserve.io/v1beta1 +kind: InferenceService +metadata: + name: llama-3-8b + namespace: ml-production + annotations: + serving.kserve.io/deploymentMode: RawDeployment +spec: + predictor: + model: + modelFormat: + name: vLLM + runtime: vllm-cuda-runtime + storageUri: "hf://meta-llama/Llama-3-8B" + resources: + requests: + cpu: "4" + memory: "32Gi" + nvidia.com/gpu: "1" + containers: + - name: kserve-container + env: + - name: MAX_MODEL_LEN + value: "4096" + - name: GPU_MEMORY_UTILIZATION + value: 
"0.85" +``` + +## Endpoint +- get_model_endpoint for inference URL +- vLLM: /v1/completions, KServe v2: /v2/models/[model]/infer +REPORT_EOF diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/task.toml b/evaluation/without_skills/rh-ai-engineer__model-deploy/task.toml new file mode 100644 index 00000000..90674851 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__model-deploy" +name = "rh-ai-engineer Model Deployment Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "model-deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/llm_judge.py b/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/llm_judge.py new file mode 100644 index 00000000..5cd7c20e --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "vram_budget_methodology", + "file": "/root/report.md", + "question": "Does the report present a specific GPU VRAM budget calculation for the vLLM OOM issue — showing the model weight size (~13.5 GiB), the KV cache allocation requirement (~28.5 GiB), and the available VRAM after model load (~10.1 GiB) — and explicitly state that this is a GPU VRAM constraint, NOT a pod system memory issue?", + "reference": "A skilled report shows a VRAM budget breakdown: model 
weights (~13.5 GiB) loaded into GPU, KV cache requiring ~28.5 GiB, but only ~10.1 GiB available on the 24 GB A10G after model load. It explicitly distinguishes GPU VRAM from pod memory (system RAM). A report that says 'OOMKilled' and recommends increasing pod memory from 16Gi to 32Gi WITHOUT this GPU VRAM analysis is insufficient." + }, + { + "id": "rhoai_deployment_conventions", + "file": "/root/report.md", + "question": "Does the report use RHOAI-specific deployment conventions such as the RawDeployment annotation and GPU_MEMORY_UTILIZATION environment variable configuration, rather than generic Kubernetes deployment patterns?", + "reference": "A skilled report uses serving.kserve.io/deploymentMode: RawDeployment annotation and configures vLLM tuning parameters (GPU_MEMORY_UTILIZATION, MAX_MODEL_LEN) as environment variables in the InferenceService spec. It also identifies NIMAccount CR dependencies for NIM deployments. A report that uses generic Kubernetes deployments or command-line args without RHOAI-specific annotations is insufficient." + }, + { + "id": "kserve_yaml_manifest", + "file": "/root/report.md", + "question": "Does the report include a complete KServe InferenceService YAML manifest with the serving.kserve.io/v1beta1 apiVersion, including metadata (name, namespace) and spec.predictor with model format, storage URI, resource requests, and GPU count?", + "reference": "A skilled report provides a deployable InferenceService YAML with apiVersion: serving.kserve.io/v1beta1, kind: InferenceService, and a complete spec including predictor with model format, runtime reference, storage URI, resource requests (CPU, memory, GPU), and environment variables (MAX_MODEL_LEN). A report that only describes fixes in narrative or MCP tool call format without a formal YAML manifest is insufficient." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/test.sh b/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + anthropic>=0.75.0 + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + 
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/test_outputs.py b/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/test_outputs.py new file mode 100644 index 00000000..0669d687 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__model-deploy/tests/test_outputs.py @@ -0,0 +1,94 @@ +""" +Tests for rh-ai-engineer__model-deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["model", "deploy", "inference", "serving"]), ( + "report should mention model deployment" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_vram_budget_analysis(self): + """Skill teaches GPU VRAM budget: model weights (13.5 GiB) + KV cache (28.5 GiB) + exceeds A10G capacity (24 GB). Without skill, agents report OOM with approximate + numbers (~14GB) without KV cache sizing or available VRAM calculation.""" + c = read_report() + assert any(t in c for t in [ + "28.5", "10.1 GiB", "10.1 GB", "24576", + ]), ( + "should include specific VRAM budget numbers " + "(KV cache size ~28.5 GiB, available VRAM ~10.1 GiB, or total GPU VRAM 24576 MiB)" + ) + + def test_default_context_window_32768(self): + """Skill teaches that vLLM default max_model_len=32768 causes KV cache to exhaust + GPU VRAM on A10G. 
Without skill, agents report OOM without identifying the specific + default value that triggers the oversized KV cache allocation.""" + c = read_report() + assert "32768" in c or "32,768" in c, ( + "should identify max_model_len=32768 as the specific vLLM default causing GPU OOM" + ) + + def test_kserve_yaml_apiversion(self): + """Skill teaches creating InferenceService YAML with serving.kserve.io/v1beta1. + Without skill, agents describe fixes via MCP tool calls or narrative without + providing a formal KServe YAML manifest with the correct apiVersion.""" + c = read_report() + assert "serving.kserve.io/v1beta1" in c, ( + "should include InferenceService YAML manifest with serving.kserve.io/v1beta1 apiVersion" + ) + + def test_raw_deployment_mode(self): + """Skill teaches using serving.kserve.io/deploymentMode: RawDeployment annotation + for RHOAI model deployments. Without skill, agents omit this RHOAI-specific + annotation, which controls how KServe deploys the predictor.""" + c = read_report() + assert "RawDeployment" in c or "deploymentMode" in c, ( + "should include RawDeployment annotation (RHOAI deployment mode)" + ) + + def test_known_model_profile(self): + """Docs teach known model profiles: e.g., Llama 3.1 8B needs 1 GPU with 16GB VRAM, + --max-model-len=4096; 70B needs 4xA100 80GB with --tensor-parallel-size=4. + Without docs, agents can't size GPU allocation per model.""" + c = read_report().lower() + assert any(t in c for t in [ + "max-model-len", "max_model_len", "tensor-parallel-size", + "tensor_parallel_size", "16gb", "a100", "a10g", + ]) or ("gpu" in c and ("vram" in c or "model" in c and "profile" in c)), ( + "should reference known model GPU profiles for deployment sizing" + ) + + def test_nim_account_cr(self): + """Skill teaches that NIM deployments require a NIMAccount CR to be ready + before the ServingRuntime can pull images. 
Without skill, agents diagnose + ImagePullBackOff generically without identifying the NIMAccount dependency.""" + c = read_report() + assert any(t in c for t in [ + "NIMAccount", "NimAccount", "nim-account", "NIM Account", + "NIMAccountNotReady", + ]), "should identify NIMAccount CR as prerequisite for NIM deployment" diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/Dockerfile b/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/Dockerfile new file mode 100644 index 00000000..aac4c84e --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' 
> /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..d43c891d --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,540 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. + +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import base64 +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": 
{"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_create_or_update( + api_version: str, + kind: str, + namespace: str, + name: str, + body: str, +) -> str: + """Create or update a Kubernetes resource. Accepts api_version, kind, namespace, name, and body (a JSON string).""" + try: + resource = json.loads(body) + except json.JSONDecodeError as e: + raise ValueError(f"Invalid JSON body: {e}") from e + + resource.setdefault("metadata", {}) + resource["metadata"]["name"] = name + resource["metadata"]["namespace"] = namespace + resource["apiVersion"] = api_version + resource["kind"] = kind + + if kind == "Secret": + resource.setdefault("type", "Opaque") + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"Secret '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + if kind in ("NIMAccount", "Account") and "nim" in api_version.lower(): + resource.setdefault("status", {}) + resource["status"]["conditions"] = [ + { + "type": "Ready", + "status": "True", + "reason": "NGCCredentialsValid", + "message": "NGC API key validated successfully", + "lastTransitionTime": "2026-03-17T12:00:00Z", + }, + ] + resource["status"]["nimPullSecretStatus"] = "Ready" +
resource["status"]["nimConfigStatus"] = "Ready" + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"NIM Account '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + if kind == "ConfigMap": + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"ConfigMap '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + raise ValueError(f"Unsupported kind for create/update: {kind}") + + +@mcp.tool() +def create_secret( + namespace: str, + name: str, + data: dict, + type: str = "Opaque", +) -> str: + """Create a Secret in a namespace. data is a dict of key-value pairs (values will be base64-encoded).""" + if isinstance(data, str): + data = json.loads(data) + encoded_data = {k: base64.b64encode(str(v).encode()).decode() for k, v in data.items()} + resource = { + "apiVersion": "v1", + "kind": "Secret", + "metadata": {"name": name, "namespace": namespace}, + "type": type, + "data": encoded_data, + } + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"Secret '{name}' created in namespace '{namespace}'", + }, indent=2) + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": 
[lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__nim-setup/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").isalnum() or name != name.lower() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: +
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/instruction.md b/evaluation/without_skills/rh-ai-engineer__nim-setup/instruction.md new file mode 100644 index 00000000..f0b5fa2c --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__nim-setup/instruction.md @@ -0,0 +1,17 @@ +# NVIDIA NIM Setup Task + +You are an AI engineer on Red Hat OpenShift AI. Your team wants to deploy NVIDIA NIM for GPU-accelerated inference, but the cluster has not been set up for it yet. + +## Scenario +The ML team needs to serve models using NVIDIA's inference microservices. The cluster has GPUs available, but the necessary platform components and credentials have not been configured. You need to assess readiness and produce a complete setup plan. + +## Requirements +- Verify operator prerequisites (GPU Operator and NFD Operator) by checking their ClusterServiceVersion status +- Assess the current cluster state to determine what NIM infrastructure is already in place and what is missing +- Document the complete setup procedure including: the exact Kubernetes Secret manifests (with types, data key names, and structure) needed for NGC authentication, and the NIM Account custom resource with its correct API group and spec fields +- Provide the YAML manifests for each resource that needs to be created, using the correct RHOAI-specific API versions and resource naming conventions +- Flag any potential issues or blockers discovered during your assessment + +Document your assessment and setup plan in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/solution/solve.sh b/evaluation/without_skills/rh-ai-engineer__nim-setup/solution/solve.sh new file mode 100644 index 00000000..accbf7fe --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__nim-setup/solution/solve.sh @@ -0,0 +1,28 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# NIM Setup Plan + +## Prerequisites +- GPU Operator CSV in nvidia-gpu-operator namespace (gpu-operator-certified); NIM requires NVIDIA GPUs, with vLLM as the fallback when none are available +- NFD (Node Feature Discovery) Operator CSV in openshift-nfd + +## NGC Secrets +- API key secret: ngc-api-key (type: Opaque, stringData key: NGC_API_KEY) +- Image pull secret: ngc-image-pull-secret (type: kubernetes.io/dockerconfigjson) + - Registry: nvcr.io + - Username: $oauthtoken + - Password: NGC API key + +## NIM Account CR (nim.opendatahub.io/v1) +```yaml +apiVersion: nim.opendatahub.io/v1 +kind: Account +metadata: + name: nim-account +spec: + apiKeySecret: + name: ngc-api-key + imagePullSecret: + name: ngc-image-pull-secret +``` +REPORT_EOF diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/task.toml b/evaluation/without_skills/rh-ai-engineer__nim-setup/task.toml new file mode 100644 index 00000000..7b53288a --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__nim-setup/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__nim-setup" +name = "rh-ai-engineer NVIDIA NIM Setup Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "nim-setup", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/llm_judge.py b/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/llm_judge.py new file mode 100644 index 00000000..a3c29b06 --- /dev/null +++
b/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "opendatahub_api_group", + "file": "/root/report.md", + "question": "Does the report use nim.opendatahub.io as the API group for the NIM Account custom resource, rather than the upstream nim.nvidia.com?", + "reference": "A skilled report specifies apiVersion: nim.opendatahub.io/v1 for the Account CR, which is the RHOAI-specific API group. An unskilled report uses nim.nvidia.com/v1alpha1 (the upstream NVIDIA API group) which is incorrect for Red Hat OpenShift AI." + }, + { + "id": "secret_naming_and_types", + "file": "/root/report.md", + "question": "Does the report create an image pull secret named ngc-image-pull-secret with type kubernetes.io/dockerconfigjson, and an API key secret with stringData containing the NGC_API_KEY field?", + "reference": "A skilled report creates ngc-image-pull-secret (type: kubernetes.io/dockerconfigjson) for nvcr.io registry access, and ngc-api-key (type: Opaque, stringData: NGC_API_KEY) for runtime auth. An unskilled report uses generic names like nvcr-credentials, kubectl shorthands without explicit types, or data.api_key instead of stringData.NGC_API_KEY." + }, + { + "id": "operator_csv_verification", + "file": "/root/report.md", + "question": "Does the report verify gpu-operator-certified and NFD (Node Feature Discovery) Operator as prerequisites, checking their ClusterServiceVersion status?", + "reference": "A skilled report checks for gpu-operator-certified (the specific CSV name, not just 'gpu-operator') and the NFD Operator in openshift-nfd namespace. An unskilled report either skips NFD entirely or uses generic gpu-operator references without the certified CSV name." 
+ } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + 
print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/test.sh b/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo
"════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward 
(pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/test_outputs.py b/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/test_outputs.py new file mode 100644 index 00000000..ad1f22ef --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__nim-setup/tests/test_outputs.py @@ -0,0 +1,89 @@ +""" +Tests for rh-ai-engineer__nim-setup per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert "nim" in content, "report should mention NIM" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_opendatahub_nim_api(self): + """Skill teaches nim.opendatahub.io as the RHOAI API group for NIM Account CR. + Without skill, agents use upstream nim.nvidia.com API group.""" + c = read_report() + assert "nim.opendatahub.io" in c, ( + "should use nim.opendatahub.io as the NIM Account CR API group (not nim.nvidia.com)" + ) + + def test_ngc_image_pull_secret_name(self): + """Skill teaches ngc-image-pull-secret as the specific secret name for nvcr.io. 
+ Without skill, agents use generic names like nvcr-credentials.""" + c = read_report() + assert "ngc-image-pull-secret" in c, ( + "should use ngc-image-pull-secret as the image pull secret name" + ) + + def test_dockerconfigjson_secret_type(self): + """Skill teaches kubernetes.io/dockerconfigjson as the secret type for image pull. + Without skill, agents use kubectl docker-registry shorthand without explicit type.""" + c = read_report().lower() + assert "dockerconfigjson" in c, ( + "should specify dockerconfigjson as the image pull secret type" + ) + + def test_gpu_operator_certified_csv(self): + """Skill teaches checking gpu-operator-certified CSV by name. + Without skill, agents check generically for gpu-operator.""" + c = read_report().lower() + assert "gpu-operator-certified" in c, ( + "should verify gpu-operator-certified ClusterServiceVersion by name" + ) + + def test_nfd_operator_reference(self): + """Skill teaches verifying NFD (Node Feature Discovery) Operator as a prerequisite. + Without skill, agents skip NFD verification entirely.""" + c = read_report().lower() + assert "nfd" in c, ( + "should verify NFD (Node Feature Discovery) Operator as a prerequisite" + ) + + def test_stringdata_secret_field(self): + """Skill teaches using stringData in Secret YAML for NGC API key (no base64 needed). + Without skill, agents use kubectl --from-literal or data with base64.""" + c = read_report() + assert "stringData" in c or "stringdata" in c.lower(), ( + "should use stringData field in Secret YAML manifest for API key" + ) + + def test_nvidia_gpu_only(self): + """Docs emphasize NIM requires NVIDIA GPUs only; fallback to vLLM when + NVIDIA GPUs unavailable. 
Without docs, agents don't mention this constraint.""" + c = read_report().lower() + assert any(t in c for t in [ + "nvidia gpu", "nvidia only", "fallback", "vllm", + ]) and ("nim" in c or "gpu" in c), ( + "should note NIM requires NVIDIA GPUs with vLLM fallback" + ) diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/Dockerfile b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/Dockerfile new file mode 100644 index 00000000..aac4c84e --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > 
/root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..cad5f77b --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,529 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. + +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + 
}, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_create_or_update( + api_version: str, + kind: str, + namespace: str, + name: str, + body: str, +) -> str: + """Create or update a Kubernetes resource. Accepts apiVersion, kind, namespace, name, and body (JSON).""" + try: + resource = json.loads(body) + except json.JSONDecodeError as e: + raise ValueError(f"Invalid JSON body: {e}") from e + + resource.setdefault("metadata", {}) + resource["metadata"]["name"] = name + resource["metadata"]["namespace"] = namespace + resource["apiVersion"] = api_version + resource["kind"] = kind + + if kind == "ServingRuntime": + resource.setdefault("status", {}) + resource["status"]["conditions"] = [ + { + "type": "Ready", + "status": "True", + "reason": "ServingRuntimeReady", + "message": "ServingRuntime is ready", + "lastTransitionTime": "2026-03-17T12:00:00Z", + }, + ] + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"ServingRuntime '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + if kind == "Secret": + resource.setdefault("type", "Opaque") + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"Secret '{name}' 
created/updated in namespace '{namespace}'", + }, indent=2) + + if kind in ("NIMAccount", "Account") and "nim" in api_version.lower(): + resource.setdefault("status", {}) + resource["status"]["conditions"] = [ + { + "type": "Ready", + "status": "True", + "reason": "NGCCredentialsValid", + "message": "NGC API key validated successfully", + "lastTransitionTime": "2026-03-17T12:00:00Z", + }, + ] + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"NIM Account '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + if kind == "ConfigMap": + return json.dumps({ + "status": "created", + "resource": resource, + "message": f"ConfigMap '{name}' created/updated in namespace '{namespace}'", + }, indent=2) + + raise ValueError(f"Unsupported kind for create/update: {kind}") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label 
selector.""" + pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..0ae9e4cb --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,780 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode") or "not configured", + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + # DNS-1123 label: lowercase alphanumerics and hyphens, no leading/trailing hyphen. + valid_chars = name.isascii() and name.replace("-", "").isalnum() and name == name.lower() + if not valid_chars or name.startswith("-") or name.endswith("-") or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + # The key exists with value None on new projects, so a .get() default would never apply. + "model_serving_mode": proj.get("model_serving_mode") or "not configured", + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "cpu_limit": cpu_limit, + "memory_request": memory_request, + "memory_limit": memory_limit, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. 
WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list[str] | None = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/instruction.md b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/instruction.md new file mode 100644 index 00000000..d89e7c6a --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/instruction.md @@ -0,0 +1,19 @@ +# Serving Runtime Configuration Task + +You are an AI engineer on Red Hat OpenShift AI. Your team needs to serve a model using a custom inference engine that is not available as a default runtime on the platform. + +## Scenario +The existing platform-provided serving runtimes do not support the model format your team needs. You must create a custom runtime configuration that integrates properly with the platform and can be used to deploy models. + +## Requirements +- Examine the currently available serving runtimes and platform templates, distinguishing which are already instantiated versus which require instantiation before use +- Design a custom ServingRuntime CR that specifies the inference container, supported model formats, resource requirements, and API protocol +- Follow KServe container naming conventions so the runtime integrates correctly with the platform's model serving framework +- For runtimes supporting multiple model formats, explain how autoSelect should be configured to avoid format conflicts +- Explain where GPU resource allocation belongs (in the ServingRuntime vs in the InferenceService) and why +- Ensure the runtime will be visible and usable from the platform dashboard +- Document your design decisions and trade-offs + +Document your configuration plan and the complete runtime specification in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/solution/solve.sh b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/solution/solve.sh new file mode 100644 index 00000000..043771f9 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/solution/solve.sh @@ -0,0 +1,31 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# ServingRuntime Configuration + +## Custom Runtime: triton-onnx + +Platform templates: list_serving_runtimes with include_templates: true. Templates with requires_instantiation: true use create_serving_runtime. + +```yaml +apiVersion: serving.kserve.io/v1alpha1 +kind: ServingRuntime +metadata: + name: triton-onnx-runtime + labels: + opendatahub.io/dashboard: "true" +spec: + supportedModelFormats: + - name: onnx + version: "1" + autoSelect: true + multiModel: false + containers: + - name: kserve-container + image: nvcr.io/nvidia/tritonserver:latest + ports: + - containerPort: 8080 + protocol: TCP +``` + +### Key: supportedModelFormats.name must match InferenceService modelFormat.name +REPORT_EOF diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/task.toml b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/task.toml new file mode 100644 index 00000000..8ee93afa --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__serving-runtime-config" +name = "rh-ai-engineer Serving Runtime Configuration Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "serving-runtime-config", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" 
+storage = "10G" diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/llm_judge.py b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/llm_judge.py new file mode 100644 index 00000000..11fdec60 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "kserve_container_naming", + "file": "/root/report.md", + "question": "Does the ServingRuntime YAML in the report name the main container 'kserve-container' (the required KServe naming convention)?", + "reference": "A skilled report names the container kserve-container in the ServingRuntime spec, which is required by KServe for the model serving framework to function correctly. An unskilled report might use a framework-specific name like 'triton' or 'vllm', which would cause KServe integration issues." + }, + { + "id": "gpu_allocation_strategy", + "file": "/root/report.md", + "question": "Does the report explain that GPU resources should NOT be hardcoded in the ServingRuntime and instead should be allocated at the InferenceService level for flexibility?", + "reference": "A skilled report explains that GPU resources (nvidia.com/gpu) belong at the InferenceService deployment level because different models need 0, 1, or multiple GPUs. The ServingRuntime should remain GPU-agnostic. An unskilled report hardcodes nvidia.com/gpu: 1 directly in the ServingRuntime spec." 
+ }, + { + "id": "autoselect_and_api_conventions", + "file": "/root/report.md", + "question": "Does the report configure autoSelect: false for non-primary model formats and use the correct ServingRuntime API version (v1alpha1)?", + "reference": "A skilled report uses autoSelect: true only for the primary format and false for secondary formats to prevent conflicts, and uses the serving.kserve.io/v1alpha1 API version for ServingRuntime (distinct from v1beta1 used for InferenceService). An unskilled report sets autoSelect: true for all formats or uses the wrong API version." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/test.sh b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/test_outputs.py b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/test_outputs.py new file mode 100644 index 00000000..71257bf2 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__serving-runtime-config/tests/test_outputs.py @@ -0,0 +1,97 @@ +""" +Tests for rh-ai-engineer__serving-runtime-config per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["servingruntime", "serving runtime", "runtime"]), ( + "report should mention ServingRuntime" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_kserve_container_name(self): + """Skill teaches the main container MUST be named kserve-container for KServe + compatibility. Without skill, agents use framework-specific names like 'triton'.""" + c = read_report() + assert "kserve-container" in c, ( + "should name the main container 'kserve-container' (required by KServe)" + ) + + def test_serving_runtime_api_version(self): + """Skill teaches ServingRuntime uses serving.kserve.io/v1alpha1 API (alpha, + not beta like InferenceService). 
Without skill, agents use v1beta1 or omit + the apiVersion distinction between ServingRuntime and InferenceService.""" + c = read_report() + assert "v1alpha1" in c or ( + "alpha" in c.lower() and "serving" in c.lower() + ), "should use v1alpha1 API version for ServingRuntime" + + def test_autoselect_false_for_secondary(self): + """Skill teaches using autoSelect: true only for primary format and false for + secondary formats to avoid conflicts. Without skill, agents set true for all.""" + c = read_report().lower() + assert "autoselect: false" in c or "autoselect\":false" in c or "autoselect\": false" in c, ( + "should use autoSelect: false for non-primary model formats" + ) + + def test_gpu_at_inferenceservice_level(self): + """Skill teaches not hardcoding GPU in ServingRuntime; GPU allocation belongs + at the InferenceService level for flexibility. Without skill, agents hardcode + nvidia.com/gpu in the runtime spec.""" + c = read_report().lower() + assert any(t in c for t in [ + "inferenceservice level", "inferenceservice deployment", + "per inferenceservice", "not specified in the servingruntime", + "gpu allocation happens at", + ]), "should explain GPU allocation belongs at InferenceService level, not in the runtime" + + def test_model_format_matching(self): + """Skill teaches that supportedModelFormats must match InferenceService model + format for runtime selection.""" + c = read_report().lower() + assert any(t in c for t in [ + "model format", "supportedmodelformat", "supported model format", + "inferenceservice", "match", + ]), "should address model format matching for runtime selection" + + def test_dashboard_label(self): + """Skill teaches opendatahub.io/dashboard label for dashboard visibility.""" + c = read_report().lower() + assert any(t in c for t in [ + "opendatahub", "dashboard", "label", "visible", + "platform", "display", + ]), "should address dashboard/platform visibility via labels" + + def test_caikit_tgis_grpc(self): + """Docs teach Caikit+TGIS 
is gRPC-only (no REST API) and NIM uses + TensorRT-LLM with pre-compiled engines. Without docs, agents assume REST + for all runtimes.""" + c = read_report().lower() + assert any(t in c for t in [ + "grpc", "caikit", "tgis", "tensorrt", + ]) and ("runtime" in c or "serving" in c), ( + "should note Caikit+TGIS gRPC-only or NIM TensorRT-LLM characteristics" + ) diff --git a/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/Dockerfile b/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/Dockerfile new file mode 100644 index 00000000..aac4c84e --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhoai": { \ + "command": "python3", \ + 
"args": ["/root/.mcp-servers/mock-rhoai-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..e7a4d11c --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,457 @@ +#!/usr/bin/env python3 +"""Mock OpenShift MCP server for SkillsBench rh-ai-engineer task. + +Simulates Kubernetes resource CRUD, pod management, logs, and events. + +Key scenario elements: +- LimitRange in namespaces: min CPU=100m, min memory=128Mi + (conflicts with KServe sidecar containers hardcoded at 10m CPU/15Mi memory) +- GPU node with custom taint ai-workload=true:NoSchedule +- NIM Account CR in ml-production: not ready (NGC credentials invalid) +- text-gen-legacy pods: OOMKilled (max-model-len=32768 on A10G) +- nim-llama-prod: no pods created (Account CR not ready) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + +# ── Cluster state ──────────────────────────────────────────────────────── + +GPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "gpu-worker-1", + "labels": { + "node-role.kubernetes.io/worker": "", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-A10G", + }, + }, + "spec": { + "taints": [ + { + "key": "ai-workload", + "value": "true", + "effect": "NoSchedule", + }, + ], + }, + "status": { + "allocatable": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "capacity": { + "cpu": "48", + "memory": "192Gi", + "nvidia.com/gpu": "2", + "pods": "250", + }, + "conditions": [ + {"type": "Ready", "status": "True"}, + ], + }, +} + +CPU_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "cpu-worker-1", + 
"labels": { + "node-role.kubernetes.io/worker": "", + }, + }, + "spec": {"taints": []}, + "status": { + "allocatable": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "capacity": {"cpu": "16", "memory": "64Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +MASTER_NODE = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": "master-1", + "labels": { + "node-role.kubernetes.io/master": "", + "node-role.kubernetes.io/control-plane": "", + }, + }, + "spec": { + "taints": [ + {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}, + ], + }, + "status": { + "allocatable": {"cpu": "8", "memory": "32Gi", "pods": "250"}, + "conditions": [{"type": "Ready", "status": "True"}], + }, +} + +ALL_NODES = [GPU_NODE, CPU_NODE, MASTER_NODE] + +# LimitRange applied by cluster policy to all DS project namespaces +NAMESPACE_LIMITRANGE = { + "apiVersion": "v1", + "kind": "LimitRange", + "metadata": { + "name": "default-limits", + }, + "spec": { + "limits": [ + { + "type": "Container", + "default": { + "cpu": "2", + "memory": "4Gi", + }, + "defaultRequest": { + "cpu": "500m", + "memory": "256Mi", + }, + "min": { + "cpu": "100m", + "memory": "128Mi", + }, + "max": { + "cpu": "32", + "memory": "128Gi", + }, + }, + ], + }, +} + +NIM_ACCOUNT_CR = { + "apiVersion": "nim.opendatahub.io/v1", + "kind": "Account", + "metadata": { + "name": "nim-account", + "namespace": "ml-production", + }, + "spec": { + "apiKeySecret": { + "name": "ngc-api-key", + }, + }, + "status": { + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "NGCCredentialsInvalid", + "message": "NGC API key validation failed: 401 Unauthorized. " + "The API key in secret 'ngc-api-key' is expired or invalid. 
" + "Re-create the secret with a valid NGC API key from " + "https://ngc.nvidia.com/setup/api-key and restart the " + "Account reconciliation.", + "lastTransitionTime": "2026-03-14T12:00:00Z", + }, + ], + "nimPullSecretStatus": "Failed", + "nimConfigStatus": "Pending", + }, +} + +SERVING_RUNTIME_VLLM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "vllm-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "vLLM", "version": "1", "autoSelect": True}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "quay.io/modh/vllm:rhoai-2.16", + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + }, + ], + }, +} + +SERVING_RUNTIME_NIM = { + "apiVersion": "serving.kserve.io/v1alpha1", + "kind": "ServingRuntime", + "metadata": { + "name": "nim-serving-runtime", + }, + "spec": { + "supportedModelFormats": [ + {"name": "NIM", "version": "1"}, + ], + "containers": [ + { + "name": "kserve-container", + "image": "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest", + "ports": [{"containerPort": 8000, "protocol": "TCP"}], + "env": [ + {"name": "NGC_API_KEY", "valueFrom": { + "secretKeyRef": {"name": "ngc-api-key", "key": "api_key"}, + }}, + ], + }, + ], + }, +} + +PODS_BY_NAMESPACE = { + "ml-production": [ + { + "name": "text-gen-legacy-predictor-00001-abc12", + "namespace": "ml-production", + "status": "CrashLoopBackOff", + "restarts": 5, + "node": "gpu-worker-1", + "containers": [ + { + "name": "kserve-container", + "state": "waiting", + "reason": "CrashLoopBackOff", + "last_termination_reason": "OOMKilled", + "last_termination_exit_code": 137, + }, + ], + "labels": { + "serving.kserve.io/inferenceservice": "text-gen-legacy", + }, + "gpu": "1", + }, + # nim-llama-prod: NO pods created (Account CR not ready) + ], +} + +POD_LOGS = { + "text-gen-legacy-predictor-00001-abc12": ( + "INFO 2026-03-01 10:00:00 vllm_engine.py:125] vLLM engine starting...\n" + "INFO 2026-03-01 10:00:01 config.py:89] Model: 
mistralai/Mistral-7B-Instruct-v0.3\n" + "INFO 2026-03-01 10:00:01 config.py:92] max_model_len = 32768\n" + "INFO 2026-03-01 10:00:02 gpu_executor.py:45] GPU 0: NVIDIA A10G (24576 MiB)\n" + "INFO 2026-03-01 10:00:03 model_runner.py:88] Loading model weights...\n" + "INFO 2026-03-01 10:00:15 model_runner.py:112] Model weights loaded: 13.5 GiB\n" + "INFO 2026-03-01 10:00:15 worker.py:201] Allocating KV cache...\n" + "ERROR 2026-03-01 10:00:16 worker.py:215] torch.cuda.OutOfMemoryError: " + "CUDA out of memory. Tried to allocate 28.5 GiB for KV cache but only " + "10.1 GiB available after loading model weights (13.5 GiB).\n" + "ERROR 2026-03-01 10:00:16 vllm_engine.py:178] Engine failed to start\n" + "Traceback (most recent call last):\n" + " File \"/opt/vllm/vllm/engine/engine.py\", line 175, in start\n" + " self._init_kv_cache()\n" + " File \"/opt/vllm/vllm/worker/worker.py\", line 215, in _init_kv_cache\n" + " raise torch.cuda.OutOfMemoryError(msg)\n" + "torch.cuda.OutOfMemoryError: CUDA out of memory\n" + ), +} + +EVENTS_BY_NAMESPACE = { + "ml-production": [ + { + "type": "Warning", + "reason": "BackOff", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Back-off restarting failed container kserve-container in pod " + "text-gen-legacy-predictor-00001-abc12", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Warning", + "reason": "OOMKilled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Container kserve-container was OOMKilled (exit code 137). 
" + "GPU memory exhausted during KV cache allocation.", + "count": 5, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-03-01T10:00:16Z", + }, + { + "type": "Normal", + "reason": "Scheduled", + "object": "Pod/text-gen-legacy-predictor-00001-abc12", + "message": "Successfully assigned ml-production/" + "text-gen-legacy-predictor-00001-abc12 to gpu-worker-1", + "count": 1, + "first_timestamp": "2026-02-28T08:00:00Z", + "last_timestamp": "2026-02-28T08:00:00Z", + }, + { + "type": "Warning", + "reason": "NIMAccountNotReady", + "object": "InferenceService/nim-llama-prod", + "message": "NIM Account 'nim-account' in namespace 'ml-production' " + "is not ready", + "count": 12, + "first_timestamp": "2026-03-14T12:00:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + { + "type": "Warning", + "reason": "ImagePullBackOff", + "object": "InferenceService/nim-llama-prod", + "message": "Failed to pull image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:" + "latest': unauthorized: authentication required", + "count": 8, + "first_timestamp": "2026-03-14T12:05:00Z", + "last_timestamp": "2026-03-15T10:00:00Z", + }, + ], +} + + +# ── Resource tools ─────────────────────────────────────────────────────── + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: str = "", +) -> str: + """Get a single Kubernetes resource by apiVersion, kind, and name.""" + if kind == "Node": + for node in ALL_NODES: + if node["metadata"]["name"] == name: + return json.dumps(node, indent=2) + raise ValueError(f"Node '{name}' not found") + + if kind == "ServingRuntime": + if name == "vllm-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_VLLM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + if name == "nim-serving-runtime": + cr = json.loads(json.dumps(SERVING_RUNTIME_NIM)) + cr["metadata"]["namespace"] = namespace or "ml-production" + return json.dumps(cr, indent=2) + raise 
ValueError(f"ServingRuntime '{name}' not found in namespace '{namespace}'") + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps(lr, indent=2) + + if kind == "Account" and "nim" in apiVersion.lower(): + if namespace == "ml-production" and name == "nim-account": + return json.dumps(NIM_ACCOUNT_CR, indent=2) + raise ValueError( + f"Account '{name}' not found in namespace '{namespace}'" + ) + + if kind == "ClusterVersion" and apiVersion == "config.openshift.io/v1": + return json.dumps({ + "apiVersion": "config.openshift.io/v1", + "kind": "ClusterVersion", + "metadata": {"name": "version"}, + "status": {"desired": {"version": "4.16.3"}}, + }) + + raise ValueError(f"Resource {apiVersion}/{kind}/{name} not found") + + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: str = "", + labelSelector: str = "", +) -> str: + """List Kubernetes resources by apiVersion and kind.""" + if kind == "Node": + nodes = ALL_NODES + if labelSelector: + parts = labelSelector.split("=", 1) + key = parts[0] + value = parts[1] if len(parts) > 1 else "" + nodes = [ + n for n in nodes + if n["metadata"]["labels"].get(key) == value + ] + return json.dumps(nodes, indent=2) + + if kind == "Service" and apiVersion == "serving.knative.dev/v1": + return json.dumps({ + "kind": "ServiceList", + "apiVersion": "serving.knative.dev/v1", + "items": [], + "metadata": {}, + }) + + if kind == "LimitRange": + lr = json.loads(json.dumps(NAMESPACE_LIMITRANGE)) + lr["metadata"]["namespace"] = namespace + return json.dumps({ + "kind": "LimitRangeList", + "items": [lr], + }) + + if kind == "InferenceService": + return json.dumps({ + "kind": "InferenceServiceList", + "items": [], + }) + + raise ValueError(f"Unsupported list: {apiVersion}/{kind}") + + +@mcp.tool() +def pods_list( + namespace: str, + labelSelector: str = "", +) -> str: + """List pods in a namespace with optional label selector.""" + 
pods = PODS_BY_NAMESPACE.get(namespace, []) + + if labelSelector: + key, _, value = labelSelector.partition("=") + pods = [p for p in pods if p.get("labels", {}).get(key) == value] + + results = [] + for pod in pods: + results.append({ + "name": pod["name"], + "namespace": pod["namespace"], + "status": pod["status"], + "restarts": pod.get("restarts", 0), + "node": pod.get("node", ""), + "containers": pod.get("containers", []), + "gpu": pod.get("gpu", "0"), + }) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def pods_log( + namespace: str, + name: str, + container: str = "", +) -> str: + """Get logs from a pod container.""" + logs = POD_LOGS.get(name) + if logs is None: + raise ValueError(f"Pod '{name}' not found in namespace '{namespace}'") + return logs + + +@mcp.tool() +def events_list(namespace: str) -> str: + """List events in a namespace.""" + events = EVENTS_BY_NAMESPACE.get(namespace, []) + return json.dumps(events, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-rhoai-mcp.py b/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-rhoai-mcp.py new file mode 100644 index 00000000..12513127 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/environment/mcp-servers/mock-rhoai-mcp.py @@ -0,0 +1,866 @@ +#!/usr/bin/env python3 +"""Mock RHOAI MCP server for SkillsBench rh-ai-engineer task. + +Simulates Red Hat OpenShift AI operations: Data Science Projects, +model serving, data connections, serving runtimes, inference services. 
+ +Scenario: +- ml-production: existing project with two broken deployments + - text-gen-legacy: vLLM OOMKilled (max-model-len=32768 on A10G) + - nim-llama-prod: NIM failing (Account CR not ready, NGC creds invalid) +- fraud-detection: does not exist yet (agent creates it) +""" + +import json +from fastmcp import FastMCP + +mcp = FastMCP("rhoai") + +# ── In-memory state ────────────────────────────────────────────────────── + +PROJECTS = { + "ml-production": { + "name": "ml-production", + "display_name": "ML Production", + "description": "Production ML workloads", + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": "single", + "pipeline_server": True, + }, +} + +DATA_CONNECTIONS = { + "ml-production": [ + { + "name": "prod-model-store", + "type": "S3", + "bucket": "ml-models-prod", + "endpoint": "https://s3.us-east-1.amazonaws.com", + "region": "us-east-1", + }, + ], +} + +SERVING_RUNTIMES = { + "__platform_templates__": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "REST", + "supported_model_formats": [ + {"name": "vLLM", "version": "1", "autoSelect": True} + ], + }, + { + "name": "caikit-tgis-runtime", + "display_name": "Caikit+TGIS ServingRuntime", + "model_formats": ["caikit"], + "requires_instantiation": True, + "source": "platform-template", + "api_protocol": "gRPC", + }, + ], + "ml-production": [ + { + "name": "vllm-runtime", + "display_name": "vLLM ServingRuntime for KServe", + "model_formats": ["vLLM"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + { + "name": "nim-serving-runtime", + "display_name": "NVIDIA NIM ServingRuntime", + "model_formats": ["NIM"], + "requires_instantiation": False, + "source": "nim-account", + "api_protocol": "REST", + }, + { + "name": "ovms-1", + "display_name": "OpenVINO Model Server", + "model_formats": 
["openvino_ir", "onnx"], + "requires_instantiation": False, + "source": "existing", + "api_protocol": "REST", + }, + ], +} + +INFERENCE_SERVICES = { + "ml-production": { + "text-gen-legacy": { + "name": "text-gen-legacy", + "namespace": "ml-production", + "runtime": "vllm-runtime", + "model_format": "vLLM", + "storage_uri": "hf://mistralai/Mistral-7B-Instruct-v0.3", + "display_name": "Mistral 7B Legacy", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "16Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "PredictorFailed", + "message": "Predictor pod is not ready", + }, + { + "type": "PredictorReady", + "status": "False", + "reason": "ContainerCrashLoop", + "message": "Container kserve-container terminated: " + "OOMKilled (exit code 137). 5 restarts.", + }, + { + "type": "IngressReady", + "status": "True", + "reason": "IngressReady", + "message": "Ingress is ready", + }, + ], + "age": "3d", + }, + "nim-llama-prod": { + "name": "nim-llama-prod", + "namespace": "ml-production", + "runtime": "nim-serving-runtime", + "model_format": "NIM", + "storage_uri": "nim://meta/llama-3.1-8b-instruct", + "display_name": "Llama 3.1 8B (NIM)", + "gpu_count": 1, + "cpu_request": "4", + "memory_request": "16Gi", + "memory_limit": "32Gi", + "min_replicas": 1, + "max_replicas": 1, + "ready": False, + "url": "", + "conditions": [ + { + "type": "Ready", + "status": "False", + "reason": "RuntimeNotReady", + "message": "ServingRuntime 'nim-serving-runtime' " + "is not in ready state", + }, + { + "type": "PredictorReady", + "status": "Unknown", + "reason": "PodNotCreated", + "message": "Predictor pod has not been created. 
" + "Waiting for ServingRuntime to become ready.", + }, + { + "type": "IngressReady", + "status": "Unknown", + "reason": "PredictorNotReady", + "message": "Waiting for predictor to become ready", + }, + ], + "age": "1d", + }, + }, +} + +DEPLOYED_MODELS = {} + +WORKBENCHES = { + "ml-production": [ + { + "name": "data-exploration-nb", + "display_name": "Data Exploration", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Running", + "cpu_request": "1", + "memory_request": "8Gi", + "gpu_count": 0, + "pvc_name": "data-exploration-nb-pvc", + "pvc_size": "20Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-10T09:00:00Z", + }, + { + "name": "model-training-nb", + "display_name": "Model Training", + "image": "jupyter-pytorch-ubi9-python-3.9-2024.1", + "status": "Stopped", + "cpu_request": "4", + "memory_request": "16Gi", + "gpu_count": 1, + "pvc_name": "model-training-nb-pvc", + "pvc_size": "50Gi", + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-02-15T14:00:00Z", + }, + ], +} + +PIPELINE_SERVERS = { + "ml-production": { + "configured": True, + "data_connection": "prod-model-store", + "status": "Ready", + "database": "MariaDB", + }, +} + +NOTEBOOK_IMAGES = [ + {"name": "jupyter-pytorch-ubi9-python-3.9-2024.1", "display_name": "PyTorch 2024.1", "packages": ["torch", "transformers"]}, + {"name": "jupyter-tensorflow-ubi9-python-3.9-2024.1", "display_name": "TensorFlow 2024.1", "packages": ["tensorflow"]}, + {"name": "jupyter-datascience-ubi9-python-3.9-2024.1", "display_name": "Standard Data Science", "packages": ["pandas", "scikit-learn"]}, + {"name": "jupyter-minimal-ubi9-python-3.9-2024.1", "display_name": "Minimal Python", "packages": []}, +] + + +# ── Project tools ──────────────────────────────────────────────────────── + + +@mcp.tool() +def list_data_science_projects() -> str: + """List all RHOAI Data Science Projects on the cluster.""" + projects = [] + for name, proj in PROJECTS.items(): + projects.append({ + "name": name, + 
"display_name": proj["display_name"], + "description": proj.get("description", ""), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + }) + return json.dumps(projects, indent=2) + + +@mcp.tool() +def create_data_science_project( + name: str, + display_name: str, + description: str = "", +) -> str: + """Create a new RHOAI Data Science Project (namespace with dashboard labels).""" + if name in PROJECTS: + raise ValueError( + f"Project '{name}' already exists. Choose a different name " + "or configure the existing project." + ) + if not name.replace("-", "").replace("_", "").isalnum() or len(name) > 63: + raise ValueError( + f"Invalid project name '{name}'. Must be DNS-compatible: " + "lowercase alphanumeric and hyphens, max 63 chars." + ) + + PROJECTS[name] = { + "name": name, + "display_name": display_name, + "description": description, + "labels": {"opendatahub.io/dashboard": "true"}, + "model_serving_mode": None, + "pipeline_server": False, + } + DATA_CONNECTIONS[name] = [] + SERVING_RUNTIMES[name] = [] + INFERENCE_SERVICES[name] = {} + + return json.dumps({ + "status": "created", + "name": name, + "display_name": display_name, + "namespace": name, + "labels": {"opendatahub.io/dashboard": "true"}, + }) + + +@mcp.tool() +def get_project_details(name: str) -> str: + """Get detailed information about an RHOAI Data Science Project.""" + if name not in PROJECTS: + raise ValueError(f"Project '{name}' not found") + proj = PROJECTS[name] + dc_count = len(DATA_CONNECTIONS.get(name, [])) + isvc_count = len(INFERENCE_SERVICES.get(name, {})) + return json.dumps({ + "name": proj["name"], + "display_name": proj["display_name"], + "description": proj.get("description", ""), + "labels": proj["labels"], + "data_connections": dc_count, + "inference_services": isvc_count, + "model_serving_mode": proj.get("model_serving_mode"), + "pipeline_server": proj.get("pipeline_server", False), + }) + + +@mcp.tool() +def get_project_status(namespace: str) -> str: + 
"""Get comprehensive status of an RHOAI Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Project '{namespace}' not found") + proj = PROJECTS[namespace] + dcs = DATA_CONNECTIONS.get(namespace, []) + isvcs = INFERENCE_SERVICES.get(namespace, {}) + return json.dumps({ + "namespace": namespace, + "display_name": proj["display_name"], + "status": "Active", + "components": { + "data_connections": len(dcs), + "inference_services": len(isvcs), + "model_serving_mode": proj.get("model_serving_mode", "not configured"), + "pipeline_server": "configured" if proj.get("pipeline_server") else "not configured", + }, + }) + + +# ── Data connection tools ──────────────────────────────────────────────── + + +@mcp.tool() +def create_s3_data_connection( + namespace: str, + name: str, + bucket: str, + endpoint: str, + access_key: str, + secret_key: str, + region: str = "", +) -> str: + """Create an S3-compatible data connection in an RHOAI project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + existing = DATA_CONNECTIONS.get(namespace, []) + if any(dc["name"] == name for dc in existing): + raise ValueError( + f"Data connection '{name}' already exists in namespace '{namespace}'" + ) + + dc = { + "name": name, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + "region": region, + } + DATA_CONNECTIONS.setdefault(namespace, []).append(dc) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "type": "S3", + "bucket": bucket, + "endpoint": endpoint, + }) + + +@mcp.tool() +def list_data_connections(namespace: str) -> str: + """List data connections in an RHOAI project namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + return json.dumps(dcs, indent=2) + + +# ── Model serving tools ───────────────────────────────────────────────── + + 
+@mcp.tool() +def set_model_serving_mode(namespace: str, mode: str) -> str: + """Enable model serving on a Data Science Project (single or multi mode).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + if mode not in ("single", "multi"): + raise ValueError(f"Invalid mode '{mode}'. Must be 'single' or 'multi'.") + + PROJECTS[namespace]["model_serving_mode"] = mode + + if not SERVING_RUNTIMES.get(namespace): + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + SERVING_RUNTIMES[namespace] = [ + {**t, "requires_instantiation": False, "source": "existing"} + for t in templates + ] + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "mode": mode, + }) + + +@mcp.tool() +def list_serving_runtimes( + namespace: str, + include_templates: bool = False, +) -> str: + """List available ServingRuntimes in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + runtimes = list(SERVING_RUNTIMES.get(namespace, [])) + if include_templates: + templates = SERVING_RUNTIMES.get("__platform_templates__", []) + existing_names = {r["name"] for r in runtimes} + for t in templates: + if t["name"] not in existing_names: + runtimes.append(t) + + return json.dumps(runtimes, indent=2) + + +# ── Inference service tools ────────────────────────────────────────────── + + +@mcp.tool() +def deploy_model( + name: str, + namespace: str, + runtime: str, + model_format: str, + storage_uri: str, + display_name: str = "", + min_replicas: int = 1, + max_replicas: int = 1, + cpu_request: str = "1", + cpu_limit: str = "2", + memory_request: str = "4Gi", + memory_limit: str = "8Gi", + gpu_count: int = 0, +) -> str: + """Deploy an AI/ML model as a KServe InferenceService.""" + if namespace not in PROJECTS: + raise ValueError( + f"Namespace '{namespace}' is not a Data Science Project. " + "Create one via create_data_science_project first." 
+ ) + + ns_runtimes = SERVING_RUNTIMES.get(namespace, []) + runtime_names = [r["name"] for r in ns_runtimes] + if runtime not in runtime_names: + available = ", ".join(runtime_names) or "none" + raise ValueError( + f"ServingRuntime '{runtime}' not found in namespace '{namespace}'. " + f"Available runtimes: {available}" + ) + + endpoint = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + isvc = { + "name": name, + "namespace": namespace, + "runtime": runtime, + "model_format": model_format, + "storage_uri": storage_uri, + "display_name": display_name or name, + "gpu_count": gpu_count, + "cpu_request": cpu_request, + "memory_request": memory_request, + "min_replicas": min_replicas, + "max_replicas": max_replicas, + "ready": True, + "url": endpoint, + "conditions": [ + {"type": "Ready", "status": "True", "reason": "Ready", "message": ""}, + {"type": "PredictorReady", "status": "True", "reason": "PodReady", "message": ""}, + {"type": "IngressReady", "status": "True", "reason": "IngressReady", "message": ""}, + ], + "age": "0s", + } + + INFERENCE_SERVICES.setdefault(namespace, {})[name] = isvc + DEPLOYED_MODELS[f"{namespace}/{name}"] = isvc + + return json.dumps({ + "status": "deployed", + "name": name, + "namespace": namespace, + "runtime": runtime, + "endpoint": endpoint, + "ready": True, + }) + + +@mcp.tool() +def list_inference_services( + namespace: str, + verbosity: str = "standard", +) -> str: + """List deployed InferenceServices in a namespace.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + isvcs = INFERENCE_SERVICES.get(namespace, {}) + results = [] + for isvc_name, isvc in isvcs.items(): + entry = { + "name": isvc["name"], + "runtime": isvc["runtime"], + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "age": isvc.get("age", ""), + } + if verbosity == "full": + entry["conditions"] = isvc.get("conditions", []) + entry["storage_uri"] = isvc.get("storage_uri", "") + 
entry["gpu_count"] = isvc.get("gpu_count", 0) + results.append(entry) + + return json.dumps(results, indent=2) + + +@mcp.tool() +def get_inference_service( + name: str, + namespace: str, + verbosity: str = "standard", +) -> str: + """Get detailed status of a specific InferenceService.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + + isvc = isvcs[name] + result = { + "name": isvc["name"], + "namespace": isvc["namespace"], + "runtime": isvc["runtime"], + "model_format": isvc.get("model_format", ""), + "storage_uri": isvc.get("storage_uri", ""), + "ready": isvc["ready"], + "url": isvc.get("url", ""), + "conditions": isvc.get("conditions", []), + "gpu_count": isvc.get("gpu_count", 0), + "replicas": {"min": isvc.get("min_replicas", 1), "max": isvc.get("max_replicas", 1)}, + "resources": { + "cpu_request": isvc.get("cpu_request", "1"), + "memory_request": isvc.get("memory_request", "4Gi"), + "memory_limit": isvc.get("memory_limit", "8Gi"), + }, + "age": isvc.get("age", ""), + } + return json.dumps(result, indent=2) + + +@mcp.tool() +def get_model_endpoint(name: str, namespace: str) -> str: + """Get the inference endpoint URL for a deployed model.""" + isvcs = INFERENCE_SERVICES.get(namespace, {}) + if name not in isvcs: + raise ValueError( + f"InferenceService '{name}' not found in namespace '{namespace}'" + ) + isvc = isvcs[name] + if not isvc["ready"]: + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": "", + "error": "InferenceService is not ready. 
Check conditions for details.", + }) + return json.dumps({ + "name": name, + "namespace": namespace, + "endpoint": isvc["url"], + }) + + +# ── Workbench tools ────────────────────────────────────────────────────── + + +@mcp.tool() +def list_workbenches(namespace: str) -> str: + """List workbenches (Jupyter notebooks) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + return json.dumps(wbs, indent=2) + + +@mcp.tool() +def create_workbench( + namespace: str, + name: str, + display_name: str = "", + image: str = "jupyter-datascience-ubi9-python-3.9-2024.1", + cpu_request: str = "1", + memory_request: str = "4Gi", + gpu_count: int = 0, + pvc_size: str = "20Gi", +) -> str: + """Create a new workbench (Jupyter notebook) in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + valid_images = [img["name"] for img in NOTEBOOK_IMAGES] + if image not in valid_images: + raise ValueError( + f"Image '{image}' not found. 
Available: {', '.join(valid_images)}" + ) + + wb = { + "name": name, + "display_name": display_name or name, + "image": image, + "status": "Running", + "cpu_request": cpu_request, + "memory_request": memory_request, + "gpu_count": gpu_count, + "pvc_name": f"{name}-pvc", + "pvc_size": pvc_size, + "pvc_access_mode": "ReadWriteOnce", + "creation": "2026-03-02T12:00:00Z", + } + WORKBENCHES.setdefault(namespace, []).append(wb) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + "image": image, + "pvc": f"{name}-pvc", + }) + + +@mcp.tool() +def stop_workbench(namespace: str, name: str) -> str: + """Stop a running workbench (preserves data).""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Stopped" + return json.dumps({"status": "stopped", "name": name, "namespace": namespace}) + + +@mcp.tool() +def start_workbench(namespace: str, name: str) -> str: + """Start a stopped workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wb["status"] = "Running" + return json.dumps({"status": "running", "name": name, "namespace": namespace}) + + +@mcp.tool() +def get_workbench_url(namespace: str, name: str) -> str: + """Get the URL for accessing a running workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") 
+ if wb["status"] != "Running": + return json.dumps({ + "namespace": namespace, + "name": name, + "url": "", + "error": f"Workbench is not running (status: {wb['status']}). Start it first.", + }) + url = f"https://{name}-{namespace}.apps.ocp-cluster.example.com" + return json.dumps({ + "namespace": namespace, + "name": name, + "url": url, + "status": wb["status"], + }) + + +@mcp.tool() +def list_workbench_storage(namespace: str, name: str) -> str: + """List PVC details for a workbench including size, usage, access mode.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + volumes = [ + { + "pvc_name": wb.get("pvc_name", f"{name}-pvc"), + "size": wb.get("pvc_size", "20Gi"), + "usage": "12Gi", # Mock usage + "access_mode": wb.get("pvc_access_mode", "ReadWriteOnce"), + "mount_path": "/opt/app-root/data", + }, + ] + # Include additional volumes if any + for extra in wb.get("extra_volumes", []): + volumes.append(extra) + return json.dumps({ + "namespace": namespace, + "workbench": name, + "volumes": volumes, + }, indent=2) + + +@mcp.tool() +def add_workbench_storage( + namespace: str, + workbench_name: str, + pvc_name: str, + mount_path: str, + size: str, +) -> str: + """Add additional storage volume to a workbench.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == workbench_name), None) + if not wb: + raise ValueError(f"Workbench '{workbench_name}' not found in '{namespace}'") + extra = wb.setdefault("extra_volumes", []) + extra.append({ + "pvc_name": pvc_name, + "size": size, + "usage": "0", + "access_mode": "ReadWriteOnce", + "mount_path": mount_path, + }) + return json.dumps({ + "status": 
"added", + "namespace": namespace, + "workbench": workbench_name, + "pvc_name": pvc_name, + "mount_path": mount_path, + "size": size, + }) + + +@mcp.tool() +def delete_workbench(namespace: str, name: str) -> str: + """Delete a workbench. WARNING: PVC data may be lost if not backed up.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + wbs = WORKBENCHES.get(namespace, []) + wb = next((w for w in wbs if w["name"] == name), None) + if not wb: + raise ValueError(f"Workbench '{name}' not found in '{namespace}'") + wbs.remove(wb) + return json.dumps({ + "status": "deleted", + "name": name, + "namespace": namespace, + "warning": "Associated PVC data has been deleted", + }) + + +@mcp.tool() +def list_notebook_images() -> str: + """List available notebook images for workbench creation.""" + return json.dumps(NOTEBOOK_IMAGES, indent=2) + + +# ── Pipeline server tools ─────────────────────────────────────────────── + + +@mcp.tool() +def configure_pipeline_server( + namespace: str, + data_connection: str, + database: str = "MariaDB", +) -> str: + """Configure a pipeline server for a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + dcs = DATA_CONNECTIONS.get(namespace, []) + if not any(dc["name"] == data_connection for dc in dcs): + available = [dc["name"] for dc in dcs] + raise ValueError( + f"Data connection '{data_connection}' not found. 
Available: {available}" + ) + + PIPELINE_SERVERS[namespace] = { + "configured": True, + "data_connection": data_connection, + "status": "Ready", + "database": database, + } + PROJECTS[namespace]["pipeline_server"] = True + + return json.dumps({ + "status": "configured", + "namespace": namespace, + "data_connection": data_connection, + "database": database, + }) + + +@mcp.tool() +def get_pipeline_server_status(namespace: str) -> str: + """Get the status of the pipeline server in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + ps = PIPELINE_SERVERS.get(namespace) + if not ps: + return json.dumps({"namespace": namespace, "configured": False}) + return json.dumps({ + "namespace": namespace, + "configured": ps["configured"], + "data_connection": ps["data_connection"], + "status": ps["status"], + "database": ps["database"], + }) + + +# ── Serving runtime creation ──────────────────────────────────────────── + + +@mcp.tool() +def create_serving_runtime( + namespace: str, + name: str, + display_name: str = "", + model_formats: list = None, + container_image: str = "", + container_port: int = 8080, + multi_model: bool = False, + api_protocol: str = "REST", +) -> str: + """Create a custom ServingRuntime in a Data Science Project.""" + if namespace not in PROJECTS: + raise ValueError(f"Namespace '{namespace}' is not a Data Science Project") + + if not model_formats: + raise ValueError("model_formats must specify at least one model format") + + runtime = { + "name": name, + "display_name": display_name or name, + "model_formats": model_formats, + "requires_instantiation": False, + "source": "custom", + "api_protocol": api_protocol, + "container_image": container_image, + "container_port": container_port, + "multi_model": multi_model, + } + SERVING_RUNTIMES.setdefault(namespace, []).append(runtime) + + return json.dumps({ + "status": "created", + "name": name, + "namespace": namespace, + 
"model_formats": model_formats, + }) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-ai-engineer__workbench-manage/instruction.md b/evaluation/without_skills/rh-ai-engineer__workbench-manage/instruction.md new file mode 100644 index 00000000..39b97c27 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/instruction.md @@ -0,0 +1,13 @@ +# Workbench Management Task + +You are an AI engineer on Red Hat OpenShift AI. Your data science team needs workbenches set up for model development, and some existing workbenches need cleanup. + +## Requirements +- Review existing workbenches in the project: their status, resource usage, and notebook images +- Plan a new workbench for a data scientist who needs PyTorch with 4 CPUs, 16Gi memory, and 50Gi persistent storage +- Identify any stopped or unused workbenches that should be cleaned up to free resources +- Document the lifecycle procedures: how to stop a workbench to save resources, restart it, and safely delete one + +Document your workbench assessment, creation plan, and cleanup recommendations in `/root/report.md`. + +Use MCP tools to interact with the platform. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-ai-engineer__workbench-manage/solution/solve.sh b/evaluation/without_skills/rh-ai-engineer__workbench-manage/solution/solve.sh new file mode 100644 index 00000000..49e5cc92 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/solution/solve.sh @@ -0,0 +1,25 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Workbench Creation Plan + +## Workbench: fraud-analysis +Project/Namespace: fraud-detection + +### Storage (create_storage) +- PVC: 20Gi, access mode: ReadWriteOnce +- Namespace validated via list_data_science_projects + +### Configuration (create_workbench) +- Image: Jupyter Data Science Notebook (from list_notebook_images) +- CPU: 2 +- Memory: 8Gi +- Storage: 20Gi + +### Lifecycle +- start_workbench / stop_workbench for running/stopped state +- get_workbench_url: OAuth-protected notebook URL for access + +### Delete Warnings +- delete_workbench: Data loss warning — unsaved work lost, action cannot be undone +- delete_storage: Separate confirmation for PVC deletion — permanent data loss +REPORT_EOF diff --git a/evaluation/without_skills/rh-ai-engineer__workbench-manage/task.toml b/evaluation/without_skills/rh-ai-engineer__workbench-manage/task.toml new file mode 100644 index 00000000..6c538b09 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-ai-engineer__workbench-manage" +name = "rh-ai-engineer Workbench Management Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-ai-engineer", "workbench-manage", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git 
a/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/llm_judge.py b/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/llm_judge.py new file mode 100644 index 00000000..b7792ec1 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "stop_vs_delete_data", "file": "/root/report.md", "question": "Does the report explain that stopping a workbench preserves PVC data while deleting requires separate storage decision?", "reference": "A skilled report distinguishes stop (preserves) from delete (data loss risk). An unskilled report treats stop and delete equivalently."}, + {"id": "notebook_image_discovery", "file": "/root/report.md", "question": "Does the report describe discovering or listing available notebook images before creating a workbench?", "reference": "A skilled report lists available notebook images (via list_notebook_images or equivalent) to guide workbench creation. An unskilled report skips image discovery and assumes a default."}, + {"id": "storage_access_mode_awareness", "file": "/root/report.md", "question": "Does the report mention the PVC access mode (ReadWriteOnce or RWO) when describing workbench storage configuration or provisioning?", "reference": "A skilled report specifies the storage access mode (ReadWriteOnce) for PVC provisioning, showing awareness of storage class constraints. An unskilled report describes storage size but omits access mode details."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/test.sh b/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + 
-v 2>&1 + +pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/test_outputs.py b/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/test_outputs.py new file mode 100644 index 00000000..59f74eec --- /dev/null +++ b/evaluation/without_skills/rh-ai-engineer__workbench-manage/tests/test_outputs.py @@ -0,0 +1,73 @@ +""" +Tests for rh-ai-engineer__workbench-manage per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ["workbench", "notebook"]), ( + "report should mention workbench or notebook" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_stop_preserves_data(self): + """Skill teaches: stopping a workbench preserves PVC data; only delete removes it.""" + c = read_report().lower() + assert any(t in c for t in [ + "stop", "preserve", "data", "pvc", "storage", + "stopped", "restart", "start again", + ]), "should explain that stop preserves data vs delete" + + def test_delete_pvc_warning(self): + """Skill teaches: deleting workbench requires separate confirmation for PVC; warn about permanent data loss.""" + c = read_report().lower() + assert any(t in c for t in [ + "pvc", "delete", "data loss", "permanent", "warning", + "volume", "storage", "backup", "cannot be undone", + ]), "should warn about PVC/data loss on deletion" + + def test_lifecycle_operations(self): + """Skill teaches: 
create, start, stop, delete with distinct implications.""" + c = read_report().lower() + ops = sum(1 for t in ["start", "stop", "delet", "creat"] if t in c) + assert ops >= 2, "should describe lifecycle operations (create, start, stop, delete)" + + def test_list_notebook_images_tool(self): + """Skill teaches: list_notebook_images MCP tool to discover available notebook images.""" + c = read_report().lower() + assert any(t in c for t in ["list_notebook_images", "notebook images", "available images"]), ( + "should reference list_notebook_images tool (skill)" + ) + + def test_gpu_tuning_awareness(self): + """Docs teach GPU scheduling triage and OOM mitigation using + model/context-size controls for workbenches with GPU resources. + Without docs, agents don't address GPU resource tuning.""" + c = read_report().lower() + assert any(t in c for t in [ + "gpu", "oom", "context size", "max-model-len", "memory", + ]) and any(t in c for t in ["workbench", "notebook", "resource", "gpu"]), ( + "should address GPU/OOM tuning for workbench resources" + ) diff --git a/evaluation/without_skills/rh-developer__containerize-deploy/environment/Dockerfile b/evaluation/without_skills/rh-developer__containerize-deploy/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__containerize-deploy/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: 
ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__containerize-deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__containerize-deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__containerize-deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. 
order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed at integration-test, logs in step-test container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, +
"ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": 
"256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with
APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", 
+ "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application 
source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + 
{ + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__containerize-deploy/instruction.md b/evaluation/without_skills/rh-developer__containerize-deploy/instruction.md new file mode 100644 index 00000000..42797f3c --- /dev/null +++ b/evaluation/without_skills/rh-developer__containerize-deploy/instruction.md @@ -0,0 +1,15 @@ +# Containerization and Deployment Task + +You are a Red Hat developer. Your team has a Python web application that needs to be containerized and deployed to OpenShift. You need to evaluate the available approaches and recommend the best one. + +## Requirements +- Examine the application source and determine its language, dependencies, and build requirements +- Compare containerization strategies (e.g., S2I, Dockerfile, Helm chart) and explain the trade-offs of each for this application +- If a multi-stage Dockerfile approach is viable, include a working example showing build and runtime stages +- Recommend the best approach with a clear justification +- Define the deployment configuration including: resource requests/limits, all three probe types (startup, liveness, readiness), autoscaling (HPA), and how external traffic will reach the application +- Address application-specific concerns like database connection pooling configuration + +Document your strategy evaluation, recommendation, and deployment plan in `/root/report.md`. + +Use MCP tools to examine the environment. If reference documentation or skills are available in this environment, consult them before beginning work. 
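The order-system scenario in `mock-openshift-mcp.py` depends on a Service whose selector (`app=order-svc`) matches no pod label (`app=order-service`), which leaves the Route with no endpoints and produces the 503. A minimal sketch of that check follows, assuming the dict shapes used by `SERVICES` and `PODS` in the mock. The helper names `selector_matches` and `find_selector_mismatches` are hypothetical and are not part of this patch or of any OpenShift API.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """A Service selects a pod only if every selector key/value appears in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def find_selector_mismatches(services: List[dict], pods: List[dict]) -> List[str]:
    """Return the names of services whose selector matches no pod in the namespace."""
    return [
        svc["name"]
        for svc in services
        if not any(selector_matches(svc["selector"], pod["labels"]) for pod in pods)
    ]


# Trimmed copies of the order-system entries from the mock data (assumed shapes).
services = [{"name": "order-service", "selector": {"app": "order-svc"}}]
pods = [{
    "name": "order-service-5a4b3c2d1-h7j6k",
    "labels": {"app": "order-service", "deployment": "order-service"},
}]

print(find_selector_mismatches(services, pods))  # prints ['order-service']
```

Changing the selector to `{"app": "order-service"}` makes the list come back empty, which corresponds to the fix for this scenario. The sketch does not treat an empty selector specially; a real check would need to decide how to handle Services without selectors.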
diff --git a/evaluation/without_skills/rh-developer__containerize-deploy/solution/solve.sh b/evaluation/without_skills/rh-developer__containerize-deploy/solution/solve.sh new file mode 100644 index 00000000..713efa82 --- /dev/null +++ b/evaluation/without_skills/rh-developer__containerize-deploy/solution/solve.sh @@ -0,0 +1,23 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Containerize and Deploy Plan + +## Phase 1: Detect +- Language: Python +- Framework: Django +- Entry point: manage.py + +## Phase 2: Strategy +- Target: OpenShift +- Strategy: S2I (recommended for Python on OpenShift) +- Alternative: Dockerfile with multi-stage build + +## Phase 3: Build +- Builder image: ubi9/python-311 +- APP_MODULE: myproject.wsgi:application + +## Phase 4: Deploy +- Deployment + Service + Route +- Port: 8000 (Django default) +- On failure: /debug-pod, /debug-build, /debug-network +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__containerize-deploy/task.toml b/evaluation/without_skills/rh-developer__containerize-deploy/task.toml new file mode 100644 index 00000000..9022cd22 --- /dev/null +++ b/evaluation/without_skills/rh-developer__containerize-deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__containerize-deploy" +name = "rh-developer End-to-End Containerize and Deploy Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "containerize-deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__containerize-deploy/tests/llm_judge.py b/evaluation/without_skills/rh-developer__containerize-deploy/tests/llm_judge.py new file mode 100644 index 
00000000..0dc24c7f --- /dev/null +++ b/evaluation/without_skills/rh-developer__containerize-deploy/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "multistage_and_startup_probe", + "file": "/root/report.md", + "question": "Does the report include both a multi-stage Dockerfile example (with COPY --from=builder or AS builder) AND a startup probe configuration?", + "reference": "A skilled report includes a multi-stage Dockerfile showing build and runtime stages with COPY --from=builder, and configures a startupProbe in addition to liveness/readiness probes. An unskilled report provides only a single-stage Dockerfile and only liveness/readiness probes without startup probe." + }, + { + "id": "hpa_and_pool_config", + "file": "/root/report.md", + "question": "Does the report include a HorizontalPodAutoscaler manifest (with autoscaling/v2 API) AND database connection pool configuration (SQLALCHEMY_POOL or equivalent)?", + "reference": "A skilled report includes a complete HPA YAML with kind: HorizontalPodAutoscaler and autoscaling/v2 API, plus SQLAlchemy connection pool settings (pool_size, pool_recycle). An unskilled report mentions autoscaling conceptually without the manifest, and skips connection pool configuration." + }, + { + "id": "strategy_comparison_depth", + "file": "/root/report.md", + "question": "Does the report compare at least 3 containerization strategies (S2I, Dockerfile, Helm) with specific trade-offs and a justified recommendation?", + "reference": "A skilled report provides a detailed comparison table of S2I, Dockerfile, and Helm with pros/cons/trade-offs for each, leading to a justified recommendation. An unskilled report may compare strategies superficially without detailed trade-offs." 
+ } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + 
print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__containerize-deploy/tests/test.sh b/evaluation/without_skills/rh-developer__containerize-deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__containerize-deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo
"════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward 
(pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-developer__containerize-deploy/tests/test_outputs.py b/evaluation/without_skills/rh-developer__containerize-deploy/tests/test_outputs.py new file mode 100644 index 00000000..5f7eec38 --- /dev/null +++ b/evaluation/without_skills/rh-developer__containerize-deploy/tests/test_outputs.py @@ -0,0 +1,110 @@ +""" +Tests for rh-developer__containerize-deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_containerization(self): + content = read_report().lower() + assert any(t in content for t in ["container", "deploy", "image"]), ( + "report should mention containerization or deployment" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_startup_probe(self): + """Skill docs teach startup probe in addition to liveness/readiness. + Without skill, agents typically only include liveness and readiness probes.""" + c = read_report() + assert "startupProbe" in c or "startup probe" in c.lower() or "startupprobe" in c.lower(), ( + "should include startup probe configuration (startupProbe YAML key)" + ) + + def test_multistage_dockerfile_example(self): + """Skill docs teach multi-stage Dockerfile with COPY --from=builder pattern. 
+ Without skill, agents mention multi-stage conceptually but don't provide the example.""" + c = read_report() + assert "COPY --from=" in c or "AS builder" in c or "copy --from=" in c.lower(), ( + "should include a multi-stage Dockerfile example with COPY --from= or AS builder syntax" + ) + + def test_hpa_autoscaling_config(self): + """Skill docs teach complete HPA configuration with autoscaling API. + Without skill, agents mention autoscaling conceptually but skip the manifest.""" + c = read_report() + assert "HorizontalPodAutoscaler" in c or "autoscaling/v2" in c, ( + "should include HorizontalPodAutoscaler manifest or autoscaling/v2 API reference" + ) + + def test_connection_pool_config(self): + """Skill docs teach application-specific database connection pooling with + SQLAlchemy settings. Without skill, agents skip pool configuration details.""" + c = read_report() + assert any(t in c for t in [ + "SQLALCHEMY_POOL", "pool_size", "POOL_SIZE", + "pool_recycle", "POOL_RECYCLE", + ]), "should include SQLAlchemy connection pool settings (pool_size, pool_recycle)" + + def test_strategy_comparison(self): + """Skill teaches comparing at least 2 containerization strategies with trade-offs.""" + c = read_report().lower() + strategies = ["s2i", "dockerfile", "helm", "podman", "source-to-image"] + mentioned = sum(1 for s in strategies if s in c) + assert mentioned >= 2, "should compare at least 2 containerization strategies" + + def test_session_affinity_config(self): + """Skill docs teach explicit sessionAffinity configuration in Service spec. + Without skill, agents skip this detail in the Service definition.""" + c = read_report().lower() + assert "sessionaffinity" in c or "session affinity" in c, ( + "should specify sessionAffinity in Service configuration" + ) + + def test_app_module_s2i_entrypoint(self): + """Skill teaches APP_MODULE environment variable for S2I Python startup + (e.g., app:app). 
Without skill, agents don't know this S2I-specific + configuration for WSGI entry point discovery.""" + c = read_report() + assert "APP_MODULE" in c or "app:app" in c or "APP_SCRIPT" in c, ( + "should reference APP_MODULE or app:app S2I entrypoint configuration" + ) + + def test_gunicorn_worker_formula(self): + """Skill teaches Gunicorn worker count formula: (2 × CPU cores) + 1. + Without skill, agents hardcode worker count without the sizing formula.""" + c = read_report() + assert any(t in c for t in [ + "2 * cores", "2 × CPU", "(2 * cores) + 1", "2 × cores", + "2*cores", "2 * cpu", "2x CPU", "2 x cores", + ]) or ("worker" in c.lower() and ("formula" in c.lower() or "cores" in c.lower())), ( + "should include Gunicorn worker count formula based on CPU cores" + ) + + def test_sqlalchemy_engine_options(self): + """Skill teaches SQLALCHEMY_ENGINE_OPTIONS configuration for advanced + pool tuning. Without skill, agents configure individual pool parameters + but miss the unified engine options dict.""" + c = read_report() + assert "SQLALCHEMY_ENGINE_OPTIONS" in c or "engine_options" in c, ( + "should include SQLALCHEMY_ENGINE_OPTIONS for advanced pool configuration" + ) diff --git a/evaluation/without_skills/rh-developer__debug-build/environment/Dockerfile b/evaluation/without_skills/rh-developer__debug-build/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-build/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: 
https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__debug-build/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__debug-build/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..5f7e49b1 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-build/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,755 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. 
order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + 
"ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": 
"256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with
APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + { + "name": "api-service-2", + "namespace": "api-platform", + "status": "Failed", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "1m48s", + "reason": "AssembleFailed", + "message": "Assemble script failed with exit code 1", + }, + ], + "web-frontend": [ + { + 
"name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "api-service-2": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting 
fastapi==0.110.0\n" + "Collecting uvicorn==0.27.1\n" + "Collecting pydantic==2.6.0\n" + "Collecting psycopg2==2.9.9\n" + " ERROR: Could not build wheels for psycopg2, which is required to install pyproject.toml-based projects\n" + " error: subprocess-exited-with-error\n" + " × Running setup.py install for psycopg2 did not run successfully.\n" + " │ exit code: 1\n" + " ╰─> [25 lines of output]\n" + " Error: pg_config executable not found.\n" + " pg_config is required to build psycopg2 from source.\n" + " Please add the directory containing pg_config to the $PATH\n" + " or specify the full executable path with the option:\n" + " python setup.py build_ext --pg-config /path/to/pg_config\n" + " note: This error originates from a subprocess, and is likely not a problem with pip.\n" + "error: legacy-install-failure\n" + "---> Assemble script FAILED with exit code 1\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to 
image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 
8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__debug-build/instruction.md b/evaluation/without_skills/rh-developer__debug-build/instruction.md new file mode 100644 index 00000000..2cfea7f9 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-build/instruction.md @@ -0,0 +1,14 @@ +# Build Debugging Task + +You are a Red Hat developer. An OpenShift Source-to-Image (S2I) build is failing. Investigate the build process to identify and fix the issue. + +## Requirements +- Examine the build configuration and logs +- Identify which S2I build phase is failing (fetch, pull, assemble, commit, push) +- If the fix involves S2I customization, explain how S2I assemble scripts can be extended or overridden +- Provide multiple fix options with concrete commands or file changes, using the appropriate package manager for UBI-based builder images +- Recommend a fix + +Use MCP tools to examine the cluster. Document your methodology, findings, and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__debug-build/solution/solve.sh b/evaluation/without_skills/rh-developer__debug-build/solution/solve.sh new file mode 100644 index 00000000..1e0579ec --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-build/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Build Debug Report + +## Build Failure Analysis + +### S2I Build Phases +1. Fetching source ✓ +2. Pulling builder image ✓ +3. **Assemble** ✗ (FAILED) +4. Commit (not reached) +5. 
Push (not reached) + +### Root Cause +Assemble phase failed — likely dependency installation error in pip install. + +### Fix +- Check requirements.txt for version conflicts (gunicorn, APP_MODULE) +- Verify builder image compatibility (python:3.11-ubi9) +- Retry: `oc start-build flask-app -n myproject --follow` +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__debug-build/task.toml b/evaluation/without_skills/rh-developer__debug-build/task.toml new file mode 100644 index 00000000..af5ff817 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-build/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-build" +name = "rh-developer Build Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-build", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__debug-build/tests/llm_judge.py b/evaluation/without_skills/rh-developer__debug-build/tests/llm_judge.py new file mode 100644 index 00000000..7bfd7911 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-build/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "s2i_custom_assemble", + "file": "/root/report.md", + "question": "Does the report mention .s2i/bin/assemble as a way to customize the S2I build process, and reference the default assemble script path at /usr/libexec/s2i/assemble?", + "reference": "A skilled report shows creating a .s2i/bin/assemble script 
that installs missing packages and then calls /usr/libexec/s2i/assemble (the default assemble script). An unskilled report recommends a custom Dockerfile or builder image instead of using S2I customization hooks." + }, + { + "id": "phase_diagnosis_and_remediation", + "file": "/root/report.md", + "question": "Does the report identify which S2I phase (fetch, assemble, commit, push) failed and provide concrete oc commands for remediation?", + "reference": "A skilled report breaks down the build into phases, identifies the failing phase, and provides actionable commands like 'oc start-build' to retry. An unskilled report gives a generic build failure description." + }, + { + "id": "systematic_build_analysis", + "file": "/root/report.md", + "question": "Does the report follow a systematic approach: inspecting the BuildConfig, analyzing build logs by phase, checking related resources (secrets, imagestreams), and providing structured findings with concrete remediation?", + "reference": "A skilled report follows a structured debugging workflow: BuildConfig analysis, phase-by-phase log analysis, related resource checks, and categorized findings with concrete remediation commands. An unskilled report gives ad-hoc observations without systematic investigation." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__debug-build/tests/test.sh b/evaluation/without_skills/rh-developer__debug-build/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-build/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + 
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-developer__debug-build/tests/test_outputs.py b/evaluation/without_skills/rh-developer__debug-build/tests/test_outputs.py new file mode 100644 index 00000000..c3ac3895 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-build/tests/test_outputs.py @@ -0,0 +1,77 @@ +""" +Tests for rh-developer__debug-build per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_build(self): + content = read_report().lower() + assert "build" in content, "report should mention builds" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_s2i_custom_assemble_script(self): + """Skill teaches creating .s2i/bin/assemble to extend the S2I build process. + Without skill, agents recommend Dockerfile or custom builder image instead.""" + c = read_report() + assert ".s2i/bin/assemble" in c or ".s2i/bin" in c, ( + "should mention .s2i/bin/assemble as a way to customize the S2I build" + ) + + def test_default_assemble_path(self): + """Skill teaches invoking the default S2I assemble script at /usr/libexec/s2i/assemble. 
+ Without skill, agents don't know the default script path.""" + c = read_report() + assert "/usr/libexec/s2i/" in c or "libexec/s2i" in c, ( + "should reference the default S2I assemble script at /usr/libexec/s2i/" + ) + + def test_package_manager_awareness(self): + """Report should mention package installation approach for the builder image.""" + c = read_report().lower() + assert any(t in c for t in ["microdnf", "dnf", "yum", "package manager", "install package"]), ( + "should mention package installation approach for the builder image" + ) + + def test_s2i_phase_breakdown(self): + """Skill teaches S2I phases (fetch, pull, assemble, commit, push).""" + c = read_report().lower() + phases = ["assemble", "fetch", "pull", "push", "commit"] + mentioned = sum(1 for p in phases if p in c) + assert mentioned >= 2, ( + "should identify S2I build phases (skill teaches phase-by-phase diagnosis)" + ) + + def test_concrete_remediation_command(self): + """Skill teaches providing concrete oc/command remediation.""" + c = read_report().lower() + assert any(t in c for t in ["oc ", "oc start-build", "oc create", "oc import", "retry"]) or ( + "```" in read_report() and ("oc" in c or "bash" in c) + ), "should include concrete remediation commands" + + def test_dependency_fix_suggestion(self): + """Report should suggest concrete dependency fixes for the failing build.""" + c = read_report().lower() + assert any(t in c for t in [ + "psycopg", "pip install", "requirements", "dependency", "package" + ]), "should suggest concrete dependency fixes for the failing build" diff --git a/evaluation/without_skills/rh-developer__debug-container/environment/Dockerfile b/evaluation/without_skills/rh-developer__debug-container/environment/Dockerfile new file mode 100644 index 00000000..257a1441 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get 
install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "podman": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-podman-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__debug-container/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__debug-container/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. 
api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": 
"500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + 
"status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call 
last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": 
"Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: 
Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": 
"api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__debug-container/environment/mcp-servers/mock-podman-mcp.py b/evaluation/without_skills/rh-developer__debug-container/environment/mcp-servers/mock-podman-mcp.py new file mode 100644 index 00000000..3d86ba08 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/environment/mcp-servers/mock-podman-mcp.py @@ -0,0 +1,396 @@ +#!/usr/bin/env python3 +"""Mock Podman MCP Server for container debugging evaluation. + +Simulates a local Podman environment with several containers, including +one that is crashing (OOMKilled) and one that has an entrypoint error. + +Scenario: + - myapp-web: Exited (137) - OOMKilled, memory limit 256m too low + - myapp-worker: Exited (1) - missing Python dependency 'celery' + - nginx-proxy: Running, healthy + - postgres-db: Running, healthy +""" + +import json +from typing import Optional + +from fastmcp import FastMCP + +mcp = FastMCP("podman") + +NOW = "2026-03-02T12:00:00Z" + +CONTAINERS = { + "a1b2c3d4e5f6": { + "Id": "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890", + "Names": ["myapp-web"], + "Image": "myapp:latest", + "ImageID": "sha256:abc123def456789012345678901234567890abcdef1234567890abcdef123456", + "Created": "2026-03-01T10:00:00Z", + "State": { + "Status": "exited", + "Running": False, + "Paused": False, + "Restarting": False, + "OOMKilled": True, + "Dead": False, + "Pid": 0, + "ExitCode": 137, + "Error": "", + "StartedAt": "2026-03-01T10:00:05Z", + "FinishedAt": "2026-03-02T08:45:12Z", + }, + "Config": { + "Entrypoint": ["python3"], + "Cmd": ["-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"], + "WorkingDir": 
"/app", + "User": "1001", + "Env": [ + "APP_ENV=production", + "DATABASE_URL=postgresql://db:5432/myapp", + "WORKERS=4", + "MAX_REQUESTS=1000", + ], + "ExposedPorts": {"8080/tcp": {}}, + }, + "HostConfig": { + "Memory": 268435456, + "MemorySwap": 268435456, + "CpuQuota": 100000, + "CpuPeriod": 100000, + "PortBindings": {"8080/tcp": [{"HostIp": "0.0.0.0", "HostPort": "8080"}]}, + "Binds": ["/data/myapp:/app/data:rw"], + }, + "Mounts": [ + {"Type": "bind", "Source": "/data/myapp", "Destination": "/app/data", "Mode": "rw"}, + ], + }, + "b2c3d4e5f6a7": { + "Id": "b2c3d4e5f6a7890123456789abcdef1234567890abcdef1234567890abcdef12", + "Names": ["myapp-worker"], + "Image": "myapp:latest", + "ImageID": "sha256:abc123def456789012345678901234567890abcdef1234567890abcdef123456", + "Created": "2026-03-01T10:00:00Z", + "State": { + "Status": "exited", + "Running": False, + "Paused": False, + "Restarting": False, + "OOMKilled": False, + "Dead": False, + "Pid": 0, + "ExitCode": 1, + "Error": "", + "StartedAt": "2026-03-01T10:00:08Z", + "FinishedAt": "2026-03-01T10:00:12Z", + }, + "Config": { + "Entrypoint": ["python3"], + "Cmd": ["-m", "celery", "-A", "tasks", "worker", "--loglevel=info"], + "WorkingDir": "/app", + "User": "1001", + "Env": [ + "APP_ENV=production", + "DATABASE_URL=postgresql://db:5432/myapp", + "CELERY_BROKER_URL=redis://redis:6379/0", + ], + }, + "HostConfig": { + "Memory": 536870912, + "MemorySwap": 1073741824, + "CpuQuota": 0, + "CpuPeriod": 0, + }, + "Mounts": [], + }, + "c3d4e5f6a7b8": { + "Id": "c3d4e5f6a7b8901234567890abcdef1234567890abcdef1234567890abcdef12", + "Names": ["nginx-proxy"], + "Image": "nginx:1.25", + "ImageID": "sha256:def456789012345678901234567890abcdef1234567890abcdef1234567890ab", + "Created": "2026-02-28T08:00:00Z", + "State": { + "Status": "running", + "Running": True, + "Paused": False, + "Restarting": False, + "OOMKilled": False, + "Dead": False, + "Pid": 12345, + "ExitCode": 0, + "Error": "", + "StartedAt": "2026-02-28T08:00:05Z", + 
"FinishedAt": "0001-01-01T00:00:00Z", + }, + "Config": { + "Entrypoint": ["/docker-entrypoint.sh"], + "Cmd": ["nginx", "-g", "daemon off;"], + "WorkingDir": "", + "User": "", + "Env": ["NGINX_PORT=80"], + "ExposedPorts": {"80/tcp": {}, "443/tcp": {}}, + }, + "HostConfig": { + "Memory": 0, + "MemorySwap": 0, + "CpuQuota": 0, + "CpuPeriod": 0, + "PortBindings": { + "80/tcp": [{"HostIp": "0.0.0.0", "HostPort": "80"}], + "443/tcp": [{"HostIp": "0.0.0.0", "HostPort": "443"}], + }, + }, + "Mounts": [ + {"Type": "bind", "Source": "/etc/nginx/conf.d", "Destination": "/etc/nginx/conf.d", "Mode": "ro"}, + ], + }, + "d4e5f6a7b8c9": { + "Id": "d4e5f6a7b8c9012345678901abcdef1234567890abcdef1234567890abcdef12", + "Names": ["postgres-db"], + "Image": "postgres:15", + "ImageID": "sha256:789012345678901234567890abcdef1234567890abcdef1234567890abcdef12", + "Created": "2026-02-25T12:00:00Z", + "State": { + "Status": "running", + "Running": True, + "Paused": False, + "Restarting": False, + "OOMKilled": False, + "Dead": False, + "Pid": 23456, + "ExitCode": 0, + "Error": "", + "StartedAt": "2026-02-25T12:00:10Z", + "FinishedAt": "0001-01-01T00:00:00Z", + }, + "Config": { + "Entrypoint": ["docker-entrypoint.sh"], + "Cmd": ["postgres"], + "WorkingDir": "", + "User": "postgres", + "Env": [ + "POSTGRES_DB=myapp", + "POSTGRES_USER=app", + "PGDATA=/var/lib/postgresql/data", + ], + "ExposedPorts": {"5432/tcp": {}}, + }, + "HostConfig": { + "Memory": 1073741824, + "MemorySwap": 2147483648, + "CpuQuota": 0, + "CpuPeriod": 0, + "PortBindings": {"5432/tcp": [{"HostIp": "127.0.0.1", "HostPort": "5432"}]}, + }, + "Mounts": [ + {"Type": "volume", "Source": "pgdata", "Destination": "/var/lib/postgresql/data", "Mode": "rw"}, + ], + }, +} + +LOGS = { + "myapp-web": ( + "INFO: Started server process [1]\n" + "INFO: Waiting for application startup.\n" + "INFO: Application startup complete.\n" + "INFO: Uvicorn running on http://0.0.0.0:8080\n" + "INFO: Loading ML model into memory...\n" + "INFO: Model 
size: 1.2GB\n" + "WARNING: Memory usage at 89% of limit (237MB/256MB)\n" + "INFO: Processing request batch (32 items)\n" + "WARNING: Memory usage at 95% of limit (248MB/256MB)\n" + "WARNING: Memory pressure detected, attempting GC\n" + "INFO: GC freed 12MB, usage now at 92%\n" + "INFO: Processing request batch (64 items)\n" + "CRITICAL: Memory usage exceeded limit\n" + "Killed\n" + ), + "myapp-worker": ( + "Traceback (most recent call last):\n" + ' File "/usr/lib/python3.11/runpy.py", line 198, in _run_module_as_main\n' + ' return _run_code(code, main_globals, None,\n' + ' File "/usr/lib/python3.11/runpy.py", line 88, in _run_code\n' + ' exec(code, run_globals)\n' + "ModuleNotFoundError: No module named 'celery'\n" + ), + "nginx-proxy": ( + "2026/02/28 08:00:05 [notice] 1#1: nginx/1.25.4\n" + "2026/02/28 08:00:05 [notice] 1#1: built by gcc 12.2.0\n" + "2026/02/28 08:00:05 [notice] 1#1: OS: Linux 5.14.0-362.el9.x86_64\n" + "2026/02/28 08:00:05 [notice] 1#1: start worker processes\n" + "2026/02/28 08:00:05 [notice] 1#1: start worker process 29\n" + "2026/02/28 08:00:05 [notice] 1#1: start worker process 30\n" + ), + "postgres-db": ( + "PostgreSQL init process complete; ready for start up.\n" + '2026-02-25 12:00:10.123 UTC [1] LOG: starting PostgreSQL 15.5\n' + '2026-02-25 12:00:10.456 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432\n' + '2026-02-25 12:00:10.789 UTC [1] LOG: database system is ready to accept connections\n' + ), +} + +IMAGES = [ + { + "Id": "sha256:abc123def456789012345678901234567890abcdef1234567890abcdef123456", + "RepoTags": ["myapp:latest"], + "Created": "2026-02-28T15:30:00Z", + "Size": 1345678901, + "VirtualSize": 1345678901, + "Labels": {"maintainer": "dev@myapp.io", "version": "2.1.0"}, + "Config": { + "Entrypoint": ["python3"], + "Cmd": ["-m", "uvicorn", "main:app"], + "WorkingDir": "/app", + "ExposedPorts": {"8080/tcp": {}}, + "Env": ["PYTHONDONTWRITEBYTECODE=1", "PYTHONUNBUFFERED=1"], + }, + }, + { + "Id": 
"sha256:def456789012345678901234567890abcdef1234567890abcdef1234567890ab", + "RepoTags": ["nginx:1.25"], + "Created": "2026-01-15T10:00:00Z", + "Size": 187654321, + "VirtualSize": 187654321, + "Labels": {"maintainer": "NGINX Docker Maintainers"}, + "Config": { + "Entrypoint": ["/docker-entrypoint.sh"], + "Cmd": ["nginx", "-g", "daemon off;"], + "ExposedPorts": {"80/tcp": {}}, + }, + }, + { + "Id": "sha256:789012345678901234567890abcdef1234567890abcdef1234567890abcdef12", + "RepoTags": ["postgres:15"], + "Created": "2026-01-20T12:00:00Z", + "Size": 412345678, + "VirtualSize": 412345678, + "Labels": {"maintainer": "PostgreSQL Docker Maintainers"}, + "Config": { + "Entrypoint": ["docker-entrypoint.sh"], + "Cmd": ["postgres"], + "ExposedPorts": {"5432/tcp": {}}, + }, + }, +] + + +def _find_container(name_or_id: str): + for cid, c in CONTAINERS.items(): + if name_or_id in (cid, c["Id"]): + return c + if name_or_id in c["Names"]: + return c + return None + + +@mcp.tool() +def container_list(all: bool = True) -> str: + """List containers. Set all=True to include stopped containers.""" + results = [] + for cid, c in CONTAINERS.items(): + if not all and not c["State"]["Running"]: + continue + status = c["State"]["Status"] + if c["State"]["OOMKilled"]: + status = "Exited (137) OOMKilled" + elif c["State"]["ExitCode"] != 0 and not c["State"]["Running"]: + status = f"Exited ({c['State']['ExitCode']})" + elif c["State"]["Running"]: + status = "Up 2 days" + results.append({ + "Id": cid, + "Names": c["Names"], + "Image": c["Image"], + "Status": status, + "Created": c["Created"], + "Ports": list(c["Config"].get("ExposedPorts", {}).keys()), + }) + return json.dumps(results, indent=2) + + +@mcp.tool() +def container_inspect(name: str) -> str: + """Inspect a container by name or ID. 
Returns detailed configuration and state.""" + c = _find_container(name) + if not c: + raise ValueError(f"no container with name or ID \"{name}\": no such container") + return json.dumps(c, indent=2) + + +@mcp.tool() +def container_logs(name: str, tail: int = 100) -> str: + """Get logs from a container by name or ID.""" + c = _find_container(name) + if not c: + raise ValueError(f"no container with name or ID \"{name}\": no such container") + cname = c["Names"][0] + log = LOGS.get(cname, f"No logs available for {cname}") + return log + + +@mcp.tool() +def container_stats(name: Optional[str] = None) -> str: + """Get resource usage statistics for running containers.""" + results = [] + for cid, c in CONTAINERS.items(): + if name and name not in c["Names"] and name != cid: + continue + if not c["State"]["Running"]: + continue + mem_limit = c["HostConfig"]["Memory"] or 8589934592 + results.append({ + "Id": cid, + "Name": c["Names"][0], + "CPUPerc": "12.5%", + "MemUsage": f"{mem_limit // 4} / {mem_limit}", + "MemPerc": "25.0%", + "NetIO": "1.2MB / 500KB", + "BlockIO": "50MB / 10MB", + "PIDs": 15, + }) + if not results: + return "No running containers found" + (f" matching '{name}'" if name else "") + return json.dumps(results, indent=2) + + +@mcp.tool() +def container_top(name: str) -> str: + """Display the running processes of a container.""" + c = _find_container(name) + if not c: + raise ValueError(f"no container with name or ID \"{name}\": no such container") + if not c["State"]["Running"]: + raise ValueError(f"container {c['Names'][0]} is not running") + return ( + "UID PID PPID C STIME TTY TIME CMD\n" + f"1001 12345 1 0 08:00 ? 
00:05:00 {' '.join(c['Config'].get('Cmd', ['']))}\n" + ) + + +@mcp.tool() +def image_list() -> str: + """List all container images.""" + results = [] + for img in IMAGES: + size_mb = img["Size"] // (1024 * 1024) + results.append({ + "Id": img["Id"][:19], + "RepoTags": img["RepoTags"], + "Created": img["Created"], + "Size": f"{size_mb}MB", + "Labels": img.get("Labels", {}), + }) + return json.dumps(results, indent=2) + + +@mcp.tool() +def image_inspect(name: str) -> str: + """Inspect a container image by name or ID.""" + for img in IMAGES: + if name in img["RepoTags"] or name == img["Id"] or img["Id"].startswith(f"sha256:{name}"): + return json.dumps(img, indent=2) + raise ValueError(f"image \"{name}\" not found") + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-developer__debug-container/instruction.md b/evaluation/without_skills/rh-developer__debug-container/instruction.md new file mode 100644 index 00000000..52862c6a --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/instruction.md @@ -0,0 +1,16 @@ +# Container Debugging Task + +You are a Red Hat developer. Two containers in your local environment have stopped working -- one exited with code 137 and another exited with code 1. Investigate why each container failed and recommend fixes. 
+ +## Requirements +- List all containers (including stopped ones) and identify which are failing +- For each failing container: inspect its configuration, review logs, and check resource limits +- Determine the root cause of each failure (e.g., memory exhaustion, missing dependency, misconfigured entrypoint) +- Recommend a specific fix for each container, including the corrected run command with proper cleanup of the failed container first +- Follow container security best practices (e.g., non-root user) in your fix commands +- Include verification commands to confirm the fix resolved the issue (e.g., checking container state for OOM status) +- If separate image variants would be a better long-term solution, explain that approach + +Document your investigation and fixes in `/root/report.md`. + +Use available tools to examine the environment. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__debug-container/solution/solve.sh b/evaluation/without_skills/rh-developer__debug-container/solution/solve.sh new file mode 100644 index 00000000..421b9a1a --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/solution/solve.sh @@ -0,0 +1,18 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Container Debug Report + +## Issue: Container exits immediately + +### Diagnosis +1. `podman inspect` → State.ExitCode: 1, State.OOMKilled: false +2. `podman logs` → Error: entrypoint not found +3. Check image entrypoint/CMD + +### Root Cause +Image entrypoint points to a binary that doesn't exist in the container. 
+ +### Fix +- Override entrypoint: `podman run --entrypoint /bin/sh myimage` +- Or fix Dockerfile CMD/ENTRYPOINT +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__debug-container/task.toml b/evaluation/without_skills/rh-developer__debug-container/task.toml new file mode 100644 index 00000000..cd098d3a --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-container" +name = "rh-developer Container Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-container", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__debug-container/tests/llm_judge.py b/evaluation/without_skills/rh-developer__debug-container/tests/llm_judge.py new file mode 100644 index 00000000..c11e081d --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "nonroot_user_and_cleanup", + "file": "/root/report.md", + "question": "Does the report include --user 1001 (non-root) in the corrected podman run command AND proper container cleanup (podman stop/rm) before rerunning?", + "reference": "A skilled report includes --user 1001 for container security and shows 'podman stop/rm' cleanup (often with 2>/dev/null || true error suppression) before the corrected run command. 
An unskilled report omits the --user flag and skips cleanup steps." + }, + { + "id": "image_variant_strategy", + "file": "/root/report.md", + "question": "Does the report recommend separate image variants/tags (e.g., using --build-arg VARIANT=web/worker) for different container roles as a long-term solution?", + "reference": "A skilled report explains that web and worker containers should use separate image tags built with --build-arg VARIANT, rather than sharing a single image. An unskilled report only suggests adding the missing dependency to the shared image." + }, + { + "id": "oomkilled_verification", + "file": "/root/report.md", + "question": "Does the report include verification commands using jq to inspect container state (e.g., podman inspect | jq '.State.OOMKilled')?", + "reference": "A skilled report includes 'podman inspect | jq .State.OOMKilled' to programmatically verify OOM status after fixing. An unskilled report checks logs or status manually without jq-based state inspection." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__debug-container/tests/test.sh b/evaluation/without_skills/rh-developer__debug-container/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + 
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-developer__debug-container/tests/test_outputs.py b/evaluation/without_skills/rh-developer__debug-container/tests/test_outputs.py new file mode 100644 index 00000000..34782966 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-container/tests/test_outputs.py @@ -0,0 +1,93 @@ +""" +Tests for rh-developer__debug-container per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_container(self): + content = read_report().lower() + assert "container" in content, "report should mention container" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_nonroot_user(self): + """Skill teaches running containers as non-root user (--user 1001). + Without skill, agents omit the --user flag in fix commands.""" + c = read_report() + assert "--user" in c or "user 1001" in c.lower(), ( + "should include --user flag for non-root container execution" + ) + + def test_image_variant_strategy(self): + """Skill teaches separate image tags/variants (--build-arg VARIANT=) for + different container roles. Without skill, agents use same image for all roles.""" + c = read_report() + assert "--build-arg" in c or "VARIANT=" in c or "separate image" in c.lower(), ( + "should recommend separate image variants for different roles (web vs worker)" + ) + + def test_oomkilled_state_inspection(self): + """Skill teaches verifying OOMKilled state via container inspect. 
+ Without skill, agents infer OOM from exit code only without inspecting state.""" + c = read_report() + assert any(t in c for t in [ + ".State.OOMKilled", "OOMKilled", "oomkilled", + "State.OOMKilled", "OOMKilled=true", "oomkilled=true", + ]) and any(t in c for t in [ + "inspect", "Inspect", "state", "State", + ]), "should inspect container state to verify OOMKilled" + + def test_cleanup_before_rerun(self): + """Skill teaches proper cleanup (stop + rm with error suppression) before + rerunning a failed container. Without skill, agents skip cleanup.""" + c = read_report() + assert "2>/dev/null" in c or ("podman stop" in c and "podman rm" in c) or ( + "podman rm" in c.lower() and "podman run" in c.lower() + ), "should include container cleanup before rerunning (stop/rm pattern)" + + def test_exit_code_137_oom_mapping(self): + """Skill teaches exit code 137 = OOMKilled, recommend memory increase.""" + c = read_report().lower() + assert ("137" in c or "oom" in c) and "memory" in c, ( + "should map exit 137 to OOM and address memory" + ) + + def test_memory_swap_configuration(self): + """Skill teaches --memory-swap flag for Podman to control total memory + (RAM + swap). Without skill, agents only adjust --memory without swap.""" + c = read_report().lower() + assert "memory-swap" in c or "swap" in c or "memory+swap" in c, ( + "should address memory-swap configuration for container memory limits" + ) + + def test_separate_worker_image(self): + """Skill teaches creating separate container images for different roles + (web vs worker) rather than running all roles from a single image. 
+ Without skill, agents patch the existing single image.""" + c = read_report().lower() + assert any(t in c for t in [ + "separate image", "worker image", "dockerfile.worker", + "dedicated image", "purpose-built", "role-specific", + ]) or ("web" in c and "worker" in c and "image" in c), ( + "should recommend separate images for different container roles" + ) diff --git a/evaluation/without_skills/rh-developer__debug-network/environment/Dockerfile b/evaluation/without_skills/rh-developer__debug-network/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-network/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git 
a/evaluation/without_skills/rh-developer__debug-network/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__debug-network/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-network/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__debug-network/instruction.md b/evaluation/without_skills/rh-developer__debug-network/instruction.md new file mode 100644 index 00000000..c74e95ff --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-network/instruction.md @@ -0,0 +1,12 @@ +# Network Debugging Task + +You are a Red Hat developer. An application is returning HTTP 503 errors when accessed via its Route. Investigate the networking configuration to find the issue. + +## Requirements +- Trace the request path (Route → Service → Pod) +- Identify the network misconfiguration +- Recommend a fix + +Use MCP tools to examine the cluster. Document your methodology, findings, and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__debug-network/solution/solve.sh b/evaluation/without_skills/rh-developer__debug-network/solution/solve.sh new file mode 100644 index 00000000..ef071a06 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-network/solution/solve.sh @@ -0,0 +1,19 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Network Debug Report + +## Issue: Route 503 for order-service + +### Root Cause +**Service selector mismatch**: Service selector `app: order-svc` does not match pod label `app: order-service`. + +### Diagnosis +1. Route status: Admitted ✓ +2. Service selector: `app: order-svc` +3. Pod labels: `app: order-service` +4. Endpoints: 0 (no matching pods) +5. 
Test: `oc run test-curl --rm -i --tty --image=curlimages/curl -- curl -v http://order-service.order-system.svc.cluster.local:8080` + +### Fix +Update Service selector to match pod labels: `app: order-service` +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__debug-network/task.toml b/evaluation/without_skills/rh-developer__debug-network/task.toml new file mode 100644 index 00000000..d8399696 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-network/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-network" +name = "rh-developer Network Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-network", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__debug-network/tests/llm_judge.py b/evaluation/without_skills/rh-developer__debug-network/tests/llm_judge.py new file mode 100644 index 00000000..3eaeb7d0 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-network/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "route_admitted_vs_exists", + "file": "/root/report.md", + "question": "Does the report check the Route Admitted condition (from the router) rather than just verifying the Route resource exists?", + "reference": "A skilled report checks the Route's Admitted condition which indicates the router has accepted and configured the route. 
An unskilled report only verifies the Route exists without checking its admission status." + }, + { + "id": "tls_termination_nuances", + "file": "/root/report.md", + "question": "Does the report address TLS termination nuances such as reencrypt requiring destinationCA or passthrough with HTTP backend mismatch?", + "reference": "A skilled report explains that reencrypt TLS termination requires a destinationCA certificate, and that passthrough routes with HTTP-only backends will fail. An unskilled report treats all TLS types as equivalent." + }, + { + "id": "in_cluster_debug_pattern", + "file": "/root/report.md", + "question": "Does the report use a disposable in-cluster curl pod to test internal Service connectivity?", + "reference": "A skilled report creates a temporary curl pod inside the cluster to test Service connectivity from within. An unskilled report only tests external Route access." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__debug-network/tests/test.sh b/evaluation/without_skills/rh-developer__debug-network/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-network/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-developer__debug-network/tests/test_outputs.py b/evaluation/without_skills/rh-developer__debug-network/tests/test_outputs.py new file mode 100644 index 00000000..60293420 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-network/tests/test_outputs.py @@ -0,0 +1,96 @@ +""" +Tests for rh-developer__debug-network per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import re +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_network_issue(self): + content = read_report().lower() + assert "503" in content or "network" in content or "route" in content, ( + "report should mention the network issue" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_route_admitted_condition(self): + """Skill teaches Route Admitted condition (from the router) is distinct from + Route just existing. Without skill, agents only check if Route exists.""" + c = read_report().lower() + assert "admitted" in c or "route admitted" in c or ("condition" in c and "route" in c), ( + "should check Route Admitted condition (not just Route existence)" + ) + + def test_empty_endpoints_diagnosis(self): + """Skill teaches checking Endpoints object for empty subsets as the root + cause of 503 errors. 
Without skill, agents check pod status but not the + Endpoints object directly.""" + c = read_report().lower() + assert ("endpoint" in c and any(t in c for t in [ + "empty", "no endpoint", "none", "no backend", "no subsets", + "0 endpoint", "missing", + ])) or "oc get endpoints" in c or "get ep " in c, ( + "should diagnose empty Endpoints as root cause of 503" + ) + + def test_curl_pod_in_cluster_debug(self): + """Skill teaches using a disposable in-cluster curl pod for debugging + internal connectivity. Without skill, agents test externally only.""" + c = read_report().lower() + assert ("curl" in c and "pod" in c) or "debug pod" in c or re.search(r"run.*curl", c) or ( + "cluster" in c and "curl" in c + ), "should use in-cluster curl pod for connectivity debugging" + + def test_connectivity_path_tracing(self): + """Skill teaches tracing Route → Service → Endpoints → Pod path.""" + c = read_report().lower() + path_terms = ["route", "service", "endpoint", "pod"] + mentioned = sum(1 for t in path_terms if t in c) + assert mentioned >= 3, "should trace connectivity path (Route→Service→Endpoints→Pod)" + + def test_selector_label_mismatch(self): + """Skill teaches 503 often means selector doesn't match pod labels.""" + c = read_report().lower() + assert any(t in c for t in ["selector", "label", "match", "mismatch"]) and any(t in c for t in [ + "endpoint", "503" + ]), "should identify selector/label mismatch causing no endpoints" + + def test_oc_patch_fix_command(self): + """Skill teaches using oc patch or oc edit for Service selector fixes. 
+ Without skill, agents describe the fix narratively without the actual + command to apply it.""" + c = read_report().lower() + assert any(t in c for t in [ + "oc patch", "oc edit", "kubectl patch", "oc label", + ]) or ("patch" in c and "service" in c), ( + "should include oc patch/edit command for Service selector fix" + ) + + def test_network_policy_awareness(self): + """Skill teaches checking NetworkPolicy as a potential cause of network + issues. Without skill, agents focus only on Service/Route without + considering NetworkPolicy restrictions.""" + c = read_report() + assert "NetworkPolicy" in c or "network policy" in c.lower() or ( + "networkpolic" in c.lower() + ), "should check NetworkPolicy as potential network restriction" diff --git a/evaluation/without_skills/rh-developer__debug-pipeline/environment/Dockerfile b/evaluation/without_skills/rh-developer__debug-pipeline/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pipeline/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs 
/root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__debug-pipeline/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__debug-pipeline/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pipeline/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__debug-pipeline/instruction.md b/evaluation/without_skills/rh-developer__debug-pipeline/instruction.md new file mode 100644 index 00000000..e65370d4 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pipeline/instruction.md @@ -0,0 +1,12 @@ +# Pipeline Debugging Task + +You are a Red Hat developer. A Tekton PipelineRun has failed. Investigate the pipeline to identify which task failed and why. + +## Requirements +- Examine the PipelineRun status and task results +- Identify the failing task and step +- Recommend a fix or retry strategy + +Use MCP tools to examine the cluster. Document your methodology, findings, and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__debug-pipeline/solution/solve.sh b/evaluation/without_skills/rh-developer__debug-pipeline/solution/solve.sh new file mode 100644 index 00000000..f879ab73 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pipeline/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Pipeline Debug Report + +## Failed PipelineRun Analysis + +### Failure Location +- PipelineRun: build-and-deploy-run +- Failed Task: integration-test +- Failed Step: `step-test` (Tekton names step containers as `step-<name>`) + +### Step Logs +Extract from TaskRun pod, container `step-test`. + +### Root Cause +Integration test failed because the service endpoint returned 503. 
+ +### Fix +- Fix the underlying service issue first +- Retry: `tkn pipeline start build-and-deploy --use-pipelinerun build-and-deploy-run` +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__debug-pipeline/task.toml b/evaluation/without_skills/rh-developer__debug-pipeline/task.toml new file mode 100644 index 00000000..d6025adc --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pipeline/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-pipeline" +name = "rh-developer Pipeline Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-pipeline", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__debug-pipeline/tests/llm_judge.py b/evaluation/without_skills/rh-developer__debug-pipeline/tests/llm_judge.py new file mode 100644 index 00000000..ed51f96a --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pipeline/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "step_container_naming", + "file": "/root/report.md", + "question": "Does the report reference the step-<name> container naming convention used in TaskRun pods for targeting logs?", + "reference": "A skilled report knows that Tekton step containers are named step-<name> and uses this to target specific step logs. An unskilled report retrieves pod logs generically without step-level targeting." 
+ }, + { + "id": "taskrun_label_filtering", + "file": "/root/report.md", + "question": "Does the report describe filtering or selecting TaskRuns by their parent PipelineRun (e.g., using tekton.dev/pipelineRun label or equivalent selector), rather than listing all TaskRuns in the namespace?", + "reference": "A skilled report filters TaskRuns by the parent PipelineRun label (tekton.dev/pipelineRun=<name>) to isolate the relevant failure. An unskilled report lists all TaskRuns or checks them one by one without label-based filtering." + }, + { + "id": "hierarchy_diagnosis", + "file": "/root/report.md", + "question": "Does the report systematically drill from PipelineRun → failed TaskRun → step container logs to isolate the failure?", + "reference": "A skilled report follows the PipelineRun→TaskRun→Step hierarchy. An unskilled report checks PipelineRun status without drilling into TaskRun step-level details." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__debug-pipeline/tests/test.sh b/evaluation/without_skills/rh-developer__debug-pipeline/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pipeline/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-developer__debug-pipeline/tests/test_outputs.py b/evaluation/without_skills/rh-developer__debug-pipeline/tests/test_outputs.py new file mode 100644 index 00000000..8112bbd2 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pipeline/tests/test_outputs.py @@ -0,0 +1,53 @@ +""" +Tests for rh-developer__debug-pipeline per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_pipeline(self): + content = read_report().lower() + assert "pipeline" in content, "report should mention pipeline" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_pipelinerun_taskrun_hierarchy(self): + """Skill teaches PipelineRun → TaskRun → Step hierarchy to find failure.""" + c = read_report().lower() + assert any(t in c for t in ["pipelinerun", "pipeline run"]) and any(t in c for t in [ + "taskrun", "task run", "task" + ]), "should drill PipelineRun→TaskRun hierarchy" + + def test_concrete_remediation(self): + """Skill teaches distinguishing transient vs config fix needed.""" + c = read_report().lower() + assert any(t in c for t in ["retry", "rerun", "fix", "remediat", "resolv"]), ( + "should provide remediation guidance" + ) + + def test_taskrun_label_filter(self): + """Docs teach filtering TaskRuns by parent pipeline using + tekton.dev/pipelineRun= label. 
Without docs, agents list all TaskRuns.""" + c = read_report().lower() + assert "tekton.dev/pipelinerun" in c or ("label" in c and "pipelinerun" in c) or ( + "filter" in c and "taskrun" in c + ), "should filter TaskRuns by pipelineRun label" diff --git a/evaluation/without_skills/rh-developer__debug-pod/environment/Dockerfile b/evaluation/without_skills/rh-developer__debug-pod/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pod/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__debug-pod/environment/mcp-servers/mock-openshift-mcp.py 
b/evaluation/without_skills/rh-developer__debug-pod/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pod/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__debug-pod/instruction.md b/evaluation/without_skills/rh-developer__debug-pod/instruction.md new file mode 100644 index 00000000..9a983f81 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pod/instruction.md @@ -0,0 +1,14 @@ +# Pod Debugging Task + +You are a Red Hat developer. A pod in the `web-frontend` namespace keeps crashing and restarting. Your team needs you to investigate, identify the root cause, and recommend a fix. + +## Requirements +- Check the pod status and identify the failure pattern (exit code, restart count, state) +- Examine container logs, including logs from previous crashed containers +- Analyze resource limits and requests to determine if the crash is resource-related +- Review namespace events for warnings or errors related to the pod +- Identify the root cause and recommend a specific fix + +Use MCP tools to examine the cluster. Document your methodology, findings, and recommended remediation in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__debug-pod/solution/solve.sh b/evaluation/without_skills/rh-developer__debug-pod/solution/solve.sh new file mode 100644 index 00000000..dca1ff71 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pod/solution/solve.sh @@ -0,0 +1,39 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Pod Debug Report + +## Investigation Summary +A pod in the web-frontend namespace is crashing repeatedly. 
+ +## Pod Status +- Namespace: web-frontend +- Pod: web-frontend (CrashLoopBackOff) +- Exit code: 137 (OOMKilled — SIGKILL, memory limit exceeded) +- Restart count: 8 + +## Diagnosis Methodology +1. Listed pods in web-frontend namespace — found pod in CrashLoopBackOff +2. Examined container status — exit code 137, reason: OOMKilled +3. Checked previous container logs — server starts but gets Killed +4. Reviewed events — OOMKilled warning with memory limit 64Mi +5. Analyzed resource limits — memory limit 64Mi is too low for Node.js + +## Root Cause +Exit 137 = 128 + 9 (SIGKILL). The container was OOMKilled because the memory limit of 64Mi is insufficient for a Node.js application. The application starts normally but is killed when memory usage exceeds the limit during initialization of middleware. + +## Events Analysis +- Warning: OOMKilled — Container exceeded memory limit of 64Mi +- Warning: BackOff — Back-off restarting failed container + +## Recommended Fix +Increase the memory limit for the web-frontend deployment: +- Current: requests=32Mi, limits=64Mi +- Recommended: requests=128Mi, limits=256Mi (or higher depending on app needs) + +This can be applied by patching the deployment resource limits. 
+ +## Additional Notes +- The application logs show it starts successfully but is killed during middleware initialization +- No memory leak — the base memory requirement simply exceeds the configured limit +- Consider monitoring memory usage after the fix to right-size the limits +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__debug-pod/task.toml b/evaluation/without_skills/rh-developer__debug-pod/task.toml new file mode 100644 index 00000000..89bac572 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pod/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-pod" +name = "rh-developer Pod Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-pod", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__debug-pod/tests/llm_judge.py b/evaluation/without_skills/rh-developer__debug-pod/tests/llm_judge.py new file mode 100644 index 00000000..3bad1517 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pod/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "previous_logs_methodology", + "file": "/root/report.md", + "question": "Does the report use --previous flag to retrieve logs from crashed containers when restarts are detected?", + "reference": "A skilled report uses --previous to get logs from the terminated container instance when restart count > 0. 
An unskilled report only checks current container logs, missing crash context." + }, + { + "id": "readiness_endpoint_link", + "file": "/root/report.md", + "question": "Does the report explain that readiness probe failures remove the pod from Service endpoints, causing traffic loss?", + "reference": "A skilled report explains the readiness→endpoints relationship: failed readiness probes remove the pod from Service endpoints. An unskilled report treats readiness as only affecting pod status." + }, + { + "id": "oom_diagnosis_and_fix", + "file": "/root/report.md", + "question": "Does the report map exit code 137 to OOMKilled and provide concrete oc set resources or oc patch commands to increase memory limits?", + "reference": "A skilled report maps 137→OOM and provides actionable oc commands to fix resource limits. An unskilled report may identify OOM but gives vague recommendations." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__debug-pod/tests/test.sh b/evaluation/without_skills/rh-developer__debug-pod/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pod/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-developer__debug-pod/tests/test_outputs.py b/evaluation/without_skills/rh-developer__debug-pod/tests/test_outputs.py new file mode 100644 index 00000000..fda1b3ed --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-pod/tests/test_outputs.py @@ -0,0 +1,75 @@ +""" +Tests for rh-developer__debug-pod per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_pod_or_container(self): + content = read_report().lower() + assert "pod" in content or "container" in content, "report should mention pod or container" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 200, "report should have substantial content" + + +class TestSkillDependent: + def test_previous_logs_flag(self): + """Skill teaches using --previous to get logs from crashed container + when restarts > 0. Without skill, agents only check current logs.""" + c = read_report() + assert "--previous" in c or "previous" in c.lower(), ( + "should use --previous flag to get logs from crashed container" + ) + + def test_readiness_removes_endpoints(self): + """Skill teaches that readiness probe failures remove pod from Service + endpoints, causing traffic loss. 
Without skill, agents miss this link.""" + c = read_report().lower() + assert ("readiness" in c and "endpoint" in c) or ("readiness" in c and "service" in c) or ( + "readiness" in c and "traffic" in c + ), "should explain readiness failures remove Service endpoints" + + def test_exit_137_oomkilled_mapping(self): + """Skill teaches exit code 137 = OOMKilled, map to memory limit.""" + c = read_report().lower() + assert ("137" in c or "oom" in c or "oomkill" in c) and any(t in c for t in [ + "memory", "limit", "increase" + ]), "should map exit 137 to OOMKilled and memory limit" + + def test_concrete_remediation_command(self): + """Skill teaches oc set resources deployment/... --limits=memory=.""" + c = read_report().lower() + assert any(t in c for t in ["oc set resources", "oc patch", "memory=", "limits"]) or ( + "```" in read_report() and "oc" in c + ), "should include concrete oc remediation command" + + def test_resource_analysis(self): + """Skill teaches analyzing memory request/limit for OOM remediation.""" + c = read_report().lower() + assert any(t in c for t in ["limit", "request"]) and any(t in c for t in [ + "memory", "resource", "increase" + ]), "should analyze resource limits for OOM" + + def test_events_correlation(self): + """Skill teaches checking events for scheduling, OOM, and image pull failures.""" + c = read_report().lower() + assert "event" in c and any(t in c for t in [ + "oom", "schedule", "pull", "fail", "kill", "backoff" + ]), "should correlate pod events with failure cause" diff --git a/evaluation/without_skills/rh-developer__debug-rhel/environment/Dockerfile b/evaluation/without_skills/rh-developer__debug-rhel/environment/Dockerfile new file mode 100644 index 00000000..4544bdf2 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + 
jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhel-system": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhel-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. 
api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": 
"500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + 
"status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call 
last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + 
"status": 
"Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: 
Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": 
"api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-rhel-mcp.py b/evaluation/without_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-rhel-mcp.py new file mode 100644 index 00000000..314f0e3b --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/environment/mcp-servers/mock-rhel-mcp.py @@ -0,0 +1,335 @@ +#!/usr/bin/env python3 +"""Mock RHEL System MCP Server for RHEL debugging evaluation. + +Simulates a RHEL 9 host with a failing service. Exposes system-level +diagnostic tools (systemctl, journalctl, getenforce, firewall-cmd, ausearch) +as MCP tools so the agent can diagnose the issue. + +Scenario: + Host: app-server-01.example.com (RHEL 9.3) + Failing service: myapp.service + Root causes: + 1. SELinux denial: httpd_t cannot bind to port 9090 + 2. Firewall: port 9090/tcp is not open + 3. 
Service configuration references correct binary but SELinux blocks it +""" + +import json +from typing import Optional + +from fastmcp import FastMCP + +mcp = FastMCP("rhel-system") + +HOST = "app-server-01.example.com" +RHEL_VER = "9.3" + +SERVICES = { + "myapp.service": { + "loaded": True, + "enabled": True, + "active": "failed", + "sub": "failed", + "description": "My Application Service", + "main_pid": 0, + "exit_code": "exited", + "exit_status": 1, + "exec_start": "/opt/myapp/bin/myapp-server --port 9090 --config /etc/myapp/config.yaml", + "user": "myapp", + "group": "myapp", + "working_directory": "/opt/myapp", + "environment": "APP_ENV=production DB_HOST=localhost DB_PORT=5432", + "restart": "on-failure", + "restart_sec": 5, + "status_output": ( + "● myapp.service - My Application Service\n" + " Loaded: loaded (/etc/systemd/system/myapp.service; enabled; preset: disabled)\n" + " Active: failed (Result: exit-code) since Sun 2026-03-01 18:30:45 UTC; 17h ago\n" + " Process: 45678 ExecStart=/opt/myapp/bin/myapp-server --port 9090 --config /etc/myapp/config.yaml (code=exited, status=1/FAILURE)\n" + " Main PID: 45678 (code=exited, status=1/FAILURE)\n" + " CPU: 125ms\n" + "\n" + "Mar 01 18:30:44 app-server-01 systemd[1]: Starting My Application Service...\n" + "Mar 01 18:30:44 app-server-01 myapp-server[45678]: Starting myapp-server v2.1.0\n" + "Mar 01 18:30:44 app-server-01 myapp-server[45678]: Loading configuration from /etc/myapp/config.yaml\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Configuration loaded successfully\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Attempting to bind to 0.0.0.0:9090\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Error: Permission denied: bind to 0.0.0.0:9090\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Fatal: Cannot start server, exiting\n" + "Mar 01 18:30:45 app-server-01 systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE\n" + "Mar 01 18:30:45 app-server-01 
systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + ), + }, + "sshd.service": { + "loaded": True, + "enabled": True, + "active": "active", + "sub": "running", + "description": "OpenSSH server daemon", + "main_pid": 1234, + "exit_code": "", + "exit_status": 0, + }, + "firewalld.service": { + "loaded": True, + "enabled": True, + "active": "active", + "sub": "running", + "description": "firewalld - dynamic firewall daemon", + "main_pid": 2345, + "exit_code": "", + "exit_status": 0, + }, + "postgresql.service": { + "loaded": True, + "enabled": True, + "active": "active", + "sub": "running", + "description": "PostgreSQL database server", + "main_pid": 3456, + "exit_code": "", + "exit_status": 0, + }, +} + +JOURNAL_LOGS = { + "myapp.service": ( + "-- Journal begins at Sat 2026-02-28 00:00:00 UTC, ends at Sun 2026-03-02 12:00:00 UTC. --\n" + "Mar 01 18:30:44 app-server-01 systemd[1]: Starting My Application Service...\n" + "Mar 01 18:30:44 app-server-01 myapp-server[45678]: Starting myapp-server v2.1.0\n" + "Mar 01 18:30:44 app-server-01 myapp-server[45678]: Loading configuration from /etc/myapp/config.yaml\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Configuration loaded successfully\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Connecting to database at localhost:5432... 
OK\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Attempting to bind to 0.0.0.0:9090\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Error: Permission denied: bind to 0.0.0.0:9090\n" + "Mar 01 18:30:45 app-server-01 myapp-server[45678]: Fatal: Cannot start server, exiting\n" + "Mar 01 18:30:45 app-server-01 systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE\n" + "Mar 01 18:30:45 app-server-01 systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + "Mar 01 18:30:50 app-server-01 systemd[1]: myapp.service: Scheduled restart job, restart counter is at 1.\n" + "Mar 01 18:30:50 app-server-01 systemd[1]: Starting My Application Service...\n" + "Mar 01 18:30:50 app-server-01 myapp-server[45690]: Starting myapp-server v2.1.0\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Loading configuration from /etc/myapp/config.yaml\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Configuration loaded successfully\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Connecting to database at localhost:5432... 
OK\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Attempting to bind to 0.0.0.0:9090\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Error: Permission denied: bind to 0.0.0.0:9090\n" + "Mar 01 18:30:51 app-server-01 myapp-server[45690]: Fatal: Cannot start server, exiting\n" + "Mar 01 18:30:51 app-server-01 systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE\n" + "Mar 01 18:30:51 app-server-01 systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + "Mar 01 18:30:56 app-server-01 systemd[1]: myapp.service: Scheduled restart job, restart counter is at 2.\n" + "Mar 01 18:30:56 app-server-01 systemd[1]: Starting My Application Service...\n" + "Mar 01 18:30:56 app-server-01 myapp-server[45705]: Starting myapp-server v2.1.0\n" + "Mar 01 18:30:57 app-server-01 myapp-server[45705]: Error: Permission denied: bind to 0.0.0.0:9090\n" + "Mar 01 18:30:57 app-server-01 myapp-server[45705]: Fatal: Cannot start server, exiting\n" + "Mar 01 18:30:57 app-server-01 systemd[1]: myapp.service: Main process exited, code=exited, status=1/FAILURE\n" + "Mar 01 18:30:57 app-server-01 systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + "Mar 01 18:30:57 app-server-01 systemd[1]: myapp.service: Start request repeated too quickly.\n" + "Mar 01 18:30:57 app-server-01 systemd[1]: myapp.service: Failed with result 'exit-code'.\n" + ), +} + + +@mcp.tool() +def systemctl_status(service: str) -> str: + """Get the status of a systemd service (equivalent to 'systemctl status ').""" + svc = SERVICES.get(service) + if not svc: + return f"Unit {service} could not be found." 
+ + if svc.get("status_output"): + return svc["status_output"] + + state = "active (running)" if svc["active"] == "active" else "failed" + return ( + f"● {service} - {svc['description']}\n" + f" Loaded: loaded (/usr/lib/systemd/system/{service}; " + f"{'enabled' if svc['enabled'] else 'disabled'}; preset: disabled)\n" + f" Active: {state}\n" + f" Main PID: {svc['main_pid']}\n" + ) + + +@mcp.tool() +def systemctl_list_failed() -> str: + """List all failed systemd services (equivalent to 'systemctl --failed').""" + failed = [(name, svc) for name, svc in SERVICES.items() if svc["active"] == "failed"] + if not failed: + return "0 loaded units listed." + + lines = [" UNIT LOAD ACTIVE SUB DESCRIPTION"] + for name, svc in failed: + lines.append( + f" {name:<24s} loaded failed failed {svc['description']}" + ) + lines.append(f"\n{len(failed)} loaded units listed.") + return "\n".join(lines) + + +@mcp.tool() +def journalctl(unit: Optional[str] = None, lines: int = 100, priority: Optional[str] = None) -> str: + """Get journal logs, optionally filtered by unit or priority.""" + if unit and unit in JOURNAL_LOGS: + log = JOURNAL_LOGS[unit] + if priority and priority in ("err", "3"): + return "\n".join( + line for line in log.split("\n") + if "Error" in line or "Fatal" in line or "FAILURE" in line or "failed" in line.lower() + ) + return log + + if unit: + return f"-- No entries for unit {unit} --" + + return ( + "-- Journal begins at Sat 2026-02-28 00:00:00 UTC --\n" + "Mar 02 12:00:00 app-server-01 kernel: Linux version 5.14.0-362.el9.x86_64\n" + "Mar 02 12:00:00 app-server-01 systemd[1]: Started system.\n" + ) + + +@mcp.tool() +def getenforce() -> str: + """Get SELinux enforcement mode (equivalent to 'getenforce').""" + return "Enforcing" + + +@mcp.tool() +def ausearch_avc(recent: bool = True, comm: Optional[str] = None) -> str: + """Search for SELinux AVC denial messages (equivalent to 'ausearch -m AVC').""" + denials = [ + { + "timestamp": "Mar 01 18:30:45", + "type": "AVC", 
+ "result": "denied", + "permission": "name_bind", + "scontext": "system_u:system_r:httpd_t:s0", + "tcontext": "system_u:object_r:unreserved_port_t:s0", + "tclass": "tcp_socket", + "comm": "myapp-server", + "port": 9090, + }, + { + "timestamp": "Mar 01 18:30:50", + "type": "AVC", + "result": "denied", + "permission": "name_bind", + "scontext": "system_u:system_r:httpd_t:s0", + "tcontext": "system_u:object_r:unreserved_port_t:s0", + "tclass": "tcp_socket", + "comm": "myapp-server", + "port": 9090, + }, + { + "timestamp": "Mar 01 18:30:56", + "type": "AVC", + "result": "denied", + "permission": "name_bind", + "scontext": "system_u:system_r:httpd_t:s0", + "tcontext": "system_u:object_r:unreserved_port_t:s0", + "tclass": "tcp_socket", + "comm": "myapp-server", + "port": 9090, + }, + ] + + if comm: + denials = [d for d in denials if d["comm"] == comm] + + if not denials: + return "No AVC denials found." + + lines = [] + for d in denials: + lines.append( + f"----\n" + f"time->{d['timestamp']}\n" + f"type=AVC msg=audit: avc: denied {{ {d['permission']} }} for " + f"comm=\"{d['comm']}\" " + f"src={d['port']} " + f"scontext={d['scontext']} " + f"tcontext={d['tcontext']} " + f"tclass={d['tclass']} permissive=0" + ) + return "\n".join(lines) + + +@mcp.tool() +def firewall_cmd_state() -> str: + """Check if firewalld is running (equivalent to 'firewall-cmd --state').""" + return "running" + + +@mcp.tool() +def firewall_cmd_list_all() -> str: + """List all firewall rules for the default zone (equivalent to 'firewall-cmd --list-all').""" + return ( + "public (active)\n" + " target: default\n" + " icmp-block-inversion: no\n" + " interfaces: eth0\n" + " sources: \n" + " services: cockpit dhcpv6-client ssh\n" + " ports: 5432/tcp\n" + " protocols: \n" + " forward: yes\n" + " masquerade: no\n" + " forward-ports: \n" + " source-ports: \n" + " icmp-blocks: \n" + " rich rules: \n" + ) + + +@mcp.tool() +def firewall_cmd_query_port(port: str) -> str: + """Check if a specific port is open 
in the firewall (e.g. '9090/tcp').""" + open_ports = {"5432/tcp", "22/tcp"} + if port in open_ports: + return "yes" + return "no" + + +@mcp.tool() +def semanage_port_list(port_type: Optional[str] = None) -> str: + """List SELinux port type assignments (equivalent to 'semanage port -l').""" + entries = [ + ("http_port_t", "tcp", "80, 81, 443, 488, 8008, 8009, 8443, 9000"), + ("ssh_port_t", "tcp", "22"), + ("postgresql_port_t", "tcp", "5432"), + ("unreserved_port_t", "tcp", "1024-32767"), + ("unreserved_port_t", "udp", "1024-32767"), + ] + if port_type: + entries = [(t, p, ports) for t, p, ports in entries if t == port_type] + + lines = ["SELinux Port Type Proto Port Number"] + for t, p, ports in entries: + lines.append(f"{t:<26s} {p:<8s} {ports}") + return "\n".join(lines) + + +@mcp.tool() +def system_info() -> str: + """Get basic system information (hostname, OS, kernel, uptime).""" + return json.dumps({ + "hostname": HOST, + "os": f"Red Hat Enterprise Linux {RHEL_VER}", + "kernel": "5.14.0-362.el9.x86_64", + "arch": "x86_64", + "uptime": "15 days, 3:42", + "load_average": "0.45, 0.38, 0.32", + "memory": { + "total": "16384 MB", + "used": "5120 MB", + "free": "8192 MB", + "available": "11264 MB", + }, + "disk": { + "/": {"total": "50G", "used": "18G", "available": "32G", "use_percent": "36%"}, + "/var": {"total": "100G", "used": "45G", "available": "55G", "use_percent": "45%"}, + }, + }, indent=2) + + +if __name__ == "__main__": + mcp.run(transport="stdio") diff --git a/evaluation/without_skills/rh-developer__debug-rhel/instruction.md b/evaluation/without_skills/rh-developer__debug-rhel/instruction.md new file mode 100644 index 00000000..ca2ade3a --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/instruction.md @@ -0,0 +1,12 @@ +# RHEL System Debugging Task + +You are a Red Hat developer. A RHEL-based service is failing to start or accept connections. Investigate the system configuration to identify the issue. 
+ +## Requirements +- Check service status, SELinux, and firewall configuration +- Identify the system-level root cause +- Recommend a fix + +Use available tools to examine the environment. Document your methodology, findings, and recommendations in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__debug-rhel/solution/solve.sh b/evaluation/without_skills/rh-developer__debug-rhel/solution/solve.sh new file mode 100644 index 00000000..350dd5d5 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/solution/solve.sh @@ -0,0 +1,36 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# RHEL Debug Report + +## Issue: myapp.service can't bind to port 9090 + +### Systemd Check +```bash +systemctl status myapp.service +``` + +### Journal Logs +```bash +journalctl -u myapp.service -n 100 +``` + +### SELinux Check +```bash +getenforce +ausearch -m AVC -ts recent +# Found: denied name_bind on port 9090 +``` + +### Fix: Add port to SELinux +```bash +sudo semanage port -a -t http_port_t -p tcp 9090 +sudo restorecon -Rv /opt/myapp +``` + +### Firewall Check +```bash +sudo firewall-cmd --list-all +sudo firewall-cmd --permanent --add-port=9090/tcp +sudo firewall-cmd --reload +``` +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__debug-rhel/task.toml b/evaluation/without_skills/rh-developer__debug-rhel/task.toml new file mode 100644 index 00000000..32fb504b --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__debug-rhel" +name = "rh-developer RHEL Deployment Debugging Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "debug-rhel", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
+LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__debug-rhel/tests/llm_judge.py b/evaluation/without_skills/rh-developer__debug-rhel/tests/llm_judge.py new file mode 100644 index 00000000..e170f4bb --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "ausearch_avc_workflow", + "file": "/root/report.md", + "question": "Does the report use ausearch -m AVC for investigating SELinux denials, rather than generic SELinux commands?", + "reference": "A skilled report uses 'ausearch -m AVC -ts recent' to find recent SELinux AVC denials. An unskilled report checks getenforce or sestatus without examining specific denials." + }, + { + "id": "semanage_port_labeling", + "file": "/root/report.md", + "question": "Does the report use semanage port for labeling nonstandard bind ports in SELinux?", + "reference": "A skilled report uses 'semanage port -a -t http_port_t -p tcp ' for nonstandard ports. An unskilled report suggests disabling SELinux or only uses setsebool." + }, + { + "id": "concrete_rhel_remediation", + "file": "/root/report.md", + "question": "Does the report provide concrete systemctl, firewall-cmd, and semanage/restorecon commands for RHEL troubleshooting?", + "reference": "A skilled report provides specific commands for each layer: systemctl restart for services, firewall-cmd --add-port for networking, semanage+restorecon for SELinux. An unskilled report gives high-level suggestions." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__debug-rhel/tests/test.sh b/evaluation/without_skills/rh-developer__debug-rhel/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-developer__debug-rhel/tests/test_outputs.py b/evaluation/without_skills/rh-developer__debug-rhel/tests/test_outputs.py new file mode 100644 index 00000000..6ba9216b --- /dev/null +++ b/evaluation/without_skills/rh-developer__debug-rhel/tests/test_outputs.py @@ -0,0 +1,97 @@ +""" +Tests for rh-developer__debug-rhel per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_rhel_or_system(self): + content = read_report().lower() + assert "rhel" in content or "system" in content or "service" in content, ( + "report should mention RHEL or system" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_ausearch_avc_command(self): + """Skill teaches ausearch -m AVC -ts recent for recent SELinux denials. + Without skill, agents use generic SELinux checks without ausearch.""" + c = read_report().lower() + assert "ausearch" in c, ( + "should use ausearch for SELinux AVC denial investigation" + ) + + def test_semanage_port_labeling(self): + """Skill teaches semanage port for nonstandard bind port SELinux labeling. 
+ Without skill, agents skip port-level SELinux context management.""" + c = read_report().lower() + assert "semanage port" in c or ("semanage" in c and "port" in c), ( + "should use semanage port for nonstandard port SELinux labeling" + ) + + def test_systemd_journal_workflow(self): + """Skill teaches systemctl status + journalctl -u for service logs.""" + c = read_report().lower() + assert any(t in c for t in ["systemctl", "journalctl"]) and any(t in c for t in [ + "status", "-u", "service", "log" + ]), "should use systemd/journal workflow" + + def test_firewall_cmd(self): + """Skill teaches firewall-cmd for port management.""" + c = read_report().lower() + assert "firewall-cmd" in c or "firewall" in c, ( + "should check firewall configuration" + ) + + def test_concrete_remediation(self): + """Skill teaches concrete remediation commands for RHEL issues.""" + c = read_report().lower() + assert any(t in c for t in ["systemctl restart", "firewall-cmd", "semanage", "restorecon"]) or ( + "```" in read_report() and any(t in c for t in ["sudo", "systemctl"]) + ), "should include concrete RHEL remediation commands" + + def test_permanent_firewall_flag(self): + """Skill teaches using --permanent flag with firewall-cmd to persist rules + across reboots. Without skill, agents use firewall-cmd without --permanent, + creating rules that are lost on reboot.""" + c = read_report() + assert "--permanent" in c, ( + "should use --permanent flag with firewall-cmd for persistent rules" + ) + + def test_http_port_t_selinux_type(self): + """Skill teaches the specific SELinux type http_port_t for web service ports. 
+ Without skill, agents use generic semanage commands without specifying the + correct SELinux type for HTTP ports.""" + c = read_report() + assert "http_port_t" in c, ( + "should reference http_port_t SELinux type for port labeling" + ) + + def test_getenforce_check(self): + """Skill teaches using getenforce to verify SELinux mode (Enforcing/Permissive) + as a first diagnostic step. Without skill, agents jump to specific SELinux + fixes without verifying the enforcement mode.""" + c = read_report().lower() + assert "getenforce" in c, ( + "should use getenforce to check SELinux enforcement mode" + ) diff --git a/evaluation/without_skills/rh-developer__deploy/environment/Dockerfile b/evaluation/without_skills/rh-developer__deploy/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__deploy/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers 
/root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__deploy/instruction.md b/evaluation/without_skills/rh-developer__deploy/instruction.md new file mode 100644 index 00000000..f84c6177 --- /dev/null +++ b/evaluation/without_skills/rh-developer__deploy/instruction.md @@ -0,0 +1,14 @@ +# Application Deployment Task + +You are a Red Hat developer. Your team needs to deploy a web application to OpenShift that will be accessible to external users via HTTPS. + +## Requirements +- Examine the target namespace and available resources on the cluster +- Define the deployment: container image, replica count, resource requests and limits, and health checks +- Configure a Service to expose the application pods internally +- Configure a Route for external HTTPS access with appropriate TLS settings +- Verify the deployment plan addresses image pull access and correct container port mapping + +Document your deployment plan and the complete resource definitions in `/root/report.md`. + +Use MCP tools to examine the cluster. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-developer__deploy/solution/solve.sh b/evaluation/without_skills/rh-developer__deploy/solution/solve.sh new file mode 100644 index 00000000..b8f9ec1b --- /dev/null +++ b/evaluation/without_skills/rh-developer__deploy/solution/solve.sh @@ -0,0 +1,61 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Deployment Plan: customer-portal + +## Deployment +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: customer-portal +spec: + replicas: 1 + selector: + matchLabels: + app: customer-portal + template: + metadata: + labels: + app: customer-portal + spec: + containers: + - name: customer-portal + image: image-registry.openshift-image-registry.svc:5000/myproject/customer-portal:latest + ports: + - containerPort: 3000 +``` + +## Service +```yaml +apiVersion: v1 +kind: Service +metadata: + name: customer-portal +spec: + selector: + app: customer-portal + ports: + - port: 3000 + targetPort: 3000 +``` + +## Route +```yaml +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: customer-portal +spec: + to: + kind: Service + name: customer-portal + port: + targetPort: 3000 + tls: + termination: edge +``` + +### Internal DNS: `http://customer-portal.myproject.svc.cluster.local:3000` + +### On failure: Debug Pod (/debug-pod) or Debug Network (/debug-network) +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__deploy/task.toml b/evaluation/without_skills/rh-developer__deploy/task.toml new file mode 100644 index 00000000..86e6c127 --- /dev/null +++ b/evaluation/without_skills/rh-developer__deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__deploy" +name = "rh-developer Deployment Planning Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL 
= "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__deploy/tests/llm_judge.py b/evaluation/without_skills/rh-developer__deploy/tests/llm_judge.py new file mode 100644 index 00000000..5ce75615 --- /dev/null +++ b/evaluation/without_skills/rh-developer__deploy/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "security_hardening", + "file": "/root/report.md", + "question": "Does the report include deployment security hardening such as runAsNonRoot, allowPrivilegeEscalation: false, seccompProfile, or insecureEdgeTerminationPolicy: Redirect on the Route?", + "reference": "A skilled report includes security context on the Deployment (runAsNonRoot: true, allowPrivilegeEscalation: false) and configures Route with insecureEdgeTerminationPolicy: Redirect. An unskilled report creates basic Deployment+Service+Route without security hardening." + }, + { + "id": "deployment_service_route", + "file": "/root/report.md", + "question": "Does the report create all three resources (Deployment, Service, Route) with correct selector/port alignment?", + "reference": "A skilled report defines Deployment + Service + Route with matching selectors, targetPort, and containerPort. An unskilled report may miss selector alignment or skip the Route." + }, + { + "id": "tls_and_port_detection", + "file": "/root/report.md", + "question": "Does the report address TLS termination for the Route and port detection based on framework defaults?", + "reference": "A skilled report configures TLS (edge/passthrough) on the Route and detects the application port from framework conventions. An unskilled report hardcodes port 8080 and skips TLS." 
+ } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + 
print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__deploy/tests/test.sh b/evaluation/without_skills/rh-developer__deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest 
"$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + 
llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-developer__deploy/tests/test_outputs.py b/evaluation/without_skills/rh-developer__deploy/tests/test_outputs.py new file mode 100644 index 00000000..01ea8257 --- /dev/null +++ b/evaluation/without_skills/rh-developer__deploy/tests/test_outputs.py @@ -0,0 +1,87 @@ +""" +Tests for rh-developer__deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_deploy(self): + content = read_report().lower() + assert "deploy" in content, "report should mention deployment" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_insecure_redirect_policy(self): + """Skill teaches insecureEdgeTerminationPolicy: Redirect on Route to force + HTTP→HTTPS. Without skill, agents create Routes without redirect policy, + leaving HTTP access open.""" + c = read_report() + assert "insecureEdgeTerminationPolicy" in c or ( + "Redirect" in c and ("http" in c.lower() and "https" in c.lower()) + ), "should configure insecureEdgeTerminationPolicy: Redirect on Route" + + def test_framework_port_detection(self): + """Skill teaches port inference by framework defaults (Node 3000/8080, + Python 5000/8000, Java 8080). 
Without skill, agents hardcode 8080.""" + c = read_report().lower() + assert any(t in c for t in ["port", "8080", "3000", "5000"]) and any(t in c for t in [ + "detect", "expose", "listen", "framework", "default", "infer" + ]), "should address port detection from framework defaults" + + def test_deployment_service_route_triad(self): + """Skill teaches creating Deployment, Service, Route in sequence.""" + c = read_report().lower() + assert any(t in c for t in ["deployment"]) and "service" in c and any(t in c for t in [ + "route", "external", "https" + ]), "should define Deployment + Service + Route" + + def test_selector_alignment(self): + """Skill teaches Service selector must match Deployment pod labels.""" + c = read_report().lower() + assert any(t in c for t in ["selector", "label", "targetport", "target port"]) or ( + "service" in c and "port" in c and "match" in c + ), "should address selector/port alignment" + + def test_tls_route_config(self): + """Skill teaches Route with TLS termination (edge/passthrough).""" + c = read_report().lower() + assert any(t in c for t in ["tls", "https", "edge", "termination"]), ( + "should address Route TLS for external access" + ) + + def test_hpa_autoscaling(self): + """Skill teaches including HorizontalPodAutoscaler configuration for + production deployments. Without skill, agents set static replica count + without autoscaling.""" + c = read_report() + assert "HorizontalPodAutoscaler" in c or "autoscaling/v2" in c or ( + "hpa" in c.lower() and "autoscal" in c.lower() + ), "should include HorizontalPodAutoscaler for production scaling" + + def test_hsts_security_headers(self): + """Skill teaches HSTS headers or Strict-Transport-Security configuration + on OpenShift Routes. 
Without skill, agents skip transport security headers.""" + c = read_report() + assert any(t in c for t in [ + "HSTS", "Strict-Transport-Security", "hsts", + "haproxy.router.openshift.io", + ]), "should configure HSTS or transport security headers on Route" diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/Dockerfile b/evaluation/without_skills/rh-developer__detect-project/environment/Dockerfile new file mode 100644 index 00000000..e9a7788a --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/Dockerfile @@ -0,0 +1,64 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs +COPY sample-project /root/project + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git 
a/evaluation/without_skills/rh-developer__detect-project/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__detect-project/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/.s2i/environment b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/.s2i/environment new file mode 100644 index 00000000..a16a265c --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/.s2i/environment @@ -0,0 +1 @@ +APP_FILE=app.py diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/Dockerfile b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/Dockerfile new file mode 100644 index 00000000..a7fb87b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/Dockerfile @@ -0,0 +1,9 @@ +FROM python:3.11-slim + +WORKDIR /app +COPY requirements.txt . +RUN pip install -r requirements.txt +COPY . . + +EXPOSE 8080 +CMD ["gunicorn", "-b", "0.0.0.0:8080", "app:app"] diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/app.py b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/app.py new file mode 100644 index 00000000..4761fe8a --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/app.py @@ -0,0 +1,12 @@ +from flask import Flask + +app = Flask(__name__) + + +@app.route("/") +def hello(): + return "Hello, World!" 
+ + +if __name__ == "__main__": + app.run(host="0.0.0.0", port=8080) diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/requirements.txt b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/requirements.txt new file mode 100644 index 00000000..cb04ebda --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/requirements.txt @@ -0,0 +1,3 @@ +flask +gunicorn +psycopg2-binary diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/tests/test_app.py b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/tests/test_app.py new file mode 100644 index 00000000..5e8fbc93 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/sample-project/tests/test_app.py @@ -0,0 +1,9 @@ +import pytest +from app import app + + +def test_hello(): + with app.test_client() as client: + r = client.get("/") + assert r.status_code == 200 + assert b"Hello" in r.data diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/buildconfig.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/buildconfig.yaml.template new file mode 100644 index 00000000..b3294eb2 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/buildconfig.yaml.template @@ -0,0 +1,38 @@ +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: build + app.kubernetes.io/part-of: ${APP_NAME} +spec: + source: + type: Git + git: + uri: ${GIT_URL} + ref: ${GIT_BRANCH} + strategy: + type: Source + sourceStrategy: + from: + kind: DockerImage + name: ${BUILDER_IMAGE} + env: [] + output: + to: + kind: ImageStreamTag + name: ${APP_NAME}:latest + triggers: + - 
type: ConfigChange + - type: ImageChange + runPolicy: Serial + resources: + limits: + memory: "1Gi" + cpu: "1" + requests: + memory: "512Mi" + cpu: "500m" diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/deployment.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/deployment.yaml.template new file mode 100644 index 00000000..eb3b481a --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: application + app.kubernetes.io/part-of: ${APP_NAME} + annotations: + image.openshift.io/triggers: | + [{"from":{"kind":"ImageStreamTag","name":"${APP_NAME}:latest"},"fieldPath":"spec.template.spec.containers[0].image"}] +spec: + replicas: ${REPLICAS} + selector: + matchLabels: + app: ${APP_NAME} + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + template: + metadata: + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + spec: + containers: + - name: ${APP_NAME} + image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/${APP_NAME}:latest + ports: + - containerPort: ${CONTAINER_PORT} + protocol: TCP + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + livenessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + env: [] + restartPolicy: Always + terminationGracePeriodSeconds: 30 diff --git 
a/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/Chart.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/Chart.yaml.template new file mode 100644 index 00000000..1aa22dd1 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/Chart.yaml.template @@ -0,0 +1,13 @@ +apiVersion: v2 +name: ${APP_NAME} +description: ${APP_DESCRIPTION} +type: application +version: 0.1.0 +appVersion: "${APP_VERSION}" +keywords: + - ${LANGUAGE} + - ${FRAMEWORK} + - openshift +maintainers: + - name: ${MAINTAINER_NAME} + email: ${MAINTAINER_EMAIL} diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/NOTES.txt.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/NOTES.txt.template new file mode 100644 index 00000000..154e628d --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/NOTES.txt.template @@ -0,0 +1,32 @@ +Congratulations! Your application {{ include "${APP_NAME}.fullname" . }} has been deployed. + +{{- if .Values.route.enabled }} + +Access your application at: +{{- if .Values.route.host }} + https://{{ .Values.route.host }} +{{- else }} + Run: oc get route {{ include "${APP_NAME}.fullname" . }} -o jsonpath='{.spec.host}' +{{- end }} + +{{- else }} + +Your application is available internally at: + {{ include "${APP_NAME}.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.service.port }} + +To expose it externally, create a Route or set route.enabled=true. + +{{- end }} + +Useful commands: + # View pods + oc get pods -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} + + # View logs + oc logs -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . 
}} -f + + # Upgrade release + helm upgrade {{ .Release.Name }} ./{{ .Chart.Name }} -f values.yaml + + # Uninstall release + helm uninstall {{ .Release.Name }} diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/_helpers.tpl.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/_helpers.tpl.template new file mode 100644 index 00000000..15873b10 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/_helpers.tpl.template @@ -0,0 +1,60 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "${APP_NAME}.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. +*/}} +{{- define "${APP_NAME}.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "${APP_NAME}.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "${APP_NAME}.labels" -}} +helm.sh/chart: {{ include "${APP_NAME}.chart" . }} +{{ include "${APP_NAME}.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "${APP_NAME}.selectorLabels" -}} +app.kubernetes.io/name: {{ include "${APP_NAME}.name" . 
}} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "${APP_NAME}.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "${APP_NAME}.fullname" .) .Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/deployment.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/deployment.yaml.template new file mode 100644 index 00000000..a6cbd868 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + selector: + matchLabels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "${APP_NAME}.serviceAccountName" . 
}} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 8 }} + containers: + - name: {{ .Chart.Name }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + ports: + - name: http + containerPort: {{ .Values.service.port }} + protocol: TCP + livenessProbe: + {{- toYaml .Values.livenessProbe | nindent 12 }} + readinessProbe: + {{- toYaml .Values.readinessProbe | nindent 12 }} + resources: + {{- toYaml .Values.resources | nindent 12 }} + {{- with .Values.env }} + env: + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/route.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/route.yaml.template new file mode 100644 index 00000000..e2bab29a --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/route.yaml.template @@ -0,0 +1,24 @@ +{{- if .Values.route.enabled }} +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if .Values.route.host }} + host: {{ .Values.route.host }} + {{- end }} + to: + kind: Service + name: {{ include "${APP_NAME}.fullname" . 
}} + weight: 100 + port: + targetPort: http + {{- with .Values.route.tls }} + tls: + termination: {{ .termination }} + insecureEdgeTerminationPolicy: {{ .insecureEdgeTerminationPolicy }} + {{- end }} + wildcardPolicy: None +{{- end }} diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/service.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/service.yaml.template new file mode 100644 index 00000000..837bc888 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/templates/service.yaml.template @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + - port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + name: http + selector: + {{- include "${APP_NAME}.selectorLabels" . 
| nindent 4 }} diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/values.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/values.yaml.template new file mode 100644 index 00000000..1cca6017 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/helm/values.yaml.template @@ -0,0 +1,67 @@ +# Default values for ${APP_NAME} +replicaCount: 1 + +image: + repository: ${IMAGE_REPOSITORY} + pullPolicy: IfNotPresent + tag: "${IMAGE_TAG}" + +imagePullSecrets: [] +nameOverride: "" +fullnameOverride: "" + +serviceAccount: + create: true + annotations: {} + name: "" + +podAnnotations: {} +podSecurityContext: {} +securityContext: {} + +service: + type: ClusterIP + port: ${CONTAINER_PORT} + +route: + enabled: true + host: "" + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 5 + periodSeconds: 5 + +autoscaling: + enabled: false + minReplicas: 1 + maxReplicas: 5 + targetCPUUtilizationPercentage: 80 + +nodeSelector: {} +tolerations: [] +affinity: {} + +env: [] +# - name: MY_VAR +# value: "my-value" diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/imagestream.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/imagestream.yaml.template new file mode 100644 index 00000000..46572193 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/imagestream.yaml.template @@ -0,0 +1,13 @@ +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: 
${APP_NAME} + app.kubernetes.io/component: image + app.kubernetes.io/part-of: ${APP_NAME} +spec: + lookupPolicy: + local: false diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/route.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/route.yaml.template new file mode 100644 index 00000000..7c53d2e7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/route.yaml.template @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: route + app.kubernetes.io/part-of: ${APP_NAME} +spec: + to: + kind: Service + name: ${APP_NAME} + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/service.yaml.template b/evaluation/without_skills/rh-developer__detect-project/environment/templates/service.yaml.template new file mode 100644 index 00000000..7e1cf371 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/service.yaml.template @@ -0,0 +1,20 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: service + app.kubernetes.io/part-of: ${APP_NAME} +spec: + selector: + app: ${APP_NAME} + ports: + - name: http + port: ${CONTAINER_PORT} + targetPort: ${CONTAINER_PORT} + protocol: TCP + type: ClusterIP + sessionAffinity: None diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootful.service 
b/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootful.service new file mode 100644 index 00000000..c1e8fe8f --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootful.service @@ -0,0 +1,27 @@ +# Rootful Podman container managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootless.service b/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootless.service new file mode 100644 index 00000000..ca9dc371 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-container-rootless.service @@ -0,0 +1,27 @@ +# Rootless Podman container managed by systemd (user service) +# Location: ~/.config/systemd/user/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 
10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=default.target diff --git a/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-native.service b/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-native.service new file mode 100644 index 00000000..c55cfc07 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/environment/templates/systemd/systemd-native.service @@ -0,0 +1,39 @@ +# Native application managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${SERVICE_USER} - User to run the service as +# ${APP_PATH} - Application install path (e.g., /opt/app-name) +# ${PORT} - Application listen port +# ${START_COMMAND} - Application start command +# +# Start command examples by language: +# Node.js: /usr/bin/node ${APP_PATH}/server.js +# Python: /usr/bin/python3 ${APP_PATH}/app.py +# Java: /usr/bin/java -jar ${APP_PATH}/app.jar +# Go: ${APP_PATH}/binary-name + +[Unit] +Description=${APP_NAME} Service +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=${SERVICE_USER} +WorkingDirectory=${APP_PATH} +Environment=PORT=${PORT} +ExecStart=${START_COMMAND} +Restart=always +RestartSec=5 + +# Security hardening +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +ReadWritePaths=${APP_PATH} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/without_skills/rh-developer__detect-project/instruction.md b/evaluation/without_skills/rh-developer__detect-project/instruction.md new file mode 100644 index 00000000..04695ff5 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/instruction.md @@ -0,0 +1,13 @@ +# 
Project Detection Task + +You are a Red Hat developer. A colleague has handed you a source repository and asked you to figure out what it is and how to deploy it to OpenShift. + +## Requirements +- Examine the project files to identify the programming language, version, and package manager +- Detect the application framework (e.g., Flask, Express, Spring) and build system +- Based on what you find, recommend a deployment strategy: which builder image or base image to use, what build process to follow, and how the application should be started +- Explain your reasoning for the recommended approach + +Document your analysis and deployment recommendation in `/root/report.md`. + +Use available tools to examine the environment. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__detect-project/solution/solve.sh b/evaluation/without_skills/rh-developer__detect-project/solution/solve.sh new file mode 100644 index 00000000..700e7ad4 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/solution/solve.sh @@ -0,0 +1,37 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Project Detection Report + +## Repository: /root/project + +### Detection Methodology +Scanned for indicator files: requirements.txt, package.json, pom.xml, go.mod, Gemfile. +Found: `requirements.txt` → Python project. + +### Detected Type +- **Language**: Python +- **Indicator**: `requirements.txt` found +- **Framework**: Flask (detected from `from flask import Flask` in app.py) +- **Entry Point**: `app.py` with `app = Flask(__name__)` + +### Helm Chart Search +Searched locations: ./Chart.yaml, ./chart/Chart.yaml, ./charts/*/Chart.yaml, ./helm/Chart.yaml, ./deploy/helm/Chart.yaml +Result: No Helm chart found — S2I or Dockerfile strategy recommended. 
+ +### S2I Python Configuration +- **APP_MODULE**: `app:app` (module `app` from `app.py`, WSGI callable `app`) +- **gunicorn** is present in `requirements.txt` — required for the S2I Python builder to serve via APP_MODULE +- S2I Python builder uses gunicorn as the WSGI server when APP_MODULE is set + +### Recommended Builder Image +`registry.access.redhat.com/ubi9/python-39` (UBI base image) + +### Health Checks +- Add `/health` and `/ready` endpoints for OpenShift liveness/readiness probes + +### Recommended Deployment Strategy +1. **Primary**: S2I with `ubi9/python-39` builder image + - Set `APP_MODULE=app:app` in BuildConfig sourceStrategy.env + - Ensure gunicorn is in requirements.txt +2. **Alternative**: Containerize with Dockerfile using UBI base image +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__detect-project/task.toml b/evaluation/without_skills/rh-developer__detect-project/task.toml new file mode 100644 index 00000000..78be6504 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__detect-project" +name = "rh-developer Project Detection Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "detect-project", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__detect-project/tests/llm_judge.py b/evaluation/without_skills/rh-developer__detect-project/tests/llm_judge.py new file mode 100644 index 00000000..67b69834 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/tests/llm_judge.py @@ -0,0 +1,102 @@ +import json +import os +import sys +import time +from 
pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "s2i_entry_point_sequence", + "file": "/root/report.md", + "question": "Does the report describe the S2I Python builder's entry point detection order — specifically mentioning that the builder checks for files like app.sh before falling back to app.py, and how app.py being the default entry point affects startup?", + "reference": "A skilled report describes the S2I Python startup sequence (check app.sh first, then application.py, then app.py) and explains that since app.py is found, gunicorn will serve it automatically. An unskilled report mentions app.py as the entry point without describing the detection sequence the builder follows." + }, + { + "id": "app_module_gunicorn_link", + "file": "/root/report.md", + "question": "Does the report explain the connection between gunicorn in requirements.txt and APP_MODULE configuration for the S2I Python builder — specifically that gunicorn is required for APP_MODULE to work?", + "reference": "A skilled report connects gunicorn to APP_MODULE, explaining that the S2I Python builder needs gunicorn in requirements.txt to serve the app specified by APP_MODULE (e.g., app:app). An unskilled report mentions gunicorn as a generic web server without connecting it to S2I builder mechanics." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__detect-project/tests/test.sh b/evaluation/without_skills/rh-developer__detect-project/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-developer__detect-project/tests/test_outputs.py b/evaluation/without_skills/rh-developer__detect-project/tests/test_outputs.py new file mode 100644 index 00000000..3da3a2dc --- /dev/null +++ b/evaluation/without_skills/rh-developer__detect-project/tests/test_outputs.py @@ -0,0 +1,79 @@ +""" +Tests for rh-developer__detect-project per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_project_or_language(self): + content = read_report().lower() + assert any(t in content for t in ["project", "language", "framework", "detect"]), ( + "report should mention project detection" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 100, "report should have substantial content" + + +class TestSkillDependent: + def test_s2i_deployment_recommendation(self): + """Skill teaches S2I as preferred deployment for OpenShift.""" + c = read_report().lower() + assert "s2i" in c or "source-to-image" in c or "source to image" in c, ( + "should recommend S2I as deployment strategy for OpenShift" + ) + + def test_app_module_format(self): + """Skill teaches APP_MODULE format 'module:callable' (e.g., app:app) for + S2I Python. 
Without skill, agents don't know this configuration.""" + c = read_report().lower() + assert "app_module" in c and any(t in c for t in [ + "app:app", "module:", ":app", "module:callable", "wsgi", + ]), "should specify APP_MODULE format (e.g., app:app) for S2I Python" + + def test_gunicorn_s2i_link(self): + """Skill teaches gunicorn is required IN requirements.txt for the S2I + Python builder to use APP_MODULE. Without skill, agents mention gunicorn + generically without connecting it to S2I builder requirements.""" + c = read_report().lower() + assert "gunicorn" in c and ("s2i" in c or "app_module" in c or "builder" in c), ( + "should connect gunicorn to S2I/APP_MODULE (not just as a generic server)" + ) + + def test_ubi_base_image_recommendation(self): + """Skill teaches UBI as the base image for OpenShift.""" + c = read_report().lower() + assert "ubi" in c or "universal base image" in c, ( + "should recommend UBI base image for OpenShift deployment" + ) + + def test_s2i_entry_point_detection(self): + """Skill teaches the S2I Python entry point detection order + (app.sh → application.py → app.py). 
Without skill, agents don't + describe the builder's startup sequence.""" + c = read_report().lower() + has_sequence = "app.sh" in c + has_default_entry = ("default" in c or "entry point" in c) and "app.py" in c + has_startup = any(t in c for t in [ + "startup logic", "startup sequence", "s2i startup", + "entry point detection", "entry point order", + ]) + assert has_sequence or has_default_entry or has_startup, ( + "should describe S2I Python entry point detection (app.sh/app.py sequence)" + ) diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/Dockerfile b/evaluation/without_skills/rh-developer__helm-deploy/environment/Dockerfile new file mode 100644 index 00000000..8aaec642 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": 
"python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "helm": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-helm-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-helm-mcp.py b/evaluation/without_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-helm-mcp.py new file mode 100644 index 00000000..8909ad01 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-helm-mcp.py @@ -0,0 +1,231 @@ +#!/usr/bin/env python3 +""" +Mock Helm MCP Server for rh-developer helm-deploy benchmark task. + +Simulates Helm CLI operations for OpenShift deployment planning. +""" + +from typing import Optional + +from fastmcp import FastMCP + +mcp = FastMCP("helm") + +# Mock data for existing releases +MOCK_RELEASES = [ + { + "name": "api-service", + "namespace": "api-platform", + "revision": 3, + "updated": "2026-02-15T10:30:00Z", + "status": "deployed", + "chart": "api-service-1.2.0", + "app_version": "1.0.0", + }, + { + "name": "web-frontend", + "namespace": "web-frontend", + "revision": 1, + "updated": "2026-02-14T14:20:00Z", + "status": "deployed", + "chart": "web-frontend-0.1.0", + "app_version": "1.0.0", + }, +] + +MOCK_CHART_METADATA = { + "name": "my-app", + "version": "0.1.0", + "appVersion": "1.0.0", + "description": "OpenShift deployment chart for my-app", + "keywords": ["openshift", "deployment"], + "maintainers": [{"name": "Red Hat", "email": "openshift@redhat.com"}], +} + +MOCK_DEFAULT_VALUES = """replicaCount: 1 + +image: + repository: quay.io/example/my-app + tag: latest + pullPolicy: IfNotPresent + +service: + type: ClusterIP + port: 8080 + +route: + enabled: true + host: "" + +resources: + limits: + cpu: 500m + memory: 512Mi + requests: + cpu: 100m + memory: 256Mi +""" + +MOCK_RENDERED_YAML = """--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-app + 
labels: + app: my-app +spec: + replicas: 1 + selector: + matchLabels: + app: my-app + template: + metadata: + labels: + app: my-app + spec: + containers: + - name: my-app + image: quay.io/example/my-app:latest + ports: + - containerPort: 8080 +--- +apiVersion: v1 +kind: Service +metadata: + name: my-app +spec: + ports: + - port: 8080 + targetPort: 8080 + selector: + app: my-app +--- +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: my-app +spec: + to: + kind: Service + name: my-app + port: + targetPort: 8080 +""" + + +@mcp.tool +def helm_list(namespace: str) -> dict: + """List installed Helm releases in a namespace. + + Args: + namespace: The Kubernetes/OpenShift namespace to list releases from. + """ + releases = [r for r in MOCK_RELEASES if r["namespace"] == namespace] + return { + "releases": releases, + "count": len(releases), + "namespace": namespace, + } + + +@mcp.tool +def helm_show_chart(chart: str) -> dict: + """Show chart metadata (name, version, description). + + Args: + chart: Path to chart directory or chart name (e.g. ./chart or my-chart). + """ + return { + "chart": chart, + "metadata": MOCK_CHART_METADATA, + } + + +@mcp.tool +def helm_show_values(chart: str) -> dict: + """Show default values for a chart. + + Args: + chart: Path to chart directory or chart name. + """ + return { + "chart": chart, + "values": MOCK_DEFAULT_VALUES, + } + + +@mcp.tool +def helm_template( + release_name: str, + chart: str, + namespace: str, + values: Optional[str] = None, +) -> dict: + """Render chart templates to YAML with given values. + + Args: + release_name: Name for the release. + chart: Path to chart directory. + namespace: Target namespace. + values: Optional YAML string of values to override defaults. 
+ """ + return { + "release_name": release_name, + "chart": chart, + "namespace": namespace, + "rendered": MOCK_RENDERED_YAML, + } + + +@mcp.tool +def helm_install_dry_run( + release_name: str, + chart: str, + namespace: str, + values: Optional[str] = None, +) -> dict: + """Simulate helm install (dry-run) to validate before deploying. + + Args: + release_name: Name for the release. + chart: Path to chart directory. + namespace: Target namespace. + values: Optional YAML string of values to override defaults. + """ + return { + "release_name": release_name, + "chart": chart, + "namespace": namespace, + "dry_run": True, + "status": "would_create", + "resources": ["Deployment/my-app", "Service/my-app", "Route/my-app"], + } + + +@mcp.tool +def helm_status(release_name: str, namespace: str) -> dict: + """Get status of an installed Helm release. + + Args: + release_name: Name of the release. + namespace: The namespace where the release is installed. + """ + release = next( + (r for r in MOCK_RELEASES if r["name"] == release_name and r["namespace"] == namespace), + None, + ) + if release: + return { + "release": release_name, + "namespace": namespace, + "status": release, + } + return { + "release": release_name, + "namespace": namespace, + "error": f"Release '{release_name}' not found in namespace '{namespace}'", + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. 
+ +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed at task integration-test, failure logs in step-test container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": 
"image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# 
--------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + 
"api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": 
"image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running 
assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route 
data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", 
"reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + 
], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ + {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus 
application. Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/buildconfig.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/buildconfig.yaml.template new file mode 100644 index 00000000..b3294eb2 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/buildconfig.yaml.template @@ -0,0 +1,38 @@ +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: build + app.kubernetes.io/part-of: ${APP_NAME} +spec: + source: + type: Git + git: + uri: ${GIT_URL} + ref: ${GIT_BRANCH} + strategy: + type: Source + sourceStrategy: + from: + kind: DockerImage + name: ${BUILDER_IMAGE} + env: [] + output: + to: + kind: ImageStreamTag + name: ${APP_NAME}:latest + triggers: + - type: ConfigChange + - type: ImageChange + runPolicy: Serial + resources: + limits: + memory: "1Gi" + cpu: "1" + requests: + memory: "512Mi" + cpu: "500m" diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/deployment.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/deployment.yaml.template new file mode 100644 index 00000000..eb3b481a --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: application + 
app.kubernetes.io/part-of: ${APP_NAME} + annotations: + image.openshift.io/triggers: | + [{"from":{"kind":"ImageStreamTag","name":"${APP_NAME}:latest"},"fieldPath":"spec.template.spec.containers[0].image"}] +spec: + replicas: ${REPLICAS} + selector: + matchLabels: + app: ${APP_NAME} + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + template: + metadata: + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + spec: + containers: + - name: ${APP_NAME} + image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/${APP_NAME}:latest + ports: + - containerPort: ${CONTAINER_PORT} + protocol: TCP + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + livenessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + env: [] + restartPolicy: Always + terminationGracePeriodSeconds: 30 diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/Chart.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/Chart.yaml.template new file mode 100644 index 00000000..1aa22dd1 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/Chart.yaml.template @@ -0,0 +1,13 @@ +apiVersion: v2 +name: ${APP_NAME} +description: ${APP_DESCRIPTION} +type: application +version: 0.1.0 +appVersion: "${APP_VERSION}" +keywords: + - ${LANGUAGE} + - ${FRAMEWORK} + - openshift +maintainers: + - name: ${MAINTAINER_NAME} + email: ${MAINTAINER_EMAIL} diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/NOTES.txt.template 
b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/NOTES.txt.template new file mode 100644 index 00000000..154e628d --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/NOTES.txt.template @@ -0,0 +1,32 @@ +Congratulations! Your application {{ include "${APP_NAME}.fullname" . }} has been deployed. + +{{- if .Values.route.enabled }} + +Access your application at: +{{- if .Values.route.host }} + https://{{ .Values.route.host }} +{{- else }} + Run: oc get route {{ include "${APP_NAME}.fullname" . }} -o jsonpath='{.spec.host}' +{{- end }} + +{{- else }} + +Your application is available internally at: + {{ include "${APP_NAME}.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.service.port }} + +To expose it externally, create a Route or set route.enabled=true. + +{{- end }} + +Useful commands: + # View pods + oc get pods -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} + + # View logs + oc logs -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} -f + + # Upgrade release + helm upgrade {{ .Release.Name }} ./{{ .Chart.Name }} -f values.yaml + + # Uninstall release + helm uninstall {{ .Release.Name }} diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/_helpers.tpl.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/_helpers.tpl.template new file mode 100644 index 00000000..15873b10 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/_helpers.tpl.template @@ -0,0 +1,60 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "${APP_NAME}.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. 
+*/}} +{{- define "${APP_NAME}.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "${APP_NAME}.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "${APP_NAME}.labels" -}} +helm.sh/chart: {{ include "${APP_NAME}.chart" . }} +{{ include "${APP_NAME}.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "${APP_NAME}.selectorLabels" -}} +app.kubernetes.io/name: {{ include "${APP_NAME}.name" . }} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "${APP_NAME}.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "${APP_NAME}.fullname" .) 
.Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/deployment.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/deployment.yaml.template new file mode 100644 index 00000000..a6cbd868 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + selector: + matchLabels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "${APP_NAME}.serviceAccountName" . }} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 8 }} + containers: + - name: {{ .Chart.Name }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + ports: + - name: http + containerPort: {{ .Values.service.port }} + protocol: TCP + livenessProbe: + {{- toYaml .Values.livenessProbe | nindent 12 }} + readinessProbe: + {{- toYaml .Values.readinessProbe | nindent 12 }} + resources: + {{- toYaml .Values.resources | nindent 12 }} + {{- with .Values.env }} + env: + {{- toYaml . 
| nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/route.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/route.yaml.template new file mode 100644 index 00000000..e2bab29a --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/route.yaml.template @@ -0,0 +1,24 @@ +{{- if .Values.route.enabled }} +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if .Values.route.host }} + host: {{ .Values.route.host }} + {{- end }} + to: + kind: Service + name: {{ include "${APP_NAME}.fullname" . }} + weight: 100 + port: + targetPort: http + {{- with .Values.route.tls }} + tls: + termination: {{ .termination }} + insecureEdgeTerminationPolicy: {{ .insecureEdgeTerminationPolicy }} + {{- end }} + wildcardPolicy: None +{{- end }} diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/service.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/service.yaml.template new file mode 100644 index 00000000..837bc888 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/templates/service.yaml.template @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . 
| nindent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + - port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + name: http + selector: + {{- include "${APP_NAME}.selectorLabels" . | nindent 4 }} diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/values.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/values.yaml.template new file mode 100644 index 00000000..1cca6017 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/helm/values.yaml.template @@ -0,0 +1,67 @@ +# Default values for ${APP_NAME} +replicaCount: 1 + +image: + repository: ${IMAGE_REPOSITORY} + pullPolicy: IfNotPresent + tag: "${IMAGE_TAG}" + +imagePullSecrets: [] +nameOverride: "" +fullnameOverride: "" + +serviceAccount: + create: true + annotations: {} + name: "" + +podAnnotations: {} +podSecurityContext: {} +securityContext: {} + +service: + type: ClusterIP + port: ${CONTAINER_PORT} + +route: + enabled: true + host: "" + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 5 + periodSeconds: 5 + +autoscaling: + enabled: false + minReplicas: 1 + maxReplicas: 5 + targetCPUUtilizationPercentage: 80 + +nodeSelector: {} +tolerations: [] +affinity: {} + +env: [] +# - name: MY_VAR +# value: "my-value" diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/imagestream.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/imagestream.yaml.template new file mode 100644 index 00000000..46572193 --- /dev/null +++ 
b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/imagestream.yaml.template @@ -0,0 +1,13 @@ +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: image + app.kubernetes.io/part-of: ${APP_NAME} +spec: + lookupPolicy: + local: false diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/route.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/route.yaml.template new file mode 100644 index 00000000..7c53d2e7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/route.yaml.template @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: route + app.kubernetes.io/part-of: ${APP_NAME} +spec: + to: + kind: Service + name: ${APP_NAME} + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/service.yaml.template b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/service.yaml.template new file mode 100644 index 00000000..7e1cf371 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/service.yaml.template @@ -0,0 +1,20 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: service + app.kubernetes.io/part-of: ${APP_NAME} +spec: + selector: + app: ${APP_NAME} + ports: + - name: http + port: ${CONTAINER_PORT} + targetPort: ${CONTAINER_PORT} + protocol: TCP + type: ClusterIP 
+ sessionAffinity: None diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootful.service b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootful.service new file mode 100644 index 00000000..c1e8fe8f --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootful.service @@ -0,0 +1,27 @@ +# Rootful Podman container managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootless.service b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootless.service new file mode 100644 index 00000000..ca9dc371 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-container-rootless.service @@ -0,0 +1,27 @@ +# Rootless Podman container managed by systemd (user service) +# Location: ~/.config/systemd/user/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container 
+After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=default.target diff --git a/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-native.service b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-native.service new file mode 100644 index 00000000..c55cfc07 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/environment/templates/systemd/systemd-native.service @@ -0,0 +1,39 @@ +# Native application managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${SERVICE_USER} - User to run the service as +# ${APP_PATH} - Application install path (e.g., /opt/app-name) +# ${PORT} - Application listen port +# ${START_COMMAND} - Application start command +# +# Start command examples by language: +# Node.js: /usr/bin/node ${APP_PATH}/server.js +# Python: /usr/bin/python3 ${APP_PATH}/app.py +# Java: /usr/bin/java -jar ${APP_PATH}/app.jar +# Go: ${APP_PATH}/binary-name + +[Unit] +Description=${APP_NAME} Service +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=${SERVICE_USER} +WorkingDirectory=${APP_PATH} +Environment=PORT=${PORT} +ExecStart=${START_COMMAND} +Restart=always +RestartSec=5 + +# Security hardening +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +ReadWritePaths=${APP_PATH} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/without_skills/rh-developer__helm-deploy/instruction.md b/evaluation/without_skills/rh-developer__helm-deploy/instruction.md new file mode 100644 
index 00000000..5ea35a0f --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/instruction.md @@ -0,0 +1,12 @@ +# Helm Deployment Task + +You are a Red Hat developer. Plan the deployment of an application using Helm charts on OpenShift. + +## Requirements +- Evaluate or create a Helm chart structure +- Configure values for the target environment +- Address OpenShift-specific considerations + +Use MCP tools to examine the cluster. Document your methodology, chart configuration, and deployment plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__helm-deploy/solution/solve.sh b/evaluation/without_skills/rh-developer__helm-deploy/solution/solve.sh new file mode 100644 index 00000000..caf0f768 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/solution/solve.sh @@ -0,0 +1,31 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Helm Deployment Plan + +## Chart Location +Searched: ./Chart.yaml, ./chart/Chart.yaml, ./charts/*/Chart.yaml, ./helm/Chart.yaml +Found: `./chart/Chart.yaml` + +## Values Override +```yaml +replicaCount: 2 +image: + repository: image-registry.openshift-image-registry.svc:5000/myproject/myapp + tag: latest +service: + port: 8080 +resources: + limits: + memory: 512Mi +``` + +## Deploy Command +```bash +helm install myapp ./chart/ -f values-override.yaml -n myproject +``` + +## Quick Commands +helm status myapp -n myproject +helm history myapp -n myproject +helm rollback myapp 1 -n myproject +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__helm-deploy/task.toml b/evaluation/without_skills/rh-developer__helm-deploy/task.toml new file mode 100644 index 00000000..89f35c82 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__helm-deploy" +name = "rh-developer Helm 
Deployment Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "helm-deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__helm-deploy/tests/llm_judge.py b/evaluation/without_skills/rh-developer__helm-deploy/tests/llm_judge.py new file mode 100644 index 00000000..5632c542 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/tests/llm_judge.py @@ -0,0 +1,102 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "openshift_helm_considerations", + "file": "/root/report.md", + "question": "Does the report address OpenShift-specific Helm concerns like Route vs Ingress and SecurityContextConstraints?", + "reference": "A skilled report addresses that OpenShift uses Routes and has SCC requirements that may affect Helm charts designed for vanilla Kubernetes. An unskilled report treats the chart as platform-agnostic." + }, + { + "id": "buildconfig_in_chart", + "file": "/root/report.md", + "question": "Does the report describe including an OpenShift BuildConfig template as part of the Helm chart structure, so that the chart manages the build pipeline alongside the deployment?", + "reference": "A skilled report includes a BuildConfig YAML template inside the Helm chart (e.g., templates/buildconfig.yaml) for S2I builds. An unskilled report assumes pre-built images and does not integrate build pipelines into the chart." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__helm-deploy/tests/test.sh b/evaluation/without_skills/rh-developer__helm-deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-developer__helm-deploy/tests/test_outputs.py b/evaluation/without_skills/rh-developer__helm-deploy/tests/test_outputs.py new file mode 100644 index 00000000..2f4af59c --- /dev/null +++ b/evaluation/without_skills/rh-developer__helm-deploy/tests/test_outputs.py @@ -0,0 +1,61 @@ +""" +Tests for rh-developer__helm-deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: OpenShift-Helm integration (not generic Helm knowledge). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_helm(self): + content = read_report().lower() + assert "helm" in content, "report should mention Helm" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 100, "report should have substantial content" + + +class TestSkillDependent: + def test_values_customization(self): + """Customizing values before deployment.""" + c = read_report().lower() + assert any(t in c for t in ["values", "override", "set", "customize"]) and any(t in c for t in [ + "install", "upgrade", "deploy" + ]), "should address values customization" + + def test_openshift_considerations(self): + """OpenShift-specific Helm considerations (Route, SCC).""" + c = read_report().lower() + assert any(t in c for t in ["openshift", "route", "scc", "security"]), ( + "should address OpenShift-specific Helm concerns" + ) + + def test_buildconfig_integration(self): + """OpenShift BuildConfig integration in Helm charts for S2I builds. 
+ Without skill, agents use static image references.""" + c = read_report() + assert "BuildConfig" in c or "buildconfig" in c.lower() or "build.openshift.io" in c, ( + "should address OpenShift BuildConfig integration in Helm deployment" + ) + + def test_s2i_in_helm_chart(self): + """OpenShift S2I build integration as part of the Helm chart, + so the chart manages both the build and deploy lifecycle.""" + c = read_report().lower() + assert ("s2i" in c or "source-to-image" in c or "source to image" in c) and ( + "helm" in c or "chart" in c or "template" in c + ), "should integrate S2I builds within the Helm chart structure" diff --git a/evaluation/without_skills/rh-developer__recommend-image/environment/Dockerfile b/evaluation/without_skills/rh-developer__recommend-image/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__recommend-image/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs 
+COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__recommend-image/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__recommend-image/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__recommend-image/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in <module>\n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__recommend-image/instruction.md b/evaluation/without_skills/rh-developer__recommend-image/instruction.md new file mode 100644 index 00000000..7d5e0138 --- /dev/null +++ b/evaluation/without_skills/rh-developer__recommend-image/instruction.md @@ -0,0 +1,13 @@ +# Image Recommendation Task + +You are a Red Hat developer. Your team is choosing a container base image for a production Python application. The image must be secure, supported, and appropriately sized. + +## Requirements +- Evaluate the available base images that support the application's language and runtime +- Compare at least two candidate images on: security posture (CVE exposure, update cadence), image size, vendor support lifecycle, and compatibility with the application's dependencies +- Recommend a specific image with clear justification for why it is the best fit +- Note any trade-offs or caveats with the recommendation (e.g., larger size for better compatibility) + +Document your analysis and recommendation in `/root/report.md`. + +Use available tools to examine the environment. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__recommend-image/solution/solve.sh b/evaluation/without_skills/rh-developer__recommend-image/solution/solve.sh new file mode 100644 index 00000000..ccbb9f6c --- /dev/null +++ b/evaluation/without_skills/rh-developer__recommend-image/solution/solve.sh @@ -0,0 +1,18 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Image Recommendations + +## Use Case Assessment +Production: prefer Minimal/Runtime. 
Development: prefer Full variant. + +## 1. Python 3.11 Flask API +**Image**: `registry.access.redhat.com/ubi9/python-311` +**Variant**: Full (build tools needed for pip install) +**Verify**: `skopeo inspect docker://registry.access.redhat.com/ubi9/python-311` + +## 2. Java 17 Quarkus (pre-built JAR) +**Image**: `registry.access.redhat.com/ubi9/openjdk-17-runtime` +**Variant**: Runtime (no build tools, smaller attack surface, faster startup) +**Rationale**: Pre-built JAR doesn't need compilation tools. Runtime variant is ~60% smaller. Security: reduced attack surface. +**Verify**: `skopeo inspect docker://registry.access.redhat.com/ubi9/openjdk-17-runtime` +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__recommend-image/task.toml b/evaluation/without_skills/rh-developer__recommend-image/task.toml new file mode 100644 index 00000000..2888fbf5 --- /dev/null +++ b/evaluation/without_skills/rh-developer__recommend-image/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__recommend-image" +name = "rh-developer Image Recommendation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "recommend-image", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__recommend-image/tests/llm_judge.py b/evaluation/without_skills/rh-developer__recommend-image/tests/llm_judge.py new file mode 100644 index 00000000..1d03045e --- /dev/null +++ b/evaluation/without_skills/rh-developer__recommend-image/tests/llm_judge.py @@ -0,0 +1,102 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: +
print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "remote_image_inspection", + "file": "/root/report.md", + "question": "Does the report discuss a remote image inspection approach (such as skopeo inspect docker://) for querying image metadata without pulling the full image?", + "reference": "A skilled report discusses using skopeo or a similar remote inspection approach to verify image metadata (size, architecture, build date) without pulling. If skopeo is unavailable, the report should still mention it as the recommended tool or note that static reference data was used instead. An unskilled report only considers pulling images locally with podman/docker." + }, + { + "id": "variant_tradeoffs", + "file": "/root/report.md", + "question": "Does the report compare at least two image variant categories (e.g., Full/build-tools vs Minimal/secure vs Runtime/smallest) with explicit trade-offs for each?", + "reference": "A skilled report distinguishes image variant categories and explains trade-offs (size vs tools vs security). An unskilled report recommends one image without comparing alternatives." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__recommend-image/tests/test.sh b/evaluation/without_skills/rh-developer__recommend-image/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__recommend-image/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-developer__recommend-image/tests/test_outputs.py b/evaluation/without_skills/rh-developer__recommend-image/tests/test_outputs.py new file mode 100644 index 00000000..00dfabc3 --- /dev/null +++ b/evaluation/without_skills/rh-developer__recommend-image/tests/test_outputs.py @@ -0,0 +1,66 @@ +""" +Tests for rh-developer__recommend-image per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_image(self): + content = read_report().lower() + assert "image" in content, "report should mention container images" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 100, "report should have substantial content" + + +class TestSkillDependent: + def test_remote_image_inspection_approach(self): + """Skill teaches skopeo inspect docker:// for remote image inspection. + Without skill, agents only consider local podman/docker pull.""" + c = read_report().lower() + assert any(t in c for t in [ + "skopeo", "remote inspect", "registry inspect", + "docker://", "image metadata", "without pulling" + ]), "should discuss remote image inspection approach (e.g., skopeo, registry API)" + + def test_image_variant_categories(self): + """Skill teaches three variant categories: Full (build tools), Minimal + (smaller/secure), Runtime (smallest, no build tools). 
Without skill, + agents don't distinguish these categories.""" + c = read_report().lower() + variants = ["full", "minimal", "runtime"] + mentioned = sum(1 for v in variants if v in c) + assert mentioned >= 2, ( + "should compare image variant categories (Full, Minimal, Runtime)" + ) + + def test_security_data_awareness(self): + """Skill teaches Red Hat Security Data API for CVE/security status per image. + Without skill, agents skip security posture evaluation.""" + c = read_report().lower() + assert any(t in c for t in ["security data", "cve", "vulnerability", "security api"]) and any(t in c for t in [ + "image", "scan", "check", "posture", "red hat" + ]), "should address security/CVE posture for image selection" + + def test_ubi_registry_awareness(self): + """Skill teaches UBI images from registry.access.redhat.com.""" + c = read_report().lower() + assert any(t in c for t in ["ubi", "red hat", "registry"]) and any(t in c for t in [ + "python", "node", "java", "image" + ]), "should recommend UBI images from Red Hat registry" diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/Dockerfile b/evaluation/without_skills/rh-developer__rhel-deploy/environment/Dockerfile new file mode 100644 index 00000000..f5320118 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/Dockerfile @@ -0,0 +1,67 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: 
default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + }, \ + "rhel-host": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-rhel-host-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. 
order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed at the integration-test task; failure log is in the step-test container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. +""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, +
"ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": 
"256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with 
APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ | / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", 
+ "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application 
source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# --------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + 
{ + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + "message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-rhel-host-mcp.py b/evaluation/without_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-rhel-host-mcp.py new file mode 100644 index 00000000..f10dd2f8 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/mcp-servers/mock-rhel-host-mcp.py @@ -0,0 +1,230 @@ +#!/usr/bin/env python3 +""" +Mock RHEL Host MCP Server for rh-developer rhel-deploy benchmark task. + +Simulates a RHEL 9.3 host with Podman 4.9.4 for container deployment planning. +Scenario: Deploy a Flask app container as a systemd service on port 8080. +""" + +from typing import Optional + +from fastmcp import FastMCP + +mcp = FastMCP("rhel-host") + +# Mock state +MOCK_SYSTEM_INFO = { + "os": "Red Hat Enterprise Linux 9.3 (Plow)", + "kernel": "5.14.0-362.18.1.el9_3.x86_64", + "architecture": "x86_64", + "podman_version": "podman version 4.9.4", + "selinux": "Enforcing", + "firewall": "running", +} + +MOCK_OPEN_PORTS = {8080} # Port 8080 opened for Flask app +MOCK_SERVICES = { + "flask-app": { + "name": "flask-app", + "active": "active", + "state": "running", + "enabled": True, + "description": "Flask application container", + }, + "container-flask-app": { + "name": "container-flask-app", + "active": "active", + "state": "running", + "enabled": True, + "description": "Podman container flask-app.service", + }, +} + +MOCK_PODMAN_PS = """CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +a1b2c3d4e5f6 quay.io/ubi9/python-311:latest flask run 2 hours ago Up 2 hours ago 0.0.0.0:8080->8080/tcp flask-app +""" + +MOCK_PODMAN_INSPECT = """[ + { + "Id": "a1b2c3d4e5f6", + 
"Name": "flask-app", + "State": { + "Status": "running", + "Running": true + }, + "Config": { + "Image": "quay.io/ubi9/python-311:latest", + "Cmd": ["flask", "run", "--host=0.0.0.0", "--port=8080"] + }, + "HostConfig": { + "PortBindings": { + "8080/tcp": [{"HostPort": "8080"}] + } + } + } +] +""" + + +def _match_command(cmd: str) -> Optional[str]: + """Return a command category for pattern matching.""" + cmd_lower = cmd.strip().lower() + if "podman pull" in cmd_lower: + return "podman_pull" + if "podman run" in cmd_lower: + return "podman_run" + if "podman ps" in cmd_lower or cmd_lower == "podman ps": + return "podman_ps" + if "podman inspect" in cmd_lower: + return "podman_inspect" + if "systemctl enable" in cmd_lower: + return "systemctl_enable" + if "systemctl start" in cmd_lower: + return "systemctl_start" + if "systemctl status" in cmd_lower: + return "systemctl_status" + if "firewall-cmd" in cmd_lower: + return "firewall_cmd" + if "semanage fcontext" in cmd_lower: + return "semanage_fcontext" + if "restorecon" in cmd_lower: + return "restorecon" + return None + + +@mcp.tool +def run_command(command: str) -> dict: + """Simulate running a shell command on a RHEL host. + + Supports common deployment patterns: podman, systemctl, firewall-cmd, semanage. + Returns realistic output for supported commands; error for unknown commands. + + Args: + command: The shell command to execute (e.g. 'podman ps', 'systemctl status flask-app'). 
+ """ + kind = _match_command(command) + if kind == "podman_pull": + return { + "command": command, + "exit_code": 0, + "stdout": "Trying to pull quay.io/ubi9/python-311:latest...\nGetting image source signatures\nCopying blob sha256:...\nCopying config sha256:...\nWriting manifest to image destination\nStoring signatures\n", + "stderr": "", + } + if kind == "podman_run": + return { + "command": command, + "exit_code": 0, + "stdout": "a1b2c3d4e5f6", + "stderr": "", + } + if kind == "podman_ps": + return { + "command": command, + "exit_code": 0, + "stdout": MOCK_PODMAN_PS, + "stderr": "", + } + if kind == "podman_inspect": + return { + "command": command, + "exit_code": 0, + "stdout": MOCK_PODMAN_INSPECT, + "stderr": "", + } + if kind == "systemctl_enable": + return { + "command": command, + "exit_code": 0, + "stdout": "", + "stderr": "", + } + if kind == "systemctl_start": + return { + "command": command, + "exit_code": 0, + "stdout": "", + "stderr": "", + } + if kind == "systemctl_status": + return { + "command": command, + "exit_code": 0, + "stdout": """● flask-app.service - Flask application container + Loaded: loaded (/etc/systemd/system/flask-app.service; enabled) + Active: active (running) since Tue 2026-03-17 10:00:00 UTC; 2h ago + Main PID: 1234 (conmon) + Tasks: 8 + Memory: 128.0M + CGroup: /system.slice/flask-app.service +""", + "stderr": "", + } + if kind == "firewall_cmd": + return { + "command": command, + "exit_code": 0, + "stdout": "success\n", + "stderr": "", + } + if kind == "semanage_fcontext": + return { + "command": command, + "exit_code": 0, + "stdout": "", + "stderr": "", + } + if kind == "restorecon": + return { + "command": command, + "exit_code": 0, + "stdout": "", + "stderr": "", + } + return { + "command": command, + "exit_code": 1, + "stdout": "", + "stderr": f"Error: Unknown or unsupported command. 
Supported: podman pull/run/ps/inspect, systemctl enable/start/status, firewall-cmd, semanage fcontext, restorecon.", + } + + +@mcp.tool +def get_system_info() -> dict: + """Return RHEL version, architecture, and Podman version for the target host.""" + return MOCK_SYSTEM_INFO.copy() + + +@mcp.tool +def check_service(name: str) -> dict: + """Return systemd service status for a given service name. + + Args: + name: Service name (e.g. 'flask-app', 'container-flask-app'). + """ + svc = MOCK_SERVICES.get(name) + if svc: + return {"service": name, "status": svc, "found": True} + return { + "service": name, + "found": False, + "error": f"Service '{name}' not found. Known services: {list(MOCK_SERVICES.keys())}", + } + + +@mcp.tool +def check_port(port: int) -> dict: + """Return whether a port is open in the firewall. + + Args: + port: Port number to check (e.g. 8080). + """ + open_port = port in MOCK_OPEN_PORTS + return { + "port": port, + "open": open_port, + "message": f"Port {port} is {'open' if open_port else 'closed'} in firewall.", + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/buildconfig.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/buildconfig.yaml.template new file mode 100644 index 00000000..b3294eb2 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/buildconfig.yaml.template @@ -0,0 +1,38 @@ +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: build + app.kubernetes.io/part-of: ${APP_NAME} +spec: + source: + type: Git + git: + uri: ${GIT_URL} + ref: ${GIT_BRANCH} + strategy: + type: Source + sourceStrategy: + from: + kind: DockerImage + name: ${BUILDER_IMAGE} + env: [] + output: + to: + kind: ImageStreamTag + name: ${APP_NAME}:latest + 
triggers: + - type: ConfigChange + - type: ImageChange + runPolicy: Serial + resources: + limits: + memory: "1Gi" + cpu: "1" + requests: + memory: "512Mi" + cpu: "500m" diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/deployment.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/deployment.yaml.template new file mode 100644 index 00000000..eb3b481a --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: application + app.kubernetes.io/part-of: ${APP_NAME} + annotations: + image.openshift.io/triggers: | + [{"from":{"kind":"ImageStreamTag","name":"${APP_NAME}:latest"},"fieldPath":"spec.template.spec.containers[0].image"}] +spec: + replicas: ${REPLICAS} + selector: + matchLabels: + app: ${APP_NAME} + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + template: + metadata: + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + spec: + containers: + - name: ${APP_NAME} + image: image-registry.openshift-image-registry.svc:5000/${NAMESPACE}/${APP_NAME}:latest + ports: + - containerPort: ${CONTAINER_PORT} + protocol: TCP + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + livenessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: ${CONTAINER_PORT} + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + env: [] + restartPolicy: Always + terminationGracePeriodSeconds: 30 diff --git 
a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/Chart.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/Chart.yaml.template new file mode 100644 index 00000000..1aa22dd1 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/Chart.yaml.template @@ -0,0 +1,13 @@ +apiVersion: v2 +name: ${APP_NAME} +description: ${APP_DESCRIPTION} +type: application +version: 0.1.0 +appVersion: "${APP_VERSION}" +keywords: + - ${LANGUAGE} + - ${FRAMEWORK} + - openshift +maintainers: + - name: ${MAINTAINER_NAME} + email: ${MAINTAINER_EMAIL} diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/NOTES.txt.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/NOTES.txt.template new file mode 100644 index 00000000..154e628d --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/NOTES.txt.template @@ -0,0 +1,32 @@ +Congratulations! Your application {{ include "${APP_NAME}.fullname" . }} has been deployed. + +{{- if .Values.route.enabled }} + +Access your application at: +{{- if .Values.route.host }} + https://{{ .Values.route.host }} +{{- else }} + Run: oc get route {{ include "${APP_NAME}.fullname" . }} -o jsonpath='{.spec.host}' +{{- end }} + +{{- else }} + +Your application is available internally at: + {{ include "${APP_NAME}.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.service.port }} + +To expose it externally, create a Route or set route.enabled=true. + +{{- end }} + +Useful commands: + # View pods + oc get pods -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . }} + + # View logs + oc logs -l app.kubernetes.io/name={{ include "${APP_NAME}.name" . 
}} -f + + # Upgrade release + helm upgrade {{ .Release.Name }} ./{{ .Chart.Name }} -f values.yaml + + # Uninstall release + helm uninstall {{ .Release.Name }} diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/_helpers.tpl.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/_helpers.tpl.template new file mode 100644 index 00000000..15873b10 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/_helpers.tpl.template @@ -0,0 +1,60 @@ +{{/* +Expand the name of the chart. +*/}} +{{- define "${APP_NAME}.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Create a default fully qualified app name. +*/}} +{{- define "${APP_NAME}.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{/* +Create chart name and version as used by the chart label. +*/}} +{{- define "${APP_NAME}.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{/* +Common labels +*/}} +{{- define "${APP_NAME}.labels" -}} +helm.sh/chart: {{ include "${APP_NAME}.chart" . }} +{{ include "${APP_NAME}.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{/* +Selector labels +*/}} +{{- define "${APP_NAME}.selectorLabels" -}} +app.kubernetes.io/name: {{ include "${APP_NAME}.name" . 
}} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- end }} + +{{/* +Create the name of the service account to use +*/}} +{{- define "${APP_NAME}.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "${APP_NAME}.fullname" .) .Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/deployment.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/deployment.yaml.template new file mode 100644 index 00000000..a6cbd868 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/deployment.yaml.template @@ -0,0 +1,61 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if not .Values.autoscaling.enabled }} + replicas: {{ .Values.replicaCount }} + {{- end }} + selector: + matchLabels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "${APP_NAME}.selectorLabels" . | nindent 8 }} + spec: + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ include "${APP_NAME}.serviceAccountName" . 
}} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 8 }} + containers: + - name: {{ .Chart.Name }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} + image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + ports: + - name: http + containerPort: {{ .Values.service.port }} + protocol: TCP + livenessProbe: + {{- toYaml .Values.livenessProbe | nindent 12 }} + readinessProbe: + {{- toYaml .Values.readinessProbe | nindent 12 }} + resources: + {{- toYaml .Values.resources | nindent 12 }} + {{- with .Values.env }} + env: + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/route.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/route.yaml.template new file mode 100644 index 00000000..e2bab29a --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/route.yaml.template @@ -0,0 +1,24 @@ +{{- if .Values.route.enabled }} +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + {{- if .Values.route.host }} + host: {{ .Values.route.host }} + {{- end }} + to: + kind: Service + name: {{ include "${APP_NAME}.fullname" . 
}} + weight: 100 + port: + targetPort: http + {{- with .Values.route.tls }} + tls: + termination: {{ .termination }} + insecureEdgeTerminationPolicy: {{ .insecureEdgeTerminationPolicy }} + {{- end }} + wildcardPolicy: None +{{- end }} diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/service.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/service.yaml.template new file mode 100644 index 00000000..837bc888 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/templates/service.yaml.template @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "${APP_NAME}.fullname" . }} + labels: + {{- include "${APP_NAME}.labels" . | nindent 4 }} +spec: + type: {{ .Values.service.type }} + ports: + - port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + name: http + selector: + {{- include "${APP_NAME}.selectorLabels" . 
| nindent 4 }} diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/values.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/values.yaml.template new file mode 100644 index 00000000..1cca6017 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/helm/values.yaml.template @@ -0,0 +1,67 @@ +# Default values for ${APP_NAME} +replicaCount: 1 + +image: + repository: ${IMAGE_REPOSITORY} + pullPolicy: IfNotPresent + tag: "${IMAGE_TAG}" + +imagePullSecrets: [] +nameOverride: "" +fullnameOverride: "" + +serviceAccount: + create: true + annotations: {} + name: "" + +podAnnotations: {} +podSecurityContext: {} +securityContext: {} + +service: + type: ClusterIP + port: ${CONTAINER_PORT} + +route: + enabled: true + host: "" + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + +livenessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: / + port: http + initialDelaySeconds: 5 + periodSeconds: 5 + +autoscaling: + enabled: false + minReplicas: 1 + maxReplicas: 5 + targetCPUUtilizationPercentage: 80 + +nodeSelector: {} +tolerations: [] +affinity: {} + +env: [] +# - name: MY_VAR +# value: "my-value" diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/imagestream.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/imagestream.yaml.template new file mode 100644 index 00000000..46572193 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/imagestream.yaml.template @@ -0,0 +1,13 @@ +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + 
app.kubernetes.io/component: image + app.kubernetes.io/part-of: ${APP_NAME} +spec: + lookupPolicy: + local: false diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/route.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/route.yaml.template new file mode 100644 index 00000000..7c53d2e7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/route.yaml.template @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: route + app.kubernetes.io/part-of: ${APP_NAME} +spec: + to: + kind: Service + name: ${APP_NAME} + weight: 100 + port: + targetPort: http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/service.yaml.template b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/service.yaml.template new file mode 100644 index 00000000..7e1cf371 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/service.yaml.template @@ -0,0 +1,20 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${APP_NAME} + namespace: ${NAMESPACE} + labels: + app: ${APP_NAME} + app.kubernetes.io/name: ${APP_NAME} + app.kubernetes.io/component: service + app.kubernetes.io/part-of: ${APP_NAME} +spec: + selector: + app: ${APP_NAME} + ports: + - name: http + port: ${CONTAINER_PORT} + targetPort: ${CONTAINER_PORT} + protocol: TCP + type: ClusterIP + sessionAffinity: None diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootful.service b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootful.service new file mode 100644 index 
00000000..c1e8fe8f --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootful.service @@ -0,0 +1,27 @@ +# Rootful Podman container managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} +ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootless.service b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootless.service new file mode 100644 index 00000000..ca9dc371 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-container-rootless.service @@ -0,0 +1,27 @@ +# Rootless Podman container managed by systemd (user service) +# Location: ~/.config/systemd/user/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${PORT} - Port number (used for both host and container binding) +# ${IMAGE} - Container image reference + +[Unit] +Description=${APP_NAME} Container +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +Restart=always +RestartSec=5 +ExecStartPre=-/usr/bin/podman stop -t 10 ${APP_NAME} +ExecStartPre=-/usr/bin/podman rm ${APP_NAME} +ExecStart=/usr/bin/podman run --name ${APP_NAME} \ + -p ${PORT}:${PORT} \ + --rm \ + ${IMAGE} 
+ExecStop=/usr/bin/podman stop -t 10 ${APP_NAME} + +[Install] +WantedBy=default.target diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-native.service b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-native.service new file mode 100644 index 00000000..c55cfc07 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/environment/templates/systemd/systemd-native.service @@ -0,0 +1,39 @@ +# Native application managed by systemd (system service) +# Location: /etc/systemd/system/${APP_NAME}.service +# +# Variables to replace: +# ${APP_NAME} - Application name +# ${SERVICE_USER} - User to run the service as +# ${APP_PATH} - Application install path (e.g., /opt/app-name) +# ${PORT} - Application listen port +# ${START_COMMAND} - Application start command +# +# Start command examples by language: +# Node.js: /usr/bin/node ${APP_PATH}/server.js +# Python: /usr/bin/python3 ${APP_PATH}/app.py +# Java: /usr/bin/java -jar ${APP_PATH}/app.jar +# Go: ${APP_PATH}/binary-name + +[Unit] +Description=${APP_NAME} Service +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=${SERVICE_USER} +WorkingDirectory=${APP_PATH} +Environment=PORT=${PORT} +ExecStart=${START_COMMAND} +Restart=always +RestartSec=5 + +# Security hardening +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +ReadWritePaths=${APP_PATH} + +[Install] +WantedBy=multi-user.target diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/instruction.md b/evaluation/without_skills/rh-developer__rhel-deploy/instruction.md new file mode 100644 index 00000000..b7c3a70e --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/instruction.md @@ -0,0 +1,12 @@ +# RHEL Deployment Task + +You are a Red Hat developer. Plan the deployment of a containerized application on RHEL using Podman and systemd. 
+ +## Requirements +- Configure the container to run as a systemd service +- Address security hardening (SELinux, privilege restrictions) +- Include volume mounts and networking configuration + +Use available tools to examine the environment. Document your methodology, configuration, and deployment plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/solution/solve.sh b/evaluation/without_skills/rh-developer__rhel-deploy/solution/solve.sh new file mode 100644 index 00000000..cf537860 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/solution/solve.sh @@ -0,0 +1,43 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# RHEL Deployment Plan + +## Rootless Podman Setup +```bash +sudo useradd -m appuser +sudo loginctl enable-linger appuser +``` + +## Container Run +```bash +podman run -d --name flask-app -p 8080:5000 -v /opt/app-data:/data:z flask-app:latest +``` + +## Systemd Service +Path: `~/.config/systemd/user/flask-app.service` +```ini +[Unit] +Description=Flask App Container +[Service] +ExecStart=/usr/bin/podman run --rm --name flask-app -p 8080:5000 -v /opt/app-data:/data:Z flask-app:latest +NoNewPrivileges=true +ProtectSystem=strict +ProtectHome=true +PrivateTmp=true +[Install] +WantedBy=default.target +``` + +## Firewall +```bash +sudo firewall-cmd --permanent --add-port=8080/tcp +sudo firewall-cmd --reload +``` + +## SELinux +```bash +sudo semanage port -a -t http_port_t -p tcp 8080 +sudo semanage fcontext -a -t container_file_t '/opt/app-data(/.*)?' 
+sudo restorecon -Rv /opt/app-data +``` +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/task.toml b/evaluation/without_skills/rh-developer__rhel-deploy/task.toml new file mode 100644 index 00000000..0ac61da9 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__rhel-deploy" +name = "rh-developer RHEL Deployment Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "rhel-deploy", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/tests/llm_judge.py b/evaluation/without_skills/rh-developer__rhel-deploy/tests/llm_judge.py new file mode 100644 index 00000000..5d7ba0df --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "selinux_volume_labels", + "file": "/root/report.md", + "question": "Does the report explain SELinux volume labels :z (shared, multi-container) and :Z (private) for Podman bind mounts?", + "reference": "A skilled report uses :z or :Z suffixes on volume mounts and explains the difference. An unskilled report skips SELinux mount context." 
+ }, + { + "id": "rootless_systemd", + "file": "/root/report.md", + "question": "Does the report address rootless systemd service configuration (~/.config/systemd/user/) and loginctl enable-linger?", + "reference": "A skilled report shows the rootless systemd path and explains enable-linger for services to survive logout. An unskilled report only shows rootful /etc/systemd/system/ paths." + }, + { + "id": "semanage_fcontext_restorecon", + "file": "/root/report.md", + "question": "Does the report use semanage fcontext + restorecon for setting SELinux file contexts on application directories?", + "reference": "A skilled report uses 'semanage fcontext -a -t bin_t' plus 'restorecon -Rv' for app files. An unskilled report skips file-level SELinux context management." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/tests/test.sh b/evaluation/without_skills/rh-developer__rhel-deploy/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-developer__rhel-deploy/tests/test_outputs.py b/evaluation/without_skills/rh-developer__rhel-deploy/tests/test_outputs.py new file mode 100644 index 00000000..b4a1c092 --- /dev/null +++ b/evaluation/without_skills/rh-developer__rhel-deploy/tests/test_outputs.py @@ -0,0 +1,98 @@ +""" +Tests for rh-developer__rhel-deploy per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_rhel_or_podman(self): + content = read_report().lower() + assert "rhel" in content or "podman" in content, "report should mention RHEL or Podman" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_selinux_volume_labels(self): + """Skill teaches SELinux volume labels: :z = shared (relabeled for multi-container), + :Z = private. Without skill, agents skip SELinux mount context.""" + c = read_report() + assert ":z" in c or ":Z" in c or "selinux" in c.lower(), ( + "should address SELinux volume labels (:z shared, :Z private)" + ) + + def test_rootless_systemd_path(self): + """Skill teaches rootless systemd service location ~/.config/systemd/user/ + vs /etc/systemd/system/ for rootful. 
Without skill, agents only know rootful.""" + c = read_report() + assert ".config/systemd/user" in c or "rootless" in c.lower(), ( + "should address rootless systemd path (~/.config/systemd/user/)" + ) + + def test_enable_linger(self): + """Skill teaches loginctl enable-linger required for rootless user services + to survive logout. Without skill, agents miss this requirement.""" + c = read_report().lower() + assert "enable-linger" in c or "loginctl" in c or "linger" in c, ( + "should mention loginctl enable-linger for rootless services" + ) + + def test_semanage_fcontext(self): + """Skill teaches semanage fcontext + restorecon for setting SELinux context + on application files. Without skill, agents skip file context management.""" + c = read_report().lower() + assert ("semanage fcontext" in c or "semanage" in c) and ( + "restorecon" in c or "fcontext" in c + ), "should use semanage fcontext + restorecon for file SELinux context" + + def test_firewall_port(self): + """Skill teaches firewall-cmd for opening application ports.""" + c = read_report().lower() + assert "firewall-cmd" in c or ("firewall" in c and "port" in c), ( + "should address firewall port configuration" + ) + + def test_systemd_hardening_directives(self): + """Docs teach systemd hardening directives: NoNewPrivileges=true, + ProtectSystem=strict, ReadWritePaths. Without docs, agents create basic + unit files without security hardening.""" + c = read_report() + assert any(t in c for t in [ + "NoNewPrivileges", "ProtectSystem", "ReadWritePaths", + "PrivateTmp", "ProtectHome", + ]) or "hardening" in c.lower(), ( + "should include systemd hardening directives (NoNewPrivileges, ProtectSystem)" + ) + + def test_container_security_practices(self): + """Skill teaches defence-in-depth for containers: dropping capabilities, + resource limits, read-only root, security options. 
Without skill, + agents deploy containers with default security settings.""" + c = read_report().lower() + practices = sum(1 for t in [ + "cap-drop", "cap_drop", "capability", + "--read-only", "read-only root", + "resource limit", "memory", "cpus", + "no-new-privileges", "security-opt", + ] if t in c) + assert practices >= 2, ( + "should address at least 2 container security practices " + "(capability dropping, resource limits, read-only root, security options)" + ) diff --git a/evaluation/without_skills/rh-developer__s2i-build/environment/Dockerfile b/evaluation/without_skills/rh-developer__s2i-build/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__s2i-build/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": 
["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-developer__s2i-build/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__s2i-build/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__s2i-build/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__s2i-build/instruction.md b/evaluation/without_skills/rh-developer__s2i-build/instruction.md new file mode 100644 index 00000000..107967b9 --- /dev/null +++ b/evaluation/without_skills/rh-developer__s2i-build/instruction.md @@ -0,0 +1,12 @@ +# S2I Build Configuration Task + +You are a Red Hat developer. Configure a Source-to-Image (S2I) build for a Python web application. + +## Requirements +- Select the appropriate builder image +- Configure the build process and entry point +- Address application startup configuration + +Use MCP tools to examine the cluster. Document your methodology, configuration, and build plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-developer__s2i-build/solution/solve.sh b/evaluation/without_skills/rh-developer__s2i-build/solution/solve.sh new file mode 100644 index 00000000..a25acec6 --- /dev/null +++ b/evaluation/without_skills/rh-developer__s2i-build/solution/solve.sh @@ -0,0 +1,60 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# S2I Build Configuration + +## Problem +Python Flask app uses `main.py` as entry point, not the default `app.py`. + +## Solution +1. Create ImageStream for output image +2. Create BuildConfig with `APP_MODULE=main:app` in `sourceStrategy.env` +3. 
Ensure `gunicorn` is in `requirements.txt` + +### ImageStream +```yaml +apiVersion: image.openshift.io/v1 +kind: ImageStream +metadata: + name: flask-app + labels: + app: flask-app +spec: + lookupPolicy: + local: false +``` + +### BuildConfig +```yaml +apiVersion: build.openshift.io/v1 +kind: BuildConfig +metadata: + name: flask-app +spec: + source: + type: Git + git: + uri: https://github.com/example/flask-app + strategy: + type: Source + sourceStrategy: + from: + kind: ImageStreamTag + name: python:3.11-ubi9 + namespace: openshift + env: + - name: APP_MODULE + value: "main:app" + output: + to: + kind: ImageStreamTag + name: flask-app:latest +``` + +### S2I Build Phases +- **Assemble**: Install dependencies from requirements.txt (including gunicorn), compile assets. Customizable via `.s2i/bin/assemble`. +- **Run**: Start the application using gunicorn with APP_MODULE. Customizable via `.s2i/bin/run`. + +### Why APP_MODULE is needed +S2I Python startup sequence: app.sh → gunicorn+APP_MODULE → app.py → ERROR +Since entry is main.py (not app.py), gunicorn must be installed and APP_MODULE must point to main:app. 
+REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__s2i-build/task.toml b/evaluation/without_skills/rh-developer__s2i-build/task.toml new file mode 100644 index 00000000..8dedc143 --- /dev/null +++ b/evaluation/without_skills/rh-developer__s2i-build/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__s2i-build" +name = "rh-developer S2I Build Configuration Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "s2i-build", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__s2i-build/tests/llm_judge.py b/evaluation/without_skills/rh-developer__s2i-build/tests/llm_judge.py new file mode 100644 index 00000000..5fbc562a --- /dev/null +++ b/evaluation/without_skills/rh-developer__s2i-build/tests/llm_judge.py @@ -0,0 +1,114 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "app_module_in_buildconfig", + "file": "/root/report.md", + "question": "Does the report specify that APP_MODULE should be set in the BuildConfig's sourceStrategy.env section (not as a generic environment variable), using the module:callable format (e.g., app:app or main:app)?", + "reference": "A skilled report places APP_MODULE in sourceStrategy.env of the BuildConfig YAML, using the module:callable format. An unskilled report mentions APP_MODULE generically without specifying its placement in sourceStrategy.env." 
+ }, + { + "id": "s2i_build_phases", + "file": "/root/report.md", + "question": "Does the report explain S2I build phases (assemble for dependency installation and compilation, run for application startup) and how they can be customized via .s2i/bin/ scripts?", + "reference": "A skilled report explains the assemble and run phases and mentions .s2i/bin/assemble or .s2i/bin/run for customization. An unskilled report treats S2I as a monolithic process." + }, + { + "id": "gunicorn_dependency", + "file": "/root/report.md", + "question": "Does the report explicitly state that gunicorn must be in requirements.txt specifically BECAUSE the S2I Python builder uses gunicorn to serve the application specified by APP_MODULE?", + "reference": "A skilled report identifies gunicorn as a required dependency for Python S2I with APP_MODULE. An unskilled report doesn't link gunicorn to the entry point mechanism." + }, + { + "id": "imagestream_as_separate_resource", + "file": "/root/report.md", + "question": "Does the report include a standalone ImageStream YAML manifest (with apiVersion: image.openshift.io/v1 and kind: ImageStream) as a separate resource definition, rather than only referencing ImageStreamTag within the BuildConfig output section?", + "reference": "A skilled report defines the ImageStream as its own YAML resource with apiVersion: image.openshift.io/v1, kind: ImageStream, and lookupPolicy configuration, created as a prerequisite before the BuildConfig. An unskilled report only references ImageStreamTag as an output target in the BuildConfig but does not show the ImageStream resource definition." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__s2i-build/tests/test.sh b/evaluation/without_skills/rh-developer__s2i-build/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__s2i-build/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-developer__s2i-build/tests/test_outputs.py b/evaluation/without_skills/rh-developer__s2i-build/tests/test_outputs.py new file mode 100644 index 00000000..ec2af10d --- /dev/null +++ b/evaluation/without_skills/rh-developer__s2i-build/tests/test_outputs.py @@ -0,0 +1,84 @@ +""" +Tests for rh-developer__s2i-build per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_s2i(self): + content = read_report().lower() + assert "s2i" in content or "source-to-image" in content, ( + "report should mention S2I" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_app_module_format(self): + """Skill teaches APP_MODULE env var format module:app (e.g. main:app) for + non-default Python entry points. Without skill, agents don't know this format.""" + c = read_report() + assert "APP_MODULE" in c or "app_module" in c.lower(), ( + "should reference APP_MODULE env var for Python S2I entry point" + ) + + def test_module_colon_app_syntax(self): + """Skill teaches the module:app syntax (e.g., main:app, wsgi:application). 
+ Without skill, agents don't know the colon-separated format.""" + c = read_report() + assert any(t in c for t in ["main:app", "wsgi:app", "module:app", ":app", ":application"]) or ( + "APP_MODULE" in c and ":" in c + ), "should show module:app format for APP_MODULE" + + def test_s2i_build_phases(self): + """Skill teaches S2I build phases: assemble (install deps, compile) and + run (start app). Without skill, agents treat S2I as a black box.""" + c = read_report().lower() + assert ("assemble" in c and ("run" in c or "start" in c)) or ( + "build phase" in c or "build step" in c or "build process" in c + ), "should explain S2I build phases (assemble and run)" + + def test_buildconfig_imagestream(self): + """Skill teaches creating ImageStream + BuildConfig with source/builder/output.""" + c = read_report().lower() + assert any(t in c for t in ["buildconfig", "imagestream", "build config"]) and any(t in c for t in [ + "source", "builder", "output" + ]), "should define BuildConfig/ImageStream" + + def test_gunicorn_requirement(self): + """Skill teaches gunicorn must be in requirements.txt for APP_MODULE.""" + c = read_report().lower() + assert "gunicorn" in c and any(t in c for t in [ + "requirements", "pip", "install", "wsgi", "app_module" + ]), "should address gunicorn requirement for S2I Python" + + def test_standalone_imagestream_yaml(self): + """Skill teaches creating ImageStream as a separate resource with + image.openshift.io/v1 API group and lookupPolicy. 
Without skill, + agents reference ImageStreamTag in BuildConfig but don't define + the ImageStream resource itself.""" + c = read_report() + has_is_api = "image.openshift.io" in c + has_lookup = "lookupPolicy" in c + assert has_is_api or has_lookup, ( + "should define ImageStream resource with image.openshift.io API" + ) + diff --git a/evaluation/without_skills/rh-developer__validate-environment/environment/Dockerfile b/evaluation/without_skills/rh-developer__validate-environment/environment/Dockerfile new file mode 100644 index 00000000..b01cae66 --- /dev/null +++ b/evaluation/without_skills/rh-developer__validate-environment/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-openshift-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git 
a/evaluation/without_skills/rh-developer__validate-environment/environment/mcp-servers/mock-openshift-mcp.py b/evaluation/without_skills/rh-developer__validate-environment/environment/mcp-servers/mock-openshift-mcp.py new file mode 100644 index 00000000..dadb59fb --- /dev/null +++ b/evaluation/without_skills/rh-developer__validate-environment/environment/mcp-servers/mock-openshift-mcp.py @@ -0,0 +1,717 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for rh-developer benchmark task. + +Simulates an OpenShift cluster with 3 namespaces, each containing a broken +deployment that requires different debugging skills to diagnose: + + 1. api-platform / api-service (Python FastAPI) + - S2I build succeeded, pod crashes at runtime + - Entry point is main.py (not app.py), no gunicorn installed + - Requires python-s2i-entrypoints.md knowledge + + 2. web-frontend / web-frontend (Node.js React) + - Pod in CrashLoopBackOff, exit code 137 (OOMKilled) + - Container memory limit 64Mi is too low for Node.js + - Requires debugging-patterns.md exit code knowledge + + 3. order-system / order-service (Java Quarkus) + - Pod running, Route returns 503 + - Service selector mismatch: app=order-svc vs pod label app=order-service + - Tekton PipelineRun failed, logs in step-build container + - Requires debug-network + debug-pipeline knowledge + +Also provides application source metadata for image recommendation. 
+""" + +from typing import Optional +from fastmcp import FastMCP + +mcp = FastMCP("openshift") + + +# --------------------------------------------------------------------------- +# Namespace / Project data +# --------------------------------------------------------------------------- + +NAMESPACES = [ + {"name": "api-platform", "status": "Active", "labels": {"app-type": "backend"}}, + {"name": "web-frontend", "status": "Active", "labels": {"app-type": "frontend"}}, + {"name": "order-system", "status": "Active", "labels": {"app-type": "backend"}}, +] + + +# --------------------------------------------------------------------------- +# Deployment data +# --------------------------------------------------------------------------- + +DEPLOYMENTS = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "containers": [ + { + "name": "api-service", + "image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + "env": [ + {"name": "APP_SCRIPT", "value": ""}, + {"name": "APP_FILE", "value": "main.py"}, + ], + } + ], + "labels": {"app": "api-service", "deployment": "api-service"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "replicas": 1, + "available_replicas": 0, + "ready_replicas": 0, + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "containers": [ + { + "name": "web-frontend", + "image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": 
"200m", "memory": "64Mi"}, + }, + } + ], + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "strategy": "RollingUpdate", + "status": "Available=False (0/1 replicas ready)", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "replicas": 1, + "available_replicas": 1, + "ready_replicas": 1, + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "containers": [ + { + "name": "order-service", + "image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + "ports": [{"containerPort": 8080, "protocol": "TCP"}], + } + ], + "labels": {"app": "order-service", "deployment": "order-service"}, + "strategy": "RollingUpdate", + "status": "Available=True (1/1 replicas ready)", + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod data +# --------------------------------------------------------------------------- + +PODS = { + "api-platform": [ + { + "name": "api-service-7b8f9d4c5-x2k9m", + "namespace": "api-platform", + "status": "CrashLoopBackOff", + "restart_count": 5, + "labels": {"app": "api-service", "deployment": "api-service"}, + "containers": [ + { + "name": "api-service", + "state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 1, + "reason": "Error", + "message": "Application exited with error", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "100m", "memory": "256Mi"}, + "limits": {"cpu": "500m", "memory": "512Mi"}, + }, + } + ], + }, + ], + "web-frontend": [ + { + "name": "web-frontend-6c5d8b7a9-p4n2j", + "namespace": "web-frontend", + "status": "CrashLoopBackOff", + "restart_count": 8, + "labels": {"app": "web-frontend", "deployment": "web-frontend"}, + "containers": [ + { + "name": "web-frontend", + 
"state": "Waiting", + "reason": "CrashLoopBackOff", + "last_state": { + "terminated": { + "exit_code": 137, + "reason": "OOMKilled", + "message": "Container exceeded memory limit", + } + }, + "ready": False, + "resources": { + "requests": {"cpu": "50m", "memory": "32Mi"}, + "limits": {"cpu": "200m", "memory": "64Mi"}, + }, + } + ], + }, + ], + "order-system": [ + { + "name": "order-service-5a4b3c2d1-h7j6k", + "namespace": "order-system", + "status": "Running", + "restart_count": 0, + "labels": {"app": "order-service", "deployment": "order-service"}, + "containers": [ + { + "name": "order-service", + "state": "Running", + "ready": True, + "ports": [{"containerPort": 8080}], + "resources": { + "requests": {"cpu": "200m", "memory": "512Mi"}, + "limits": {"cpu": "1", "memory": "1Gi"}, + }, + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Pod logs +# --------------------------------------------------------------------------- + +POD_LOGS = { + "api-service-7b8f9d4c5-x2k9m": ( + "---> Running application from script (app.sh) ...\n" + "sh: app.sh: No such file or directory\n" + "---> Trying to run with gunicorn ...\n" + "Traceback (most recent call last):\n" + " File \"/opt/app-root/bin/gunicorn\", line 5, in \n" + " from gunicorn.app.wsgiapp import run\n" + "ModuleNotFoundError: No module named 'gunicorn'\n" + "---> Trying to run app.py ...\n" + "Error: Could not find '/opt/app-root/src/app.py'\n" + "---> Failed to find any valid entry point.\n" + " Set the APP_MODULE environment variable to specify your application callable.\n" + " Expected one of: app.sh, gunicorn with APP_MODULE, or app.py\n" + ), + "web-frontend-6c5d8b7a9-p4n2j": ( + "> react-app@1.0.0 start\n" + "> node server.js\n" + "\n" + "Server starting on port 3000...\n" + "Loading configuration...\n" + "Initializing middleware...\n" + "Killed\n" + ), + "order-service-5a4b3c2d1-h7j6k": ( + "__ ____ __ _____ ___ __ ____ ______ \n" + " --/ __ \\/ / / / _ 
| / _ \\/ //_/ / / / __/ \n" + " -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\\ \\ \n" + "--\\___\\_\\____/_/ |_/_/|_/_/|_|\\____/___/ \n" + "2026-02-15 10:30:15,234 INFO [io.quarkus] Quarkus 3.8.1 on JVM started in 2.345s.\n" + "2026-02-15 10:30:15,236 INFO [io.quarkus] Profile prod activated.\n" + "2026-02-15 10:30:15,237 INFO [io.quarkus] Installed features: [cdi, rest, smallrye-health]\n" + "2026-02-15 10:30:15,238 INFO [io.quarkus] Listening on: http://0.0.0.0:8080\n" + ), +} + + +# --------------------------------------------------------------------------- +# Build data +# --------------------------------------------------------------------------- + +BUILDS = { + "api-platform": [ + { + "name": "api-service-1", + "namespace": "api-platform", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/api-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/python:3.11-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest", + "duration": "2m15s", + }, + ], + "web-frontend": [ + { + "name": "web-frontend-1", + "namespace": "web-frontend", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/web-frontend.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/nodejs:20-ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest", + "duration": "3m42s", + }, + ], + "order-system": [ + { + "name": "order-service-1", + "namespace": "order-system", + "status": "Complete", + "source_type": "Git", + "source_uri": "https://github.com/example/order-service.git", + "strategy": "Source", + "builder_image": "image-registry.openshift-image-registry.svc:5000/openshift/openjdk-17:ubi9", + "output_image": "image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest", + 
"duration": "4m08s", + }, + ], +} + +BUILD_LOGS = { + "api-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/api-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image python:3.11-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from requirements.txt ...\n" + "Collecting fastapi==0.109.0\n" + "Collecting uvicorn==0.27.0\n" + "Collecting pydantic==2.5.3\n" + "Successfully installed fastapi-0.109.0 uvicorn-0.27.0 pydantic-2.5.3\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/api-platform/api-service:latest\n" + "Push successful\n" + ), + "web-frontend-1": ( + "===> STEP 1: Fetching source from https://github.com/example/web-frontend.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image nodejs:20-ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Installing dependencies from package.json ...\n" + "---> Running build script: npm run build ...\n" + "---> Build complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/web-frontend/web-frontend:latest\n" + "Push successful\n" + ), + "order-service-1": ( + "===> STEP 1: Fetching source from https://github.com/example/order-service.git\n" + "Cloning into '/tmp/src'...\n" + "===> STEP 2: Pulling builder image openjdk-17:ubi9\n" + "===> STEP 3: Running assemble script\n" + "---> Installing application source ...\n" + "---> Building with Maven ...\n" + "[INFO] BUILD SUCCESS\n" + "---> Assemble script complete.\n" + "===> STEP 4: Committing image\n" + "===> STEP 5: Pushing image to image-registry.openshift-image-registry.svc:5000/order-system/order-service:latest\n" + "Push successful\n" + ), +} + + +# 
--------------------------------------------------------------------------- +# Service data +# --------------------------------------------------------------------------- + +SERVICES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "type": "ClusterIP", + "cluster_ip": "172.30.45.112", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "api-service"}, + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "type": "ClusterIP", + "cluster_ip": "172.30.89.201", + "ports": [{"port": 3000, "target_port": 3000, "protocol": "TCP"}], + "selector": {"app": "web-frontend"}, + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "type": "ClusterIP", + "cluster_ip": "172.30.67.55", + "ports": [{"port": 8080, "target_port": 8080, "protocol": "TCP"}], + "selector": {"app": "order-svc"}, + }, + ], +} + + +# --------------------------------------------------------------------------- +# Route data +# --------------------------------------------------------------------------- + +ROUTES = { + "api-platform": [ + { + "name": "api-service", + "namespace": "api-platform", + "host": "api-service-api-platform.apps.cluster.example.com", + "path": "/", + "service": "api-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "web-frontend": [ + { + "name": "web-frontend", + "namespace": "web-frontend", + "host": "web-frontend-web-frontend.apps.cluster.example.com", + "path": "/", + "service": "web-frontend", + "port": 3000, + "tls_termination": "edge", + "status": "Admitted", + }, + ], + "order-system": [ + { + "name": "order-service", + "namespace": "order-system", + "host": "order-service-order-system.apps.cluster.example.com", + "path": "/", + "service": "order-service", + "port": 8080, + "tls_termination": "edge", + "status": "Admitted", + "conditions": [ + { + "type": "Admitted", + "status": "True", + 
"message": "Route admitted but backend returns 503 Service Unavailable", + } + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Events +# --------------------------------------------------------------------------- + +EVENTS = { + "api-platform": [ + {"type": "Normal", "reason": "Created", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Created container api-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Started container api-service"}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/api-service-7b8f9d4c5-x2k9m", + "message": "Back-off restarting failed container api-service"}, + ], + "web-frontend": [ + {"type": "Normal", "reason": "Created", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Created container web-frontend"}, + {"type": "Normal", "reason": "Started", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Started container web-frontend"}, + {"type": "Warning", "reason": "OOMKilled", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Container web-frontend was OOMKilled (exit code 137). 
Memory limit: 64Mi."}, + {"type": "Warning", "reason": "BackOff", "object": "Pod/web-frontend-6c5d8b7a9-p4n2j", + "message": "Back-off restarting failed container web-frontend"}, + ], + "order-system": [ + {"type": "Normal", "reason": "Created", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Created container order-service"}, + {"type": "Normal", "reason": "Started", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Started container order-service"}, + {"type": "Normal", "reason": "Scheduled", "object": "Pod/order-service-5a4b3c2d1-h7j6k", + "message": "Successfully assigned order-system/order-service-5a4b3c2d1-h7j6k to worker-2"}, + {"type": "Warning", "reason": "FailedPipelineRun", "object": "PipelineRun/order-service-deploy-run-7x2k", + "message": "PipelineRun failed at task 'integration-test'. Check step-build and step-test containers for logs."}, + ], +} + + +# --------------------------------------------------------------------------- +# Tekton pipeline data +# --------------------------------------------------------------------------- + +PIPELINE_RUNS = { + "order-system": [ + { + "name": "order-service-deploy-run-7x2k", + "namespace": "order-system", + "pipeline": "order-service-deploy", + "status": "Failed", + "start_time": "2026-02-15T09:15:00Z", + "completion_time": "2026-02-15T09:22:30Z", + "task_runs": [ + { + "name": "order-service-deploy-run-7x2k-build", + "task": "build", + "status": "Succeeded", + "steps": [ + {"name": "step-git-clone", "status": "Completed", "exit_code": 0}, + {"name": "step-build", "status": "Completed", "exit_code": 0}, + {"name": "step-push", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-deploy", + "task": "deploy", + "status": "Succeeded", + "steps": [ + {"name": "step-deploy", "status": "Completed", "exit_code": 0}, + ], + }, + { + "name": "order-service-deploy-run-7x2k-integration-test", + "task": "integration-test", + "status": "Failed", + "steps": [ 
+ {"name": "step-test", "status": "Failed", "exit_code": 1, + "log": ( + "Running integration tests against order-service...\n" + "GET https://order-service-order-system.apps.cluster.example.com/api/health\n" + "Response: 503 Service Unavailable\n" + "FAIL: Health check returned 503, expected 200\n" + "Hint: Service endpoint is unreachable. Verify service routing.\n" + )}, + ], + }, + ], + }, + ], +} + + +# --------------------------------------------------------------------------- +# Application source metadata (for image recommendation) +# --------------------------------------------------------------------------- + +APP_SOURCES = { + "inventory-api": { + "name": "inventory-api", + "language": "Python", + "version": "3.11", + "framework": "Flask", + "entry_point": "app.py", + "dependencies": ["flask==3.0.0", "sqlalchemy==2.0.25", "gunicorn==21.2.0", "psycopg2-binary==2.9.9"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/inventory-api.git", + }, + "customer-portal": { + "name": "customer-portal", + "language": "Node.js", + "version": "20", + "framework": "React (Next.js)", + "entry_point": "server.js", + "dependencies": ["next@14.1.0", "react@18.2.0", "express@4.18.2"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/customer-portal.git", + }, + "payment-processor": { + "name": "payment-processor", + "language": "Java", + "version": "17", + "framework": "Quarkus", + "entry_point": "src/main/java/com/example/Application.java", + "build_tool": "Maven", + "dependencies": ["quarkus-rest", "quarkus-hibernate-orm-panache", "quarkus-jdbc-postgresql"], + "target": "production", + "has_dockerfile": False, + "has_tests": True, + "repo": "https://github.com/example/payment-processor.git", + "notes": "Quarkus application. 
Consider native compilation for production.", + }, +} + + +# --------------------------------------------------------------------------- +# MCP Tools +# --------------------------------------------------------------------------- + +@mcp.tool +def list_projects() -> dict: + """List all OpenShift projects (namespaces) in the cluster. + + Returns project names, status, and labels. + """ + return {"projects": NAMESPACES, "count": len(NAMESPACES)} + + +@mcp.tool +def get_deployments(namespace: str) -> dict: + """Get deployments in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + deps = DEPLOYMENTS.get(namespace, []) + return {"deployments": deps, "count": len(deps), "namespace": namespace} + + +@mcp.tool +def get_pods(namespace: str) -> dict: + """Get pods in a namespace with their status and container details. + + Args: + namespace: The OpenShift namespace/project name. + """ + pods = PODS.get(namespace, []) + return {"pods": pods, "count": len(pods), "namespace": namespace} + + +@mcp.tool +def pod_logs(pod_name: str, namespace: str, previous: bool = False) -> dict: + """Get logs from a pod. + + Args: + pod_name: Name of the pod. + namespace: The OpenShift namespace/project name. + previous: If True, get logs from the previous terminated container. + """ + logs = POD_LOGS.get(pod_name, f"No logs available for pod {pod_name}") + return {"pod": pod_name, "namespace": namespace, "logs": logs, "previous": previous} + + +@mcp.tool +def get_builds(namespace: str) -> dict: + """Get builds in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + builds = BUILDS.get(namespace, []) + return {"builds": builds, "count": len(builds), "namespace": namespace} + + +@mcp.tool +def get_build_log(build_name: str, namespace: str) -> dict: + """Get the log output from a build. + + Args: + build_name: Name of the build (e.g. 'api-service-1'). + namespace: The OpenShift namespace/project name. 
+ """ + log = BUILD_LOGS.get(build_name, f"No build log found for {build_name}") + return {"build": build_name, "namespace": namespace, "log": log} + + +@mcp.tool +def get_services(namespace: str) -> dict: + """Get services in a namespace with their selectors and ports. + + Args: + namespace: The OpenShift namespace/project name. + """ + svcs = SERVICES.get(namespace, []) + return {"services": svcs, "count": len(svcs), "namespace": namespace} + + +@mcp.tool +def get_routes(namespace: str) -> dict: + """Get routes in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + routes = ROUTES.get(namespace, []) + return {"routes": routes, "count": len(routes), "namespace": namespace} + + +@mcp.tool +def get_events(namespace: str) -> dict: + """Get events in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + events = EVENTS.get(namespace, []) + return {"events": events, "count": len(events), "namespace": namespace} + + +@mcp.tool +def get_pipeline_runs(namespace: str) -> dict: + """Get Tekton PipelineRuns in a namespace. + + Args: + namespace: The OpenShift namespace/project name. + """ + runs = PIPELINE_RUNS.get(namespace, []) + return {"pipeline_runs": runs, "count": len(runs), "namespace": namespace} + + +@mcp.tool +def get_app_source_info(app_name: str) -> dict: + """Get detected source information for an application project. + + Returns language, framework, version, dependencies, and deployment target. + + Args: + app_name: Application name (e.g. 'inventory-api', 'customer-portal', 'payment-processor'). + """ + if app_name in APP_SOURCES: + return APP_SOURCES[app_name] + return {"error": f"Application '{app_name}' not found. Available: {list(APP_SOURCES.keys())}"} + + +@mcp.tool +def list_available_apps() -> dict: + """List all application projects available for analysis. + + Returns names and basic metadata for applications that need + image recommendations or deployment planning. 
+ """ + apps = [] + for name, info in APP_SOURCES.items(): + apps.append({ + "name": name, + "language": info["language"], + "version": info["version"], + "framework": info["framework"], + "target": info["target"], + }) + return {"applications": apps, "count": len(apps)} + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-developer__validate-environment/instruction.md b/evaluation/without_skills/rh-developer__validate-environment/instruction.md new file mode 100644 index 00000000..b9024f98 --- /dev/null +++ b/evaluation/without_skills/rh-developer__validate-environment/instruction.md @@ -0,0 +1,13 @@ +# Environment Validation Task + +You are a Red Hat developer. Before deploying a new application, you need to confirm the OpenShift environment is ready and properly configured. + +## Requirements +- Verify cluster connectivity: confirm you can reach the API server and authenticate successfully +- Check namespace readiness: does the target namespace exist, and do you have permissions to create deployments, services, and routes in it? +- Verify resource availability: are there sufficient CPU and memory quotas remaining for a new deployment? +- Produce a readiness checklist with pass/fail status for each check and an overall go/no-go recommendation + +Document your validation results and readiness assessment in `/root/report.md`. + +Use MCP tools to examine the cluster. If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-developer__validate-environment/solution/solve.sh b/evaluation/without_skills/rh-developer__validate-environment/solution/solve.sh new file mode 100644 index 00000000..3cb34892 --- /dev/null +++ b/evaluation/without_skills/rh-developer__validate-environment/solution/solve.sh @@ -0,0 +1,36 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Environment Validation Report + +## Validation Scope: All +(Options: All, OpenShift, RHEL/Containers, Minimal) + +### Tool Availability +| Tool | Status | Version | +|------|--------|---------| +| git | OK | 2.43.0 | +| curl | OK | 8.5.0 | +| jq | OK | 1.7.1 | +| oc | OK | 4.14.0 | +| helm | OK | 3.14.0 | +| podman | OK | 4.9.0 | +| skopeo | OK | 1.14.0 | +| ssh | OK | OpenSSH 9.6 | + +Status indicators: OK (working), MISSING (not in PATH), WARN (optional missing). + +### OpenShift Permissions (oc auth can-i) +| Resource | Action | Status | +|----------|--------|--------| +| deployments | create | OK | +| buildconfigs | create | OK | +| imagestreams | create | OK | + +### Connectivity +- Cluster: Connected (`oc whoami` → admin) +- Podman info: `podman info --format '{{.Host.OS}} {{.Host.Arch}}'` → linux amd64 + +### Ready for +/detect-project, /s2i-build, /deploy, /helm-deploy, /containerize-deploy + +REPORT_EOF diff --git a/evaluation/without_skills/rh-developer__validate-environment/task.toml b/evaluation/without_skills/rh-developer__validate-environment/task.toml new file mode 100644 index 00000000..10df7267 --- /dev/null +++ b/evaluation/without_skills/rh-developer__validate-environment/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-developer__validate-environment" +name = "rh-developer Environment Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-developer", "validate-environment", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY 
= "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-developer__validate-environment/tests/llm_judge.py b/evaluation/without_skills/rh-developer__validate-environment/tests/llm_judge.py new file mode 100644 index 00000000..3545ef59 --- /dev/null +++ b/evaluation/without_skills/rh-developer__validate-environment/tests/llm_judge.py @@ -0,0 +1,108 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + { + "id": "skopeo_validation", + "file": "/root/report.md", + "question": "Does the report validate skopeo as a required tool in the environment?", + "reference": "A skilled report checks that skopeo is installed and available, as it is needed for image recommendation workflows. An unskilled report skips skopeo validation." + }, + { + "id": "rbac_permission_checks", + "file": "/root/report.md", + "question": "Does the report verify that the user has permissions to create deployments, buildconfigs, or imagestreams in the target namespace?", + "reference": "A skilled report checks create permissions for deployments, buildconfigs, and imagestreams — either via 'oc auth can-i' commands or via MCP/API queries that verify the same permissions. An unskilled report only checks identity (oc whoami) without verifying specific resource permissions." + }, + { + "id": "structured_validation_report", + "file": "/root/report.md", + "question": "Does the report present environment validation results in a structured pass/fail format with remediation for failures?", + "reference": "A skilled report presents each check as pass/fail with an overall go/no-go assessment and remediation steps. 
An unskilled report lists tools without structured evaluation." + } +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-developer__validate-environment/tests/test.sh b/evaluation/without_skills/rh-developer__validate-environment/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-developer__validate-environment/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-developer__validate-environment/tests/test_outputs.py b/evaluation/without_skills/rh-developer__validate-environment/tests/test_outputs.py new file mode 100644 index 00000000..8f62b808 --- /dev/null +++ b/evaluation/without_skills/rh-developer__validate-environment/tests/test_outputs.py @@ -0,0 +1,67 @@ +""" +Tests for rh-developer__validate-environment per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: methodology checks that require skill knowledge. +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_environment(self): + content = read_report().lower() + assert any(t in content for t in ["environment", "cluster", "ready", "validation"]), ( + "report should mention environment validation" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 100, "report should have substantial content" + + +class TestSkillDependent: + def test_skopeo_as_required_tool(self): + """Skill teaches skopeo is a required dependency for image recommendation flows. + Without skill, agents skip skopeo in environment validation.""" + c = read_report().lower() + assert "skopeo" in c, ( + "should validate skopeo as a required tool" + ) + + def test_oc_auth_can_i_checks(self): + """Skill teaches oc auth can-i create deployments/buildconfigs/imagestreams + for permission checks. 
Without skill, agents only check oc whoami.""" + c = read_report().lower() + has_permission_method = ("auth can-i" in c or "can-i" in c or "permission" in c) + has_resource_type = any(t in c for t in [ + "deployment", "buildconfig", "imagestream", "create" + ]) + assert has_permission_method and has_resource_type, ( + "should verify create permissions for deployments/buildconfigs/imagestreams" + ) + + def test_tool_version_checks(self): + """Skill teaches checking version/availability of oc, helm, podman, git.""" + c = read_report().lower() + tools = ["oc", "helm", "podman", "git", "skopeo"] + mentioned = sum(1 for t in tools if t in c) + assert mentioned >= 3, "should validate multiple CLI tools" + + def test_structured_pass_fail(self): + """Skill teaches presenting results as pass/fail per check.""" + c = read_report().lower() + assert any(t in c for t in ["pass", "fail", "missing", "go", "no-go", "available"]) and any(t in c for t in [ + "tool", "check", "oc", "helm", "result" + ]), "should provide structured pass/fail validation report" diff --git a/evaluation/without_skills/rh-sre__cve-impact/environment/Dockerfile b/evaluation/without_skills/rh-sre__cve-impact/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-impact/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers 
/root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__cve-impact/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__cve-impact/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-impact/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> 
list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": 
f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, 
+ "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else 
timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": 
["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request 
processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__cve-impact/instruction.md b/evaluation/without_skills/rh-sre__cve-impact/instruction.md new file mode 100644 index 00000000..00b38e1d --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-impact/instruction.md @@ -0,0 +1,14 @@ +# CVE Impact Analysis Task + +You are a Red Hat SRE. A critical vulnerability has been announced, and management needs to know how many of your systems are affected before deciding on emergency patching. 
+ +## Requirements +- Query your fleet to identify all systems affected by the CVE +- Break down the impact by environment (production vs staging vs development) and by RHEL version +- Report total affected system count, and flag any high-criticality systems (e.g., customer-facing, compliance-regulated) +- If results span multiple pages, ensure you capture the complete picture +- Provide a risk summary: severity, exposure scope, and recommended urgency level + +Document your methodology, impact analysis, and risk assessment in `/root/report.md`. + +Use MCP tools to query vulnerability data. If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously — do not stop to ask for user confirmation or input at any checkpoint. Use reasonable defaults (e.g., fetch all available data) and proceed through every step to produce the final report. diff --git a/evaluation/without_skills/rh-sre__cve-impact/solution/solve.sh b/evaluation/without_skills/rh-sre__cve-impact/solution/solve.sh new file mode 100644 index 00000000..fbbfb891 --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-impact/solution/solve.sh @@ -0,0 +1,15 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# CVE Impact Analysis + +## CVE-2024-12345 +- Severity: Critical (CVSS 9.8) +- Affected systems: 6 +- Patched: 2 +- Vulnerable: 4 + +## Pagination +Used limit=100 per page, system_uuid for system-level queries. First page only often returns 0 remediatable CVEs—systems may have 1700+ CVEs (~18 API calls). Recommend "all pages" for remediatable queries. 
+ +## Data parsed using cve-response-parser.py +REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__cve-impact/task.toml b/evaluation/without_skills/rh-sre__cve-impact/task.toml new file mode 100644 index 00000000..1ef53278 --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-impact/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__cve-impact" +name = "rh-sre CVE Impact Analysis Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "cve-impact", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__cve-impact/tests/llm_judge.py b/evaluation/without_skills/rh-sre__cve-impact/tests/llm_judge.py new file mode 100644 index 00000000..91bf4254 --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-impact/tests/llm_judge.py @@ -0,0 +1,94 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "comprehensive_cve_coverage", "file": "/root/report.md", "question": "Does the report analyze multiple CVEs from the fleet inventory (at least 3 distinct CVE IDs) with per-CVE detail, rather than focusing only on a single CVE?", "reference": "A skilled report comprehensively assesses all CVEs affecting the fleet (CVE-2024-12345, 54321, 98765, 11111, 22222) with individual analysis. 
An unskilled report often focuses only on the primary CVE-2024-12345."}, + {"id": "multi_environment_breakdown", "file": "/root/report.md", "question": "Does the report break down CVE impact across at least 3 distinct environment tiers (e.g., production, staging, development, QA, legacy) with per-environment system counts or status?", "reference": "A skilled report categorizes affected systems by environment tier (production, staging, dev, QA, legacy) with counts per environment. An unskilled report provides aggregate totals without environment-level detail."}, + {"id": "risk_assessment", "file": "/root/report.md", "question": "Does the report provide a risk assessment that considers multiple factors such as CVSS score, affected system count, and environment criticality?", "reference": "A skilled report includes a multi-factor risk assessment. An unskilled report gives generic severity ratings without combining multiple factors."}, + {"id": "operational_priority_ranking", "file": "/root/report.md", "question": "Does the report assign explicit operational priority tiers (like P0/P1/P2 or Priority 1/2/3) to CVEs with associated SLA timeframes (e.g., within 24 hours, within 7 days), going beyond just severity labels?", "reference": "A skilled report maps CVE severity to operational priority tiers (P0=immediate/24h, P1=7 days, P2=30 days) with concrete remediation deadlines. An unskilled report uses only vendor severity labels (Critical/Important/Moderate) without operational priority mapping."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__cve-impact/tests/test.sh b/evaluation/without_skills/rh-sre__cve-impact/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-impact/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__cve-impact/tests/test_outputs.py b/evaluation/without_skills/rh-sre__cve-impact/tests/test_outputs.py new file mode 100644 index 00000000..d5edc006 --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-impact/tests/test_outputs.py @@ -0,0 +1,92 @@ +""" +Tests for rh-sre__cve-impact per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_cve(self): + content = read_report().lower() + assert "cve" in content, "report should mention CVEs" + + def test_mentions_impact(self): + content = read_report().lower() + assert any(t in content for t in ["impact", "affected", "system", "fleet"]), ( + "report should discuss impact" + ) + + +class TestSkillDependent: + def test_full_cve_coverage(self): + """Skill teaches comprehensive fleet-wide CVE assessment across all CVEs. + Without skill, agents often focus only on the primary CVE.""" + c = read_report() + cve_ids = ["CVE-2024-12345", "CVE-2024-54321", "CVE-2024-98765", + "CVE-2024-11111", "CVE-2024-22222"] + found = sum(1 for cve in cve_ids if cve in c) + assert found >= 3, ( + f"should analyze multiple CVEs from fleet (found {found}/5); " + "skill teaches comprehensive multi-CVE assessment" + ) + + def test_prioritized_remediation_order(self): + """Skill teaches prioritizing CVEs with explicit priority ranking + (P0/P1/P2 or similar ordered tiers). 
Without skill, agents list by + severity without operational priority ranking.""" + c = read_report() + has_priority = any(t in c for t in [ + "P0", "P1", "P2", "Priority 0", "Priority 1", "Priority 2", + ]) or any(t in c.lower() for t in [ + "priority order", "remediation priority", "remediation order", + "triage priority", "priority ranking", "prioritized order", + ]) + assert has_priority, ( + "should assign explicit priority ranking (P0/P1/P2 or equivalent) to CVEs" + ) + + def test_multi_environment_breakdown(self): + """Skill teaches breaking down impact by environment (prod/staging/dev/QA/legacy). + Without skill, agents report aggregate counts without per-environment detail.""" + c = read_report().lower() + envs = ["production", "staging", "development", "qa", "legacy", "dev"] + found = sum(1 for e in envs if e in c) + assert found >= 3, ( + f"should break down impact across multiple environments (found {found}); " + "skill teaches per-environment categorization" + ) + + def test_risk_assessment_structure(self): + """Skill: Risk assessment with CVSS, affected count, environment criticality.""" + c = read_report().lower() + has_risk = any(t in c for t in ["risk", "priority", "urgency", "criticality"]) + has_factors = any(t in c for t in ["cvss", "affect", "severity", "count", "staging", "criticality"]) + assert has_risk and has_factors, ( + "should provide risk assessment with multiple factors (skill: Step 5)" + ) + + def test_classification_methodology(self): + """Skill teaches using classification criteria/methodology for CVE interpretation. 
+ Without skill, agents classify severity ad-hoc.""" + c = read_report().lower() + assert any(t in c for t in [ + "classification", "methodology", "criteria", + "vulnerability-logic", "cvss-scoring", + "scoring framework", "risk framework", + ]) or ("consult" in c and "reference" in c), ( + "should reference classification methodology for CVE interpretation" + ) diff --git a/evaluation/without_skills/rh-sre__cve-validation/environment/Dockerfile b/evaluation/without_skills/rh-sre__cve-validation/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-validation/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__cve-validation/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__cve-validation/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-validation/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat 
Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": 
_ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + 
"satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 
else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": 
f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + 
"rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", 
"remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + 
"soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{40+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-045-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname substring (any * characters are stripped before matching). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact.
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__cve-validation/instruction.md b/evaluation/without_skills/rh-sre__cve-validation/instruction.md new file mode 100644 index 00000000..27325f5c --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-validation/instruction.md @@ -0,0 +1,12 @@ +# CVE Validation Task + +You are a Red Hat SRE. Validate a set of CVEs to determine which are real, applicable, and remediable on your fleet. + +## Requirements +- Validate CVE identifiers and severity +- Determine which CVEs have available fixes or advisories +- Classify CVEs by remediation status + +Use MCP tools to query vulnerability data. Document your methodology, validation results, and classification in `/root/report.md`. 
+ +If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously — do not stop after preliminary steps like MCP validation. Proceed through CVE querying, validation, classification, and report generation without waiting for user input. diff --git a/evaluation/without_skills/rh-sre__cve-validation/solution/solve.sh b/evaluation/without_skills/rh-sre__cve-validation/solution/solve.sh new file mode 100644 index 00000000..f4350508 --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-validation/solution/solve.sh @@ -0,0 +1,14 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# CVE Validation Report + +## CVE-2024-12345 +- Format: Valid (^CVE-\d{4}-\d{4,7}$) +- Advisory available: Yes (advisory_available, advisories_list) +- Do NOT use rules[] for remediation decision +- Remediation status: automated_remediation_available +- Validation status: valid +- Severity: Critical (Red Hat) +- Affected packages: httpd 2.4.37-1.el8 → 2.4.37-2.el8 +- Priority: P0 (24 hours) +REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__cve-validation/task.toml b/evaluation/without_skills/rh-sre__cve-validation/task.toml new file mode 100644 index 00000000..98d08db5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-validation/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__cve-validation" +name = "rh-sre CVE Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "cve-validation", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__cve-validation/tests/llm_judge.py 
b/evaluation/without_skills/rh-sre__cve-validation/tests/llm_judge.py new file mode 100644 index 00000000..f0df9c9c --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-validation/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "advisory_not_rules", "file": "/root/report.md", "question": "Does the report use advisory_available or advisories_list (not rules[]) to determine remediation availability?", "reference": "A skilled report checks advisory_available/advisories_list for remediation status. An unskilled report incorrectly uses rules[] which is the Advisor engine."}, + {"id": "format_validation", "file": "/root/report.md", "question": "Does the report validate CVE format and accept 4-7 digit sequence numbers?", "reference": "A skilled report accepts CVE IDs with 4-7 digit sequences. An unskilled report may reject valid CVEs with non-5-digit sequences."}, + {"id": "structured_output", "file": "/root/report.md", "question": "Does the report output validation_status and remediation availability in a structured format?", "reference": "A skilled report presents clear validation_status and automated_remediation_available fields."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__cve-validation/tests/test.sh b/evaluation/without_skills/rh-sre__cve-validation/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-validation/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__cve-validation/tests/test_outputs.py b/evaluation/without_skills/rh-sre__cve-validation/tests/test_outputs.py new file mode 100644 index 00000000..21b9262c --- /dev/null +++ b/evaluation/without_skills/rh-sre__cve-validation/tests/test_outputs.py @@ -0,0 +1,81 @@ +""" +Tests for rh-sre__cve-validation per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_cve(self): + content = read_report().lower() + assert "cve" in content, "report should mention CVEs" + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_format_then_api_validation(self): + """Skill: Validate format (regex) first; if valid, ALWAYS call get_cve—do not reject on year/sequence.""" + c = read_report().lower() + has_format = any(t in c for t in ["regex", "pattern", "cve-", "cve-format", "year/sequence"]) + has_api_call = any(t in c for t in ["get_cve", "call", "api", "retrieve", "fetch"]) + assert has_format or has_api_call, ( + "should validate format then call get_cve (skill: do NOT reject on year/sequence before API)" + ) + + def test_advisory_available_not_rules(self): + """Skill teaches remediation determined by advisory_available/advisories_list/remediation field, NOT by rules[].""" + c = read_report().lower() + assert any(t in c for t in ["advisory_available", "advisories_list"]), ( + "should use advisory_available or advisories_list for remediation (skill: rules[] is wrong)" + ) + + def 
test_cve_regex_acceptance(self): + """Skill teaches CVE sequence is 4-7 digits (not always 5).""" + c = read_report().lower() + assert any(t in c for t in ["4,7", "4-7", "4-7 digit", "4 to 7", "regex"]), ( + "should accept CVE sequence 4-7 digits (skill: not always 5 digits)" + ) + + def test_validation_status_output(self): + """Skill: Return validation_status and remediation_status.automated_remediation_available.""" + c = read_report().lower() + has_status = any(t in c for t in ["validation_status", "valid", "invalid", "not_remediable"]) + has_remediation_flag = any(t in c for t in ["automated_remediation", "automated", "manual", "remediat"]) + assert has_status and has_remediation_flag, ( + "should output validation_status and remediation availability" + ) + + def test_affected_packages_with_versions(self): + """Skill: Identify affected packages with current and fixed versions.""" + c = read_report().lower() + has_packages = any(t in c for t in ["package", "affected", "component"]) + has_versions = any(t in c for t in ["version", "fixed", "current", "el8", "el9"]) + assert has_packages and has_versions, ( + "should identify packages with version info (skill: for playbook-generator)" + ) + + def test_remediation_field_value(self): + """Docs teach remediation==2 means automated remediation available. 
+ Without docs, agents don't know the numeric remediation field semantics.""" + c = read_report().lower() + assert any(t in c for t in [ + "remediation==2", "remediation=2", "remediation field", "remediation value", + "automated remediation", + ]), "should interpret remediation field value (2=automated)" diff --git a/evaluation/without_skills/rh-sre__execution-summary/environment/Dockerfile b/evaluation/without_skills/rh-sre__execution-summary/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__execution-summary/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__execution-summary/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__execution-summary/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__execution-summary/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the 
rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale 
else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + 
sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": 
_ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": 
f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + 
"rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", 
"remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + 
"soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{40+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. 
Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + import fnmatch # wrap in "*" so plain substrings still match and inner wildcards work + filtered = [s for s in filtered if fnmatch.fnmatch(s["fqdn"], f"*{hostname_pattern}*")] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__execution-summary/instruction.md b/evaluation/without_skills/rh-sre__execution-summary/instruction.md new file mode 100644 index 00000000..5521bb63 --- /dev/null +++ b/evaluation/without_skills/rh-sre__execution-summary/instruction.md @@ -0,0 +1,15 @@ +# Execution Summary Task + +You are a Red Hat SRE. Your team just completed an emergency remediation of a critical CVE across your managed fleet. Management needs a structured post-incident execution summary. + +## Scenario +A critical kernel vulnerability was announced. Your team used automation tools to identify affected systems, generate remediation playbooks, execute patching, and verify the fix. Now you need to document what was done. 
+ +## Requirements +- Use MCP tools to query the current state of the fleet, identify which systems were affected, and gather evidence of remediation actions taken +- Produce an execution summary that includes: what was done, which tools and automation were used, the sequence of steps, results and verification outcomes, and any remaining gaps +- Structure the summary so it can be reviewed by management and used for future incident response improvement + +Document the full execution summary, including your methodology and the tools used, in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__execution-summary/solution/solve.sh b/evaluation/without_skills/rh-sre__execution-summary/solution/solve.sh new file mode 100644 index 00000000..68891309 --- /dev/null +++ b/evaluation/without_skills/rh-sre__execution-summary/solution/solve.sh @@ -0,0 +1,13 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Execution Summary + +**** EXECUTION SUMMARY START **** +Agents: None +Skills: rh-sre:fleet-inventory,rh-sre:cve-impact +Tools: lightspeed-mcp:get_host_details,lightspeed-mcp:get_cves +Docs: docs/references/cvss-scoring.md,docs/insights/vulnerability-logic.md +**** EXECUTION SUMMARY END **** + +This summary shows all agents, skills, tools, and documentation used during the workflow. 
+REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__execution-summary/task.toml b/evaluation/without_skills/rh-sre__execution-summary/task.toml new file mode 100644 index 00000000..a983e99f --- /dev/null +++ b/evaluation/without_skills/rh-sre__execution-summary/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__execution-summary" +name = "rh-sre Execution Summary Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "execution-summary", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__execution-summary/tests/llm_judge.py b/evaluation/without_skills/rh-sre__execution-summary/tests/llm_judge.py new file mode 100644 index 00000000..c426d912 --- /dev/null +++ b/evaluation/without_skills/rh-sre__execution-summary/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "tool_purpose_attribution", "file": "/root/report.md", "question": "Does the report pair each tool or MCP call with an explicit purpose statement explaining why it was invoked, rather than just listing tools used or describing actions narratively?", "reference": "A skilled report explicitly pairs each tool (e.g., get_host_details, get_cves) with a 'Purpose:' or 'used to' statement. 
An unskilled report lists tools in a table or mentions them inline without structured per-tool purpose mapping."}, + {"id": "tool_purpose_categorization", "file": "/root/report.md", "question": "Does the report organize MCP tools into named functional categories (e.g., 'Vulnerability Assessment Tools', 'Fleet Management Tools', 'Remediation Tools') rather than listing all tools in a single flat list?", "reference": "A skilled report groups tools by their functional purpose into labeled sub-sections (e.g., 'Assessment Tools: get_cves, get_cve', 'Fleet Tools: get_host_details'). An unskilled report lists all tools sequentially without categorizing them by function."}, + {"id": "methodology_phases", "file": "/root/report.md", "question": "Does the report organize the execution into at least two named methodology phases (e.g., 'Data Collection', 'Evidence Gathering', 'Triage', 'Verification') as distinct sections or headings?", "reference": "A skilled report structures execution into named phases as headings or sections. An unskilled report presents a flat narrative or simple numbered list without phase-level organization."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__execution-summary/tests/test.sh b/evaluation/without_skills/rh-sre__execution-summary/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__execution-summary/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + 
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-sre__execution-summary/tests/test_outputs.py b/evaluation/without_skills/rh-sre__execution-summary/tests/test_outputs.py new file mode 100644 index 00000000..6cd1228a --- /dev/null +++ b/evaluation/without_skills/rh-sre__execution-summary/tests/test_outputs.py @@ -0,0 +1,55 @@ +""" +Tests for rh-sre__execution-summary per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: skill-specific patterns (not generic report quality). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['summary', 'execution', 'remediation']), ( + "report should mention execution summary or remediation" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 300, "execution summary should be substantial" + + +class TestSkillDependent: + def test_methodology_phases(self): + """Structuring execution into methodology phases + (data collection, evidence gathering, etc.).""" + c = read_report().lower() + phase_terms = [ + "data collection", "evidence gathering", "discovery", + "triage", "assessment", "verification", + "phase 1", "phase 2", "step 1", "step 2", + ] + found = sum(1 for t in phase_terms if t in c) + assert found >= 2, ( + f"should organize execution into methodology phases (found {found})" + ) + + def test_docs_from_consulted(self): + """Extract docs from 'I consulted' statements; path from docs/ or skills/ onwards.""" + c = read_report().lower() + has_docs = any(t in c for t in ["docs/", "skills/", "consult", "documentation"]) + assert has_docs, ( + "should list documentation 
consulted" + ) diff --git a/evaluation/without_skills/rh-sre__fleet-inventory/environment/Dockerfile b/evaluation/without_skills/rh-sre__fleet-inventory/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__fleet-inventory/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__fleet-inventory/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__fleet-inventory/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__fleet-inventory/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{40+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + # Treat "*" as a wildcard: the literal fragments between wildcards + # must all occur in the FQDN, in order. A pattern without "*" stays + # a plain substring match. + fragments = [p for p in hostname_pattern.split("*") if p] + + def _matches(fqdn: str) -> bool: + pos = 0 + for fragment in fragments: + idx = fqdn.find(fragment, pos) + if idx < 0: + return False + pos = idx + len(fragment) + return True + + filtered = [s for s in filtered if _matches(s["fqdn"])] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + "total_vulnerable": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. 
+ + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__fleet-inventory/instruction.md b/evaluation/without_skills/rh-sre__fleet-inventory/instruction.md new file mode 100644 index 00000000..3074bb9c --- /dev/null +++ b/evaluation/without_skills/rh-sre__fleet-inventory/instruction.md @@ -0,0 +1,17 @@ +# Fleet Inventory Task + +You are a Red Hat SRE. Your manager has asked for a current snapshot of all RHEL systems in your managed fleet ahead of an upcoming compliance audit. + +## Scenario +The compliance team needs to know exactly what systems you manage, their RHEL versions, patch levels, and any outstanding vulnerability exposure. They need this by end of day. 
+ +## Requirements +- Query the fleet to enumerate all managed RHEL systems +- For each system, report: hostname, RHEL version, last check-in date, and patch status +- Identify which systems have outstanding CVEs, grouped by severity +- Flag any systems that are stale (not checking in) or running unsupported RHEL versions +- Summarize the fleet's overall health and compliance readiness + +Document your methodology, findings, and fleet assessment in `/root/report.md`. + +Use MCP tools to query data. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__fleet-inventory/solution/solve.sh b/evaluation/without_skills/rh-sre__fleet-inventory/solution/solve.sh new file mode 100644 index 00000000..dc994408 --- /dev/null +++ b/evaluation/without_skills/rh-sre__fleet-inventory/solution/solve.sh @@ -0,0 +1,25 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Fleet Inventory Report + +## Systems Summary +| Hostname | RHEL | Environment | Status | Last Seen | +|----------|------|-------------|--------|-----------| +| web-01 | 9.3 | Production | Active | 2024-01-15 | +| db-01 | 9.3 | Production | Active | 2024-01-15 | +| dev-01 | 8.9 | Development | Stale | 2024-01-01 | + +## Data Source +Queried via `get_host_details` with pagination. Key fields: rhel_version, tags, stale, last_seen. + +## CVE-Affected Systems +Use `get_cve_systems` with cve_id (CVE-YYYY-NNNNN). Check remediation_available flag. + +## Status Interpretation +- **Vulnerable**: CVE affects system, patch not applied → suggest /remediation +- **Patched**: Previously affected, now remediated → no action +- **Not Affected**: Exclude from affected count + +## Next Steps +For CVE remediation, transition to /remediation skill. 
+REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__fleet-inventory/task.toml b/evaluation/without_skills/rh-sre__fleet-inventory/task.toml new file mode 100644 index 00000000..cff6fe66 --- /dev/null +++ b/evaluation/without_skills/rh-sre__fleet-inventory/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__fleet-inventory" +name = "rh-sre Fleet Inventory Query Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "fleet-inventory", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__fleet-inventory/tests/llm_judge.py b/evaluation/without_skills/rh-sre__fleet-inventory/tests/llm_judge.py new file mode 100644 index 00000000..977611c9 --- /dev/null +++ b/evaluation/without_skills/rh-sre__fleet-inventory/tests/llm_judge.py @@ -0,0 +1,92 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "system_id_for_remediation", "file": "/root/report.md", "question": "Does the report track individual system identifiers (system_uuid, system_id, or host UUID) and link them to specific remediation follow-up actions, rather than just listing hostnames?", "reference": "A skilled report captures system UUIDs or identifiers to enable programmatic remediation API calls. 
An unskilled report lists hostnames or display names without machine-usable identifiers for follow-up."}, + {"id": "classification_methodology", "file": "/root/report.md", "question": "Does the report reference a classification methodology, classification criteria, or vulnerability classification framework for interpreting CVE status, rather than using ad-hoc severity labeling?", "reference": "A skilled report consults or references CVE classification criteria or methodology documentation before interpreting vulnerability data. An unskilled report classifies CVEs based on general knowledge without referencing established criteria."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__fleet-inventory/tests/test.sh b/evaluation/without_skills/rh-sre__fleet-inventory/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__fleet-inventory/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__fleet-inventory/tests/test_outputs.py b/evaluation/without_skills/rh-sre__fleet-inventory/tests/test_outputs.py new file mode 100644 index 00000000..f8c232d0 --- /dev/null +++ b/evaluation/without_skills/rh-sre__fleet-inventory/tests/test_outputs.py @@ -0,0 +1,67 @@ +""" +Tests for rh-sre__fleet-inventory per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['system', 'host', 'fleet', 'inventory']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_system_identifier_tracking(self): + """Skill teaches tracking system identifiers for follow-up actions. 
+ Without skill, agents list systems without identifiers for remediation.""" + c = read_report().lower() + assert any(t in c for t in [ + "system id", "system_id", "system_uuid", "uuid", "identifier", + ]) and any(t in c for t in [ + "remediat", "follow-up", "subsequent", "action", "track", + ]), ( + "should track system identifiers for follow-up remediation actions" + ) + + def test_remediation_transition_offer(self): + """Skill: Offer transition to a remediation workflow for CVE remediation.""" + c = read_report().lower() + assert any(t in c for t in [ + "next step", "remediate", "playbook", + "remediation workflow", "remediation action", + ]), "should offer next steps for remediation" + + def test_classification_criteria_reference(self): + """Skill/docs teach consulting classification criteria or reference + documentation before interpreting vulnerability data. Without skill, + agents classify CVEs based on general knowledge alone.""" + c = read_report().lower() + assert any(t in c for t in [ + "classification criteria", "classification methodology", + "vulnerability classification", "cve classification", + ]) or ( + "classification" in c and any(t in c for t in [ + "consult", "reference", "methodology", "criteria", + ]) + ), "should reference CVE classification criteria or methodology" diff --git a/evaluation/without_skills/rh-sre__job-template-creator/environment/Dockerfile b/evaluation/without_skills/rh-sre__job-template-creator/environment/Dockerfile new file mode 100644 index 00000000..51ce02e5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/environment/Dockerfile @@ -0,0 +1,47 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + 
ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + }, \ + "aap-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-aap-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-aap-mcp.py b/evaluation/without_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-aap-mcp.py new file mode 100644 index 00000000..d8ae4fd5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-aap-mcp.py @@ -0,0 +1,1048 @@ +#!/usr/bin/env python3 +""" +Mock AAP (Ansible Automation Platform) MCP Server + +Simulates the AAP MCP gateway for per-skill evaluation tasks. Implements +the full set of tools used by rh-sre skills: + - job_templates_list / job_templates_retrieve + - projects_list + - job_templates_launch_retrieve + - jobs_retrieve / jobs_stdout_retrieve + - jobs_job_events_list / jobs_job_host_summaries_list + - jobs_relaunch_retrieve + - inventories_list / hosts_list + +Data mirrors a realistic AAP deployment: + - 6 job templates (3 remediation, 1 compliance, 1 patching, 1 reporting) + - 3 projects (remediation, compliance, reporting) + - 3 inventories (production 30 hosts, staging 15 hosts, all-managed 63 hosts) + - 12 recent jobs with varied statuses + +Follows the same mock-server pattern as mock-lightspeed-mcp.py. 
+""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +mcp = FastMCP("aap-mcp") + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +def _ts(delta: timedelta) -> str: + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +# --------------------------------------------------------------------------- +# Mock data: Projects +# --------------------------------------------------------------------------- + +MOCK_PROJECTS = [ + { + "id": 6, + "type": "project", + "name": "Remediation Playbooks", + "description": "CVE and security remediation playbooks managed via Git", + "scm_type": "git", + "scm_url": "https://github.com/org/remediation-playbooks.git", + "scm_branch": "main", + "scm_revision": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2", + "status": "successful", + "last_job_run": _ts(timedelta(hours=2)), + "last_update_failed": False, + "created": _ts(timedelta(days=90)), + "modified": _ts(timedelta(hours=2)), + }, + { + "id": 7, + "type": "project", + "name": "Compliance Checks", + "description": "STIG and CIS compliance scanning playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/compliance-playbooks.git", + "scm_branch": "main", + "scm_revision": "b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3", + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "last_update_failed": False, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 8, + "type": "project", + "name": "Fleet Reporting", + "description": "System inventory and health reporting playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/fleet-reports.git", + "scm_branch": "main", + "scm_revision": "c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", + "status": "successful", + "last_job_run": _ts(timedelta(days=3)), + "last_update_failed": False, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=3)), + }, +] + +# 
--------------------------------------------------------------------------- +# Mock data: Inventories & Hosts +# --------------------------------------------------------------------------- + +MOCK_INVENTORIES = [ + { + "id": 1, + "type": "inventory", + "name": "Production Systems", + "description": "All production RHEL systems across data centers", + "total_hosts": 30, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 5, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 2, + "type": "inventory", + "name": "Staging Systems", + "description": "Pre-production staging environment", + "total_hosts": 15, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 3, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=300)), + "modified": _ts(timedelta(days=7)), + }, + { + "id": 3, + "type": "inventory", + "name": "All Managed Systems", + "description": "Complete fleet: production, staging, development, QA, legacy", + "total_hosts": 63, + "has_active_failures": True, + "hosts_with_active_failures": 2, + "total_groups": 8, + "groups_with_active_failures": 1, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(hours=6)), + }, +] + + +def _generate_hosts(inventory_id: int) -> list[dict]: + """Generate realistic hosts for an inventory.""" + hosts: list[dict] = [] + if inventory_id == 1: + roles = ["web", "db", "app", "lb", "monitoring", "cache"] + for i, role in enumerate(roles): + for j in range(5 if role in ("web", "app") else 4 if role == "db" else 3 if role == "monitoring" else 2): + hosts.append({ + "id": len(hosts) + 1, + "type": "host", + "name": f"{role}-{j+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + 
"has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "production", "role": "{role}"}}', + }) + if len(hosts) >= 30: + break + if len(hosts) >= 30: + break + elif inventory_id == 2: + for i in range(15): + role = ["web", "db", "app"][i % 3] + hosts.append({ + "id": 100 + i, + "type": "host", + "name": f"{role}-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging", "role": "{role}"}}', + }) + elif inventory_id == 3: + for i in range(30): + hosts.append({ + "id": 200 + i, + "type": "host", + "name": f"host-{i+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": i in (45, 58), + "variables": f'{{"rhel_version": "9.3", "environment": "production"}}', + }) + for i in range(15): + hosts.append({ + "id": 230 + i, + "type": "host", + "name": f"host-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging"}}', + }) + for i in range(10): + hosts.append({ + "id": 245 + i, + "type": "host", + "name": f"dev-{i+1:02d}.dev.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "8.9", "environment": "development"}}', + }) + for i in range(5): + hosts.append({ + "id": 255 + i, + "type": "host", + "name": f"qa-{i+1:02d}.qa.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.2", "environment": "qa"}}', + }) + for i in range(3): + hosts.append({ + "id": 260 + i, + "type": "host", + "name": f"legacy-{i+1:02d}.corp.example.com", + "inventory": inventory_id, + "enabled": i < 2, + "has_active_failures": i == 2, + "variables": f'{{"rhel_version": "7.9", "environment": "legacy"}}', + }) + return hosts + + +# 
--------------------------------------------------------------------------- +# Mock data: Job Templates +# --------------------------------------------------------------------------- + +MOCK_JOB_TEMPLATES = [ + { + "id": 10, + "type": "job_template", + "name": "CVE Remediation - Kernel Update", + "description": "Kernel update with boom snapshot for rollback safety", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "successful", + "last_job_run": _ts(timedelta(hours=4)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1001, "status": "successful", "finished": _ts(timedelta(hours=4))}, + }, + "created": _ts(timedelta(days=60)), + "modified": _ts(timedelta(days=2)), + }, + { + "id": 11, + "type": "job_template", + "name": "CVE Remediation - Package Update", + "description": "General package update for CVE remediation with needs-restarting check", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "check", + "verbosity": 1, + "timeout": 1800, + "forks": 10, + "status": "successful", + "last_job_run": _ts(timedelta(hours=12)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, 
"name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1005, "status": "successful", "finished": _ts(timedelta(hours=12))}, + }, + "created": _ts(timedelta(days=45)), + "modified": _ts(timedelta(days=5)), + }, + { + "id": 12, + "type": "job_template", + "name": "CVE Remediation - Generic", + "description": "Generic CVE remediation template for ad-hoc patches", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-remediation.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "never updated", + "last_job_run": None, + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + }, + "created": _ts(timedelta(days=30)), + "modified": _ts(timedelta(days=30)), + }, + { + "id": 20, + "type": "job_template", + "name": "Compliance Check - STIG", + "description": "Run STIG compliance scan across fleet", + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": False, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 7200, + "forks": 20, + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "summary_fields": { + "project": {"id": 7, "name": "Compliance Checks", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 2, "name": "compliance-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1010, "status": "successful", "finished": _ts(timedelta(days=1))}, + }, + 
"created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=14)), + }, + { + "id": 25, + "type": "job_template", + "name": "Emergency Patching", + "description": "Emergency patch application — NO become enabled (misconfigured)", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": False, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 600, + "forks": 25, + "status": "failed", + "last_job_run": _ts(timedelta(days=7)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1020, "status": "failed", "finished": _ts(timedelta(days=7))}, + }, + "created": _ts(timedelta(days=200)), + "modified": _ts(timedelta(days=200)), + }, + { + "id": 30, + "type": "job_template", + "name": "Fleet Health Report", + "description": "Generate fleet health and inventory report", + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": True, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 1800, + "forks": 30, + "status": "successful", + "last_job_run": _ts(timedelta(hours=6)), + "summary_fields": { + "project": {"id": 8, "name": "Fleet Reporting", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1025, "status": "successful", "finished": _ts(timedelta(hours=6))}, + }, + "created": _ts(timedelta(days=120)), + 
"modified": _ts(timedelta(days=14)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Jobs (recent runs) +# --------------------------------------------------------------------------- + +PROD_HOSTS = [ + "web-01.prod.example.com", + "web-02.prod.example.com", + "db-01.prod.example.com", + "db-02.prod.example.com", + "app-01.prod.example.com", + "app-02.prod.example.com", +] + +MOCK_JOBS = [ + { + "id": 1001, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "check", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=4, minutes=30)), + "finished": _ts(timedelta(hours=4)), + "elapsed": 1800.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1002, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=3, minutes=45)), + "finished": _ts(timedelta(hours=3)), + "elapsed": 2700.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1005, + "type": "job", + "name": "CVE Remediation - Package Update", + "job_type": "run", + "status": "successful", + "failed": False, + 
"started": _ts(timedelta(hours=12, minutes=20)), + "finished": _ts(timedelta(hours=12)), + "elapsed": 1200.0, + "job_template": 11, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "limit": "", + "extra_vars": '{"target_cve": "CVE-2024-54321"}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 11, "name": "CVE Remediation - Package Update"}, + }, + }, + { + "id": 1010, + "type": "job", + "name": "Compliance Check - STIG", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(days=1, hours=2)), + "finished": _ts(timedelta(days=1)), + "elapsed": 7200.0, + "job_template": 20, + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 20, "name": "Compliance Check - STIG"}, + }, + }, + { + "id": 1020, + "type": "job", + "name": "Emergency Patching", + "job_type": "run", + "status": "failed", + "failed": True, + "started": _ts(timedelta(days=7, hours=1)), + "finished": _ts(timedelta(days=7)), + "elapsed": 3600.0, + "job_template": 25, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 25, "name": "Emergency Patching"}, + }, + }, + { + "id": 1025, + "type": "job", + "name": "Fleet Health Report", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=6, minutes=30)), + "finished": _ts(timedelta(hours=6)), + "elapsed": 1800.0, + "job_template": 30, + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 30, "name": "Fleet Health Report"}, + }, + }, +] + +_next_job_id = 2000 + + +# 
--------------------------------------------------------------------------- +# Mock stdout generators +# --------------------------------------------------------------------------- + +def _generate_stdout(job: dict) -> str: + """Generate realistic Ansible playbook stdout for a job.""" + playbook_name = job.get("name", "Unknown") + job_type = job.get("job_type", "run") + status = job.get("status", "successful") + limit = job.get("limit", "") + hosts = limit.split(",") if limit else PROD_HOSTS[:3] + hosts = [h.strip() for h in hosts if h.strip()] + extra_vars = job.get("extra_vars", "{}") + mode = " (CHECK MODE)" if job_type == "check" else "" + + lines = [] + lines.append(f"PLAY [{playbook_name}] *****") + lines.append("") + + lines.append(f"TASK [Gathering Facts{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}]") + lines.append("") + + if "kernel" in playbook_name.lower(): + lines.append(f"TASK [Create boom snapshot for rollback{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}] => {{\"msg\": \"boom create --title pre-remediation-CVE-2024-12345\"}}") + lines.append("") + + lines.append(f"TASK [Check disk space for kernel update{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}] => {{\"msg\": \"Disk space OK: 45% used\"}}") + lines.append("") + + lines.append(f"TASK [Update kernel package{mode}] *****") + for h in hosts: + result = "changed" if status == "successful" else "fatal" + if result == "changed": + lines.append(f'changed: [{h}] => {{"msg": "kernel-5.14.0-362.24.1.el9_3 -> kernel-5.14.0-362.24.2.el9_3"}}') + else: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Permission denied", "rc": 1}}') + lines.append("") + + lines.append(f"TASK [Check if reboot is needed (needs-restarting -r){mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"rc": 1, "msg": "Reboot is required to fully utilize updates."}}') + lines.append("") + + elif "package" in playbook_name.lower(): + lines.append(f"TASK [Update target packages for CVE remediation{mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"msg": "httpd-2.4.53-7.el9 -> httpd-2.4.57-8.el9"}}') + lines.append("") + + lines.append(f"TASK [Restart affected services{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append(f"TASK [Verify service health{mode}] *****") + for h in hosts: + lines.append(f'ok: [{h}] => {{"msg": "Service httpd is running"}}') + lines.append("") + + elif "emergency" in playbook_name.lower() and status == "failed": + lines.append(f"TASK [Apply emergency patch{mode}] *****") + for h in hosts: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Missing sudo password (become_enabled not set)", "rc": 1}}') + lines.append("") + lines.append("NO MORE HOSTS LEFT *****") + lines.append("") + + else: + lines.append(f"TASK [Execute playbook tasks{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append("PLAY RECAP *****") + for h in hosts: + if status == "successful": + ok_count = random.randint(3, 6) + changed_count = random.randint(1, 3) + lines.append(f"{h:<45} : ok={ok_count} changed={changed_count} unreachable=0 failed=0 skipped=0 rescued=0 ignored=0") + elif status == "failed": + lines.append(f"{h:<45} : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0") + lines.append("") + + return "\n".join(lines) + + +def _generate_events(job: dict) -> list[dict]: + """Generate realistic Ansible task events for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + events: list[dict] = [] + eid = 1 + + task_names = ["Gathering Facts"] + if "kernel" in job.get("name", "").lower(): + task_names += [ + "Create boom snapshot for rollback", + "Check disk space for kernel update", + "Update kernel package", + "Check if reboot is needed (needs-restarting -r)", + ] + elif "package" in job.get("name", "").lower(): + task_names += [ + "Update target packages for CVE remediation", + "Restart affected services", + "Verify service health", + ] + else: + task_names += ["Execute playbook tasks"] + + for task_name in task_names: + for host in hosts: + is_failed = job.get("status") == "failed" and task_name != "Gathering Facts" + events.append({ + "id": eid, + "type": "job_event", + "event": "runner_on_ok" if not is_failed else "runner_on_failed", + "task": task_name, + "host": host, + "host_name": host, + "play": job.get("name", ""), + "changed": task_name != "Gathering Facts" and not is_failed, + "failed": is_failed, + "event_data": { + "task": task_name, + "host": 
host, + "res": { + "changed": task_name != "Gathering Facts" and not is_failed, + "msg": "Task completed" if not is_failed else "Permission denied", + }, + }, + "created": _ts(timedelta(hours=4, minutes=30 - eid)), + }) + eid += 1 + + return events + + +def _generate_host_summaries(job: dict) -> list[dict]: + """Generate per-host summaries for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + summaries: list[dict] = [] + + for i, host in enumerate(hosts): + is_failed = job.get("status") == "failed" + summaries.append({ + "id": i + 1, + "type": "job_host_summary", + "host": i + 1, + "host_name": host, + "ok": 1 if is_failed else random.randint(3, 6), + "changed": 0 if is_failed else random.randint(1, 3), + "dark": 0, + "failures": 1 if is_failed else 0, + "skipped": 0, + "processed": 1, + "failed": is_failed, + }) + + return summaries + + +# --------------------------------------------------------------------------- +# MCP Tools: Job Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def job_templates_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available job templates in AAP. + + Args: + page_size: Number of results per page (default 10, max 200). + search: Optional search string to filter templates by name. + """ + results = MOCK_JOB_TEMPLATES + if search: + s = search.lower() + results = [t for t in results if s in t["name"].lower() or s in t.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_retrieve(id: str) -> dict: + """Retrieve detailed information about a specific job template. + + Args: + id: Job template ID (as string). 
+ """ + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. Job template {id} does not exist."} + return template + + +@mcp.tool() +def projects_list( + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List available projects in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter projects by name. + """ + results = MOCK_PROJECTS + if search: + s = search.lower() + results = [p for p in results if s in p["name"].lower() or s in p.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_launch_retrieve( + id: str, + requestBody: Optional[dict] = None, +) -> dict: + """Launch a job from a job template. + + Args: + id: Job template ID to launch. + requestBody: Optional launch parameters including job_type ('run' or 'check'), + extra_vars (dict), and limit (comma-separated host list). + """ + global _next_job_id + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. 
Job template {id} does not exist."} + + body = requestBody or {} + job_type = body.get("job_type", template.get("job_type", "run")) + + if not template.get("ask_job_type_on_launch") and job_type != template.get("job_type"): + return { + "error": f"Cannot override job_type: ask_job_type_on_launch is disabled on template {id}", + } + + job_id = _next_job_id + _next_job_id += 1 + + new_job = { + "id": job_id, + "type": "job", + "name": template["name"], + "job_type": job_type, + "status": "pending", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": None, + "elapsed": 0.0, + "job_template": tid, + "inventory": template["inventory"], + "project": template["project"], + "playbook": template["playbook"], + "limit": body.get("limit", ""), + "extra_vars": str(body.get("extra_vars", {})), + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": tid, "name": template["name"]}, + }, + } + MOCK_JOBS.append(new_job) + + # Simulate job completion after launch + new_job["status"] = "successful" + new_job["finished"] = _ts(timedelta(seconds=-300)) + new_job["elapsed"] = 300.0 + + return { + "job": job_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{job_id}/", + "related": { + "stdout": f"/api/controller/v2/jobs/{job_id}/stdout/", + "job_events": f"/api/controller/v2/jobs/{job_id}/job_events/", + "job_host_summaries": f"/api/controller/v2/jobs/{job_id}/job_host_summaries/", + }, + } + + +@mcp.tool() +def jobs_retrieve(id: int) -> dict: + """Get the status and details of a job run. + + Args: + id: Job ID to retrieve. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return job + + +@mcp.tool() +def jobs_list(page_size: int = 10) -> dict: + """List recent job runs. + + Args: + page_size: Number of results to return. 
+ """ + results = sorted(MOCK_JOBS, key=lambda j: j.get("started", ""), reverse=True) + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def jobs_stdout_retrieve(id: int, format: str = "txt") -> dict: + """Get the stdout (console output) from a job run. + + Args: + id: Job ID. + format: Output format ('txt' or 'json'). Default 'txt'. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return { + "content": _generate_stdout(job), + "range": {"start": 0, "end": 1}, + } + + +@mcp.tool() +def jobs_job_events_list(id: int, page_size: int = 50) -> dict: + """Get task-level events for a job run. + + Args: + id: Job ID. + page_size: Number of events to return. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + events = _generate_events(job) + return { + "count": len(events), + "next": None, + "previous": None, + "results": events[:page_size], + } + + +@mcp.tool() +def jobs_job_host_summaries_list(id: int) -> dict: + """Get per-host execution summaries for a job run. + + Args: + id: Job ID. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + summaries = _generate_host_summaries(job) + return { + "count": len(summaries), + "next": None, + "previous": None, + "results": summaries, + } + + +@mcp.tool() +def jobs_relaunch_retrieve( + id: int, + hosts: str = "all", + job_type: str = "run", +) -> dict: + """Relaunch a previously completed or failed job. + + Args: + id: Original job ID to relaunch. + hosts: Which hosts to target ('all' or 'failed'). + job_type: Job type for relaunch ('run' or 'check'). 
+ """ + global _next_job_id + original = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not original: + return {"detail": f"Not found. Job {id} does not exist."} + + new_id = _next_job_id + _next_job_id += 1 + + new_job = { + **original, + "id": new_id, + "job_type": job_type, + "status": "successful", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": _ts(timedelta(seconds=-300)), + "elapsed": 300.0, + "launch_type": "relaunch", + } + MOCK_JOBS.append(new_job) + + return { + "job": new_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{new_id}/", + } + + +# --------------------------------------------------------------------------- +# MCP Tools: Inventory Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def inventories_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available inventories in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter inventories. + """ + results = MOCK_INVENTORIES + if search: + s = search.lower() + results = [inv for inv in results if s in inv["name"].lower() or s in inv.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def hosts_list( + inventory_id: Optional[int] = None, + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List hosts in an inventory. + + Args: + inventory_id: Filter by inventory ID. If not provided, lists hosts from all inventories. + page_size: Number of results per page. + search: Optional search string to filter hosts by name. 
+ """ + inv_ids = [inventory_id] if inventory_id is not None else [inv["id"] for inv in MOCK_INVENTORIES] + hosts = [h for inv_id in inv_ids for h in _generate_hosts(inv_id)] + if search: + s = search.lower() + hosts = [h for h in hosts if s in h["name"].lower()] + return { + "count": len(hosts), + "next": None if len(hosts) <= page_size else "/api/controller/v2/hosts/?page=2", + "previous": None, + "results": hosts[:page_size], + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools.
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__job-template-creator/instruction.md b/evaluation/without_skills/rh-sre__job-template-creator/instruction.md new file mode 100644 index 00000000..77c24f2b --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/instruction.md @@ -0,0 +1,17 @@ +# Job Template Creator Task + +You are a Red Hat SRE. A remediation playbook has been written for a critical CVE, and you need to set up an Ansible Automation Platform job template so the team can run it against affected systems. + +## Scenario +The security team delivered a remediation playbook for CVE-2026-1234. You need to create a job template in AAP that the operations team can use to run this playbook against production hosts. 
+ +## Requirements +- Check which projects and inventories are available in AAP +- Determine the correct project, inventory, and credentials for the remediation playbook +- Document the job template configuration: name, playbook path, inventory, project, credentials, and execution settings (privilege escalation, variable prompts, limit prompts) +- Explain any decisions about template settings (e.g., why `become` is enabled, whether to prompt for variables at launch) +- If template creation requires manual steps (e.g., via the AAP Web UI), document those steps clearly + +Document your methodology, plan, and configuration in `/root/report.md`. + +Use MCP tools to query AAP. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__job-template-creator/solution/solve.sh b/evaluation/without_skills/rh-sre__job-template-creator/solution/solve.sh new file mode 100644 index 00000000..ec9c5b02 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/solution/solve.sh @@ -0,0 +1,19 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Job Template Creation + +## Template Fields +- Inventory: production-systems +- Project: remediation-playbooks +- Playbook: playbooks/remediation/cve-2024-12345.yml +- Credentials: machine-credential +- become_enabled: true + +## Prompt on Launch +- Job Type (REQUIRED for dry-run + run) +- Variables +- Limit + +## Note +No job_templates_create API in AAP MCP. Create via Web UI. Execute mcp-aap-validator before operations. 
+REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__job-template-creator/task.toml b/evaluation/without_skills/rh-sre__job-template-creator/task.toml new file mode 100644 index 00000000..bc2620fa --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__job-template-creator" +name = "rh-sre AAP Job Template Creation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "job-template-creator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__job-template-creator/tests/llm_judge.py b/evaluation/without_skills/rh-sre__job-template-creator/tests/llm_judge.py new file mode 100644 index 00000000..54c93ce1 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "no_create_tool", "file": "/root/report.md", "question": "Does the report acknowledge that AAP MCP has no create/update tools and template creation must be done via Web UI?", "reference": "A skilled report notes the MCP limitation and directs to Web UI. An unskilled report attempts to create templates via API."}, + {"id": "playbook_path_and_git", "file": "/root/report.md", "question": "Does the report require the playbook to be in a Git repo with proper path convention before template creation?", "reference": "A skilled report follows playbooks/remediation/ path convention. 
An unskilled report skips Git integration."}, + {"id": "launch_configuration", "file": "/root/report.md", "question": "Does the report configure prompt-on-launch for job type and privilege escalation?", "reference": "A skilled report enables prompt-on-launch and become_enabled. An unskilled report skips these configuration details."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__job-template-creator/tests/test.sh b/evaluation/without_skills/rh-sre__job-template-creator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__job-template-creator/tests/test_outputs.py b/evaluation/without_skills/rh-sre__job-template-creator/tests/test_outputs.py new file mode 100644 index 00000000..53140085 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-creator/tests/test_outputs.py @@ -0,0 +1,98 @@ +""" +Tests for rh-sre__job-template-creator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['job template', 'template', 'ansible']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_git_before_template(self): + """Skill: Playbook must be in Git repo before template creation; AAP syncs from project.""" + c = read_report().lower() + has_git = any(t in c for t in ["git", "commit", "push", "repository", "sync"]) + has_project = any(t in c for t in ["project", "scm", "sync"]) + assert has_git or has_project, ( + "should add playbook to Git before template (skill: Phase 1)" + ) + + def test_manual_creation_required(self): + """Skill teaches template creation requires manual steps (e.g., Web UI) + because the automation API is read-only for templates.""" + c = read_report().lower() + assert any(t in c for t in [ + "web ui", "manual", "read-only", "cannot create", + "no create", "gui", "interface", + ]), "should acknowledge template creation requires manual 
steps"
+
+    def test_playbook_path_convention(self):
+        """Skill teaches following a consistent directory structure or location
+        convention for remediation playbooks."""
+        c = read_report().lower()
+        assert any(t in c for t in [
+            "playbook path", "remediation playbook", "playbook location",
+            "playbook directory", "playbook structure",
+        ]), "should follow a playbook path convention for remediation"
+
+    def test_privilege_escalation_required(self):
+        """Skill: become_enabled required for remediation (package updates)."""
+        c = read_report().lower()
+        assert any(t in c for t in ["privilege", "become", "sudo", "escalat", "root"]), (
+            "should require privilege escalation (skill: required for package updates)"
+        )
+
+    def test_launch_prompts(self):
+        """Skill: Prompt on Launch for Job Type, Variables, Limit."""
+        c = read_report().lower()
+        assert any(t in c for t in ["launch", "prompt", "variable", "limit", "job type"]), (
+            "should configure prompt on launch (skill: Phase 4)"
+        )
+
+    def test_configurable_variables(self):
+        """Docs teach configuring variables for CVE targeting, remediation mode,
+        and post-remediation verification. Without docs, agents skip variable design."""
+        c = read_report().lower()
+        concepts = sum(1 for group in [
+            ("target_cve", "cve"), ("remediation_mode", "mode"),
+            ("verify_after", "verification"),
+            ("extra_var", "extra var", "variable", "parameter"),
+        ] if any(t in c for t in group))
+        assert concepts >= 3, (
+            "should define configurable variables for CVE targeting, "
+            "remediation mode, and verification"
+        )
+
+    def test_version_control_sync(self):
+        """Skill teaches AAP projects sync playbooks from version control.
+ Without skill, agents describe playbook management without + version-control-backed project sync.""" + c = read_report().lower() + assert any(t in c for t in [ + "scm", "source control", "version control", + "repository sync", "git-backed", "git sync", + ]), "should reference version control sync for AAP project playbooks" diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/Dockerfile b/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/Dockerfile new file mode 100644 index 00000000..51ce02e5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/Dockerfile @@ -0,0 +1,47 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + }, \ + "aap-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-aap-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-aap-mcp.py b/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-aap-mcp.py new file mode 100644 index 00000000..d8ae4fd5 --- /dev/null +++ 
b/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-aap-mcp.py
@@ -0,0 +1,1048 @@
+#!/usr/bin/env python3
+"""
+Mock AAP (Ansible Automation Platform) MCP Server
+
+Simulates the AAP MCP gateway for per-skill evaluation tasks. Implements
+the full set of tools used by rh-sre skills:
+  - job_templates_list / job_templates_retrieve
+  - projects_list
+  - job_templates_launch_retrieve
+  - jobs_retrieve / jobs_list / jobs_stdout_retrieve
+  - jobs_job_events_list / jobs_job_host_summaries_list
+  - jobs_relaunch_retrieve
+  - inventories_list / hosts_list
+
+Data mirrors a realistic AAP deployment:
+  - 6 job templates (3 remediation, 1 compliance, 1 patching, 1 reporting)
+  - 3 projects (remediation, compliance, reporting)
+  - 3 inventories (production 30 hosts, staging 15 hosts, all-managed 63 hosts)
+  - 6 recent jobs (5 successful, 1 failed)
+
+Follows the same mock-server pattern as mock-lightspeed-mcp.py.
+"""
+
+import os
+import random
+from datetime import datetime, timedelta
+from typing import Optional
+
+from fastmcp import FastMCP
+
+random.seed(42)
+
+mcp = FastMCP("aap-mcp")
+
+REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0)
+
+
+def _ts(delta: timedelta) -> str:
+    return (REFERENCE_TIME - delta).isoformat() + "Z"
+
+
+# ---------------------------------------------------------------------------
+# Mock data: Projects
+# ---------------------------------------------------------------------------
+
+MOCK_PROJECTS = [
+    {
+        "id": 6,
+        "type": "project",
+        "name": "Remediation Playbooks",
+        "description": "CVE and security remediation playbooks managed via Git",
+        "scm_type": "git",
+        "scm_url": "https://github.com/org/remediation-playbooks.git",
+        "scm_branch": "main",
+        "scm_revision": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2",
+        "status": "successful",
+        "last_job_run": _ts(timedelta(hours=2)),
+        "last_update_failed": False,
+        "created": _ts(timedelta(days=90)),
+        "modified": _ts(timedelta(hours=2)),
+    },
+    {
+        "id": 7,
+
"type": "project", + "name": "Compliance Checks", + "description": "STIG and CIS compliance scanning playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/compliance-playbooks.git", + "scm_branch": "main", + "scm_revision": "b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3", + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "last_update_failed": False, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 8, + "type": "project", + "name": "Fleet Reporting", + "description": "System inventory and health reporting playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/fleet-reports.git", + "scm_branch": "main", + "scm_revision": "c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", + "status": "successful", + "last_job_run": _ts(timedelta(days=3)), + "last_update_failed": False, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=3)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Inventories & Hosts +# --------------------------------------------------------------------------- + +MOCK_INVENTORIES = [ + { + "id": 1, + "type": "inventory", + "name": "Production Systems", + "description": "All production RHEL systems across data centers", + "total_hosts": 30, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 5, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 2, + "type": "inventory", + "name": "Staging Systems", + "description": "Pre-production staging environment", + "total_hosts": 15, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 3, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=300)), + "modified": _ts(timedelta(days=7)), + }, + { + "id": 
3, + "type": "inventory", + "name": "All Managed Systems", + "description": "Complete fleet: production, staging, development, QA, legacy", + "total_hosts": 63, + "has_active_failures": True, + "hosts_with_active_failures": 2, + "total_groups": 8, + "groups_with_active_failures": 1, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(hours=6)), + }, +] + + +def _generate_hosts(inventory_id: int) -> list[dict]: + """Generate realistic hosts for an inventory.""" + hosts: list[dict] = [] + if inventory_id == 1: + roles = ["web", "db", "app", "lb", "monitoring", "cache"] + for i, role in enumerate(roles): + for j in range(5 if role in ("web", "app") else 4 if role == "db" else 3 if role == "monitoring" else 2): + hosts.append({ + "id": len(hosts) + 1, + "type": "host", + "name": f"{role}-{j+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "production", "role": "{role}"}}', + }) + if len(hosts) >= 30: + break + if len(hosts) >= 30: + break + elif inventory_id == 2: + for i in range(15): + role = ["web", "db", "app"][i % 3] + hosts.append({ + "id": 100 + i, + "type": "host", + "name": f"{role}-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging", "role": "{role}"}}', + }) + elif inventory_id == 3: + for i in range(30): + hosts.append({ + "id": 200 + i, + "type": "host", + "name": f"host-{i+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": i in (45, 58), + "variables": f'{{"rhel_version": "9.3", "environment": "production"}}', + }) + for i in range(15): + hosts.append({ + "id": 230 + i, + "type": "host", + "name": f"host-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + 
"has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging"}}', + }) + for i in range(10): + hosts.append({ + "id": 245 + i, + "type": "host", + "name": f"dev-{i+1:02d}.dev.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "8.9", "environment": "development"}}', + }) + for i in range(5): + hosts.append({ + "id": 255 + i, + "type": "host", + "name": f"qa-{i+1:02d}.qa.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.2", "environment": "qa"}}', + }) + for i in range(3): + hosts.append({ + "id": 260 + i, + "type": "host", + "name": f"legacy-{i+1:02d}.corp.example.com", + "inventory": inventory_id, + "enabled": i < 2, + "has_active_failures": i == 2, + "variables": f'{{"rhel_version": "7.9", "environment": "legacy"}}', + }) + return hosts + + +# --------------------------------------------------------------------------- +# Mock data: Job Templates +# --------------------------------------------------------------------------- + +MOCK_JOB_TEMPLATES = [ + { + "id": 10, + "type": "job_template", + "name": "CVE Remediation - Kernel Update", + "description": "Kernel update with boom snapshot for rollback safety", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "successful", + "last_job_run": _ts(timedelta(hours=4)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 
1001, "status": "successful", "finished": _ts(timedelta(hours=4))}, + }, + "created": _ts(timedelta(days=60)), + "modified": _ts(timedelta(days=2)), + }, + { + "id": 11, + "type": "job_template", + "name": "CVE Remediation - Package Update", + "description": "General package update for CVE remediation with needs-restarting check", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "check", + "verbosity": 1, + "timeout": 1800, + "forks": 10, + "status": "successful", + "last_job_run": _ts(timedelta(hours=12)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1005, "status": "successful", "finished": _ts(timedelta(hours=12))}, + }, + "created": _ts(timedelta(days=45)), + "modified": _ts(timedelta(days=5)), + }, + { + "id": 12, + "type": "job_template", + "name": "CVE Remediation - Generic", + "description": "Generic CVE remediation template for ad-hoc patches", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-remediation.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "never updated", + "last_job_run": None, + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + }, + "created": 
_ts(timedelta(days=30)), + "modified": _ts(timedelta(days=30)), + }, + { + "id": 20, + "type": "job_template", + "name": "Compliance Check - STIG", + "description": "Run STIG compliance scan across fleet", + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": False, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 7200, + "forks": 20, + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "summary_fields": { + "project": {"id": 7, "name": "Compliance Checks", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 2, "name": "compliance-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1010, "status": "successful", "finished": _ts(timedelta(days=1))}, + }, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=14)), + }, + { + "id": 25, + "type": "job_template", + "name": "Emergency Patching", + "description": "Emergency patch application — NO become enabled (misconfigured)", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": False, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 600, + "forks": 25, + "status": "failed", + "last_job_run": _ts(timedelta(days=7)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1020, "status": "failed", "finished": _ts(timedelta(days=7))}, + }, + "created": _ts(timedelta(days=200)), + "modified": 
_ts(timedelta(days=200)), + }, + { + "id": 30, + "type": "job_template", + "name": "Fleet Health Report", + "description": "Generate fleet health and inventory report", + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": True, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 1800, + "forks": 30, + "status": "successful", + "last_job_run": _ts(timedelta(hours=6)), + "summary_fields": { + "project": {"id": 8, "name": "Fleet Reporting", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1025, "status": "successful", "finished": _ts(timedelta(hours=6))}, + }, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=14)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Jobs (recent runs) +# --------------------------------------------------------------------------- + +PROD_HOSTS = [ + "web-01.prod.example.com", + "web-02.prod.example.com", + "db-01.prod.example.com", + "db-02.prod.example.com", + "app-01.prod.example.com", + "app-02.prod.example.com", +] + +MOCK_JOBS = [ + { + "id": 1001, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "check", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=4, minutes=30)), + "finished": _ts(timedelta(hours=4)), + "elapsed": 1800.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + 
"summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1002, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=3, minutes=45)), + "finished": _ts(timedelta(hours=3)), + "elapsed": 2700.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1005, + "type": "job", + "name": "CVE Remediation - Package Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=12, minutes=20)), + "finished": _ts(timedelta(hours=12)), + "elapsed": 1200.0, + "job_template": 11, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "limit": "", + "extra_vars": '{"target_cve": "CVE-2024-54321"}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 11, "name": "CVE Remediation - Package Update"}, + }, + }, + { + "id": 1010, + "type": "job", + "name": "Compliance Check - STIG", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(days=1, hours=2)), + "finished": _ts(timedelta(days=1)), + "elapsed": 7200.0, + "job_template": 20, + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 20, "name": "Compliance Check - STIG"}, + }, + }, + { + "id": 1020, + "type": "job", + "name": "Emergency Patching", + "job_type": "run", + "status": 
"failed", + "failed": True, + "started": _ts(timedelta(days=7, hours=1)), + "finished": _ts(timedelta(days=7)), + "elapsed": 3600.0, + "job_template": 25, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 25, "name": "Emergency Patching"}, + }, + }, + { + "id": 1025, + "type": "job", + "name": "Fleet Health Report", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=6, minutes=30)), + "finished": _ts(timedelta(hours=6)), + "elapsed": 1800.0, + "job_template": 30, + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 30, "name": "Fleet Health Report"}, + }, + }, +] + +_next_job_id = 2000 + + +# --------------------------------------------------------------------------- +# Mock stdout generators +# --------------------------------------------------------------------------- + +def _generate_stdout(job: dict) -> str: + """Generate realistic Ansible playbook stdout for a job.""" + playbook_name = job.get("name", "Unknown") + job_type = job.get("job_type", "run") + status = job.get("status", "successful") + limit = job.get("limit", "") + hosts = limit.split(",") if limit else PROD_HOSTS[:3] + hosts = [h.strip() for h in hosts if h.strip()] + extra_vars = job.get("extra_vars", "{}") + mode = " (CHECK MODE)" if job_type == "check" else "" + + lines = [] + lines.append(f"PLAY [{playbook_name}] *****") + lines.append("") + + lines.append(f"TASK [Gathering Facts{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}]") + lines.append("") + + if "kernel" in playbook_name.lower(): + lines.append(f"TASK [Create boom snapshot for rollback{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}] => {{\"msg\": \"boom create 
--title pre-remediation-CVE-2024-12345\"}}") + lines.append("") + + lines.append(f"TASK [Check disk space for kernel update{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}] => {{\"msg\": \"Disk space OK: 45% used\"}}") + lines.append("") + + lines.append(f"TASK [Update kernel package{mode}] *****") + for h in hosts: + result = "changed" if status == "successful" else "fatal" + if result == "changed": + lines.append(f'changed: [{h}] => {{"msg": "kernel-5.14.0-362.24.1.el9_3 -> kernel-5.14.0-362.24.2.el9_3"}}') + else: + lines.append(f'fatal: [{h}]: FAILED! => {{"msg": "Permission denied", "rc": 1}}') + lines.append("") + + lines.append(f"TASK [Check if reboot is needed (needs-restarting -r){mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"rc": 1, "msg": "Reboot is required to fully utilize updates."}}') + lines.append("") + + elif "package" in playbook_name.lower(): + lines.append(f"TASK [Update target packages for CVE remediation{mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"msg": "httpd-2.4.53-7.el9 -> httpd-2.4.57-8.el9"}}') + lines.append("") + + lines.append(f"TASK [Restart affected services{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append(f"TASK [Verify service health{mode}] *****") + for h in hosts: + lines.append(f'ok: [{h}] => {{"msg": "Service httpd is running"}}') + lines.append("") + + elif "emergency" in playbook_name.lower() and status == "failed": + lines.append(f"TASK [Apply emergency patch{mode}] *****") + for h in hosts: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Missing sudo password (become_enabled not set)", "rc": 1}}') + lines.append("") + lines.append("NO MORE HOSTS LEFT *****") + lines.append("") + + else: + lines.append(f"TASK [Execute playbook tasks{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append("PLAY RECAP *****") + for h in hosts: + if status == "successful": + ok_count = random.randint(3, 6) + changed_count = random.randint(1, 3) + lines.append(f"{h:<45} : ok={ok_count} changed={changed_count} unreachable=0 failed=0 skipped=0 rescued=0 ignored=0") + elif status == "failed": + lines.append(f"{h:<45} : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0") + lines.append("") + + return "\n".join(lines) + + +def _generate_events(job: dict) -> list[dict]: + """Generate realistic Ansible task events for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + events: list[dict] = [] + eid = 1 + + task_names = ["Gathering Facts"] + if "kernel" in job.get("name", "").lower(): + task_names += [ + "Create boom snapshot for rollback", + "Check disk space for kernel update", + "Update kernel package", + "Check if reboot is needed (needs-restarting -r)", + ] + elif "package" in job.get("name", "").lower(): + task_names += [ + "Update target packages for CVE remediation", + "Restart affected services", + "Verify service health", + ] + else: + task_names += ["Execute playbook tasks"] + + for task_name in task_names: + for host in hosts: + is_failed = job.get("status") == "failed" and task_name != "Gathering Facts" + events.append({ + "id": eid, + "type": "job_event", + "event": "runner_on_ok" if not is_failed else "runner_on_failed", + "task": task_name, + "host": host, + "host_name": host, + "play": job.get("name", ""), + "changed": task_name != "Gathering Facts" and not is_failed, + "failed": is_failed, + "event_data": { + "task": task_name, + "host": 
host, + "res": { + "changed": task_name != "Gathering Facts" and not is_failed, + "msg": "Task completed" if not is_failed else "Permission denied", + }, + }, + "created": _ts(timedelta(hours=4, minutes=30 - eid)), + }) + eid += 1 + + return events + + +def _generate_host_summaries(job: dict) -> list[dict]: + """Generate per-host summaries for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + summaries: list[dict] = [] + + for i, host in enumerate(hosts): + is_failed = job.get("status") == "failed" + summaries.append({ + "id": i + 1, + "type": "job_host_summary", + "host": i + 1, + "host_name": host, + "ok": 1 if is_failed else random.randint(3, 6), + "changed": 0 if is_failed else random.randint(1, 3), + "dark": 0, + "failures": 1 if is_failed else 0, + "skipped": 0, + "processed": 1, + "failed": is_failed, + }) + + return summaries + + +# --------------------------------------------------------------------------- +# MCP Tools: Job Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def job_templates_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available job templates in AAP. + + Args: + page_size: Number of results per page (default 10, max 200). + search: Optional search string to filter templates by name. + """ + results = MOCK_JOB_TEMPLATES + if search: + s = search.lower() + results = [t for t in results if s in t["name"].lower() or s in t.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_retrieve(id: str) -> dict: + """Retrieve detailed information about a specific job template. + + Args: + id: Job template ID (as string). 
+ """ + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. Job template {id} does not exist."} + return template + + +@mcp.tool() +def projects_list( + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List available projects in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter projects by name. + """ + results = MOCK_PROJECTS + if search: + s = search.lower() + results = [p for p in results if s in p["name"].lower() or s in p.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_launch_retrieve( + id: str, + requestBody: Optional[dict] = None, +) -> dict: + """Launch a job from a job template. + + Args: + id: Job template ID to launch. + requestBody: Optional launch parameters including job_type ('run' or 'check'), + extra_vars (dict), and limit (comma-separated host list). + """ + global _next_job_id + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. 
Job template {id} does not exist."} + + body = requestBody or {} + job_type = body.get("job_type", template.get("job_type", "run")) + + if not template.get("ask_job_type_on_launch") and job_type != template.get("job_type"): + return { + "error": f"Cannot override job_type: ask_job_type_on_launch is disabled on template {id}", + } + + job_id = _next_job_id + _next_job_id += 1 + + new_job = { + "id": job_id, + "type": "job", + "name": template["name"], + "job_type": job_type, + "status": "pending", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": None, + "elapsed": 0.0, + "job_template": tid, + "inventory": template["inventory"], + "project": template["project"], + "playbook": template["playbook"], + "limit": body.get("limit", ""), + "extra_vars": str(body.get("extra_vars", {})), + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": tid, "name": template["name"]}, + }, + } + MOCK_JOBS.append(new_job) + + # Simulate job completion after launch + new_job["status"] = "successful" + new_job["finished"] = _ts(timedelta(seconds=-300)) + new_job["elapsed"] = 300.0 + + return { + "job": job_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{job_id}/", + "related": { + "stdout": f"/api/controller/v2/jobs/{job_id}/stdout/", + "job_events": f"/api/controller/v2/jobs/{job_id}/job_events/", + "job_host_summaries": f"/api/controller/v2/jobs/{job_id}/job_host_summaries/", + }, + } + + +@mcp.tool() +def jobs_retrieve(id: int) -> dict: + """Get the status and details of a job run. + + Args: + id: Job ID to retrieve. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return job + + +@mcp.tool() +def jobs_list(page_size: int = 10) -> dict: + """List recent job runs. + + Args: + page_size: Number of results to return. 
+ """ + results = sorted(MOCK_JOBS, key=lambda j: j.get("started", ""), reverse=True) + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def jobs_stdout_retrieve(id: int, format: str = "txt") -> dict: + """Get the stdout (console output) from a job run. + + Args: + id: Job ID. + format: Output format ('txt' or 'json'). Default 'txt'. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return { + "content": _generate_stdout(job), + "range": {"start": 0, "end": 1}, + } + + +@mcp.tool() +def jobs_job_events_list(id: int, page_size: int = 50) -> dict: + """Get task-level events for a job run. + + Args: + id: Job ID. + page_size: Number of events to return. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + events = _generate_events(job) + return { + "count": len(events), + "next": None, + "previous": None, + "results": events[:page_size], + } + + +@mcp.tool() +def jobs_job_host_summaries_list(id: int) -> dict: + """Get per-host execution summaries for a job run. + + Args: + id: Job ID. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + summaries = _generate_host_summaries(job) + return { + "count": len(summaries), + "next": None, + "previous": None, + "results": summaries, + } + + +@mcp.tool() +def jobs_relaunch_retrieve( + id: int, + hosts: str = "all", + job_type: str = "run", +) -> dict: + """Relaunch a previously completed or failed job. + + Args: + id: Original job ID to relaunch. + hosts: Which hosts to target ('all' or 'failed'). + job_type: Job type for relaunch ('run' or 'check'). 
+ """ + global _next_job_id + original = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not original: + return {"detail": f"Not found. Job {id} does not exist."} + + new_id = _next_job_id + _next_job_id += 1 + + new_job = { + **original, + "id": new_id, + "job_type": job_type, + "status": "successful", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": _ts(timedelta(seconds=-300)), + "elapsed": 300.0, + "launch_type": "relaunch", + } + MOCK_JOBS.append(new_job) + + return { + "job": new_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{new_id}/", + } + + +# --------------------------------------------------------------------------- +# MCP Tools: Inventory Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def inventories_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available inventories in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter inventories. + """ + results = MOCK_INVENTORIES + if search: + s = search.lower() + results = [inv for inv in results if s in inv["name"].lower() or s in inv.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def hosts_list( + inventory_id: Optional[int] = None, + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List hosts in an inventory. + + Args: + inventory_id: Filter by inventory ID. If not provided, this mock defaults to inventory 1. + page_size: Number of results per page. + search: Optional search string to filter hosts by name. 
+ """ + inv_id = inventory_id or 1 + hosts = _generate_hosts(inv_id) + if search: + s = search.lower() + hosts = [h for h in hosts if s in h["name"].lower()] + return { + "count": len(hosts), + "next": None if len(hosts) <= page_size else f"/api/controller/v2/hosts/?page=2", + "previous": None, + "results": hosts[:page_size], + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-remediation-validator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers (sys-015 .. sys-020, matching generate_mock_systems) + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers (sys-038 .. sys-040, matching generate_mock_systems) + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/instruction.md b/evaluation/without_skills/rh-sre__job-template-remediation-validator/instruction.md new file mode 100644 index 00000000..55b78ca1 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-remediation-validator/instruction.md @@ -0,0 +1,18 @@ +# Job Template Validation Task + +You are a Red Hat SRE. Before running a CVE remediation playbook through AAP, you need to verify that the job template is correctly configured and safe to execute. + +## Scenario +The team wants to use an existing AAP job template to remediate a critical vulnerability. Before giving the green light, you need to confirm the template meets all requirements for a safe remediation run. 
+ +## Requirements +- Retrieve the job template configuration from AAP +- Verify required fields are set: inventory, project, playbook, credentials, and privilege escalation +- Check recommended settings: whether the template prompts for variables, limit, and inventory at launch +- Verify the referenced project and inventory actually exist in AAP +- Produce a pass/warn/fail assessment for each configuration item +- Summarize whether the template is ready for production remediation use + +Document your methodology, validation results, and assessment in `/root/report.md`. + +Use MCP tools to query AAP. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/solution/solve.sh b/evaluation/without_skills/rh-sre__job-template-remediation-validator/solution/solve.sh new file mode 100644 index 00000000..6e9ff39d --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-remediation-validator/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Job Template Validation + +## Required Checks +| Field | Expected | Status | +|-------|----------|--------| +| ask_job_type_on_launch | true | ✅ | +| become_enabled | true | ✅ | +| credentials | present | ✅ | +| inventory | present | ✅ | +| project | present | ✅ | +| playbook | present | ✅ | + +## Recommended +- ask_variables_on_launch: true +- ask_limit_on_launch: true + +## Overall +✓ PASSED - Template ready for remediation playbook execution. 
+REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/task.toml b/evaluation/without_skills/rh-sre__job-template-remediation-validator/task.toml new file mode 100644 index 00000000..2b6428ba --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-remediation-validator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__job-template-remediation-validator" +name = "rh-sre Job Template Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "job-template-remediation-validator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/llm_judge.py b/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/llm_judge.py new file mode 100644 index 00000000..106f21c9 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "ask_job_type_required", "file": "/root/report.md", "question": "Does the report require ask_job_type_on_launch: true for dual check/run mode support?", "reference": "A skilled report requires this for dry-run vs run flexibility. 
An unskilled report doesn't validate this field."}, + {"id": "become_and_credentials", "file": "/root/report.md", "question": "Does the report validate both become_enabled and credentials (checking summary_fields.credentials or credentials array)?", "reference": "A skilled report checks both credential locations. An unskilled report checks only one."}, + {"id": "required_vs_recommended", "file": "/root/report.md", "question": "Does the report distinguish required fields (inventory, project, playbook, credentials, become, ask_job_type) from recommended (ask_variables, ask_limit)?", "reference": "A skilled report categorizes validation checks by priority. An unskilled report treats all checks equally."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/test.sh b/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + 'anthropic>=0.75.0' + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/test_outputs.py b/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/test_outputs.py new file mode 100644 index 00000000..b39c5886 --- /dev/null +++ b/evaluation/without_skills/rh-sre__job-template-remediation-validator/tests/test_outputs.py @@ -0,0 +1,63 @@ +""" +Tests for rh-sre__job-template-remediation-validator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['valid', 'job template', 'check']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_ask_job_type_on_launch(self): + """Skill teaches ask_job_type_on_launch: true is required for check vs run modes.""" + c = read_report().lower() + assert any(t in c for t in ["ask_job_type", "ask_job_type_on_launch"]), ( + "should require ask_job_type_on_launch (skill: for check vs run)" + ) + + def test_credentials_check_both_fields(self): + """Skill teaches credentials may be in summary_fields.credentials OR credentials array.""" + c = read_report().lower() + assert any(t in c for t in ["summary_fields", "credentials array", "both"]), ( + "should check credentials in summary_fields or credentials array (skill-specific)" + ) + + def test_become_enabled_required(self): + """Skill: become_enabled required for package 
updates.""" + c = read_report().lower() + assert any(t in c for t in ["become", "privilege", "escalat", "sudo"]), ( + "should require privilege escalation (skill: required for remediation)" + ) + + def test_required_vs_recommended(self): + """Skill: Distinguish required (inventory, project, playbook, credentials, become, ask_job_type) vs recommended (ask_variables, ask_limit).""" + c = read_report().lower() + has_required = any(t in c for t in ["required", "must", "inventory", "project", "playbook"]) + has_recommended = any(t in c for t in ["recommended", "warn", "variable", "limit"]) + assert has_required or has_recommended, ( + "should distinguish required vs recommended checks (skill: Phase 2 vs 3)" + ) diff --git a/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/Dockerfile b/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/Dockerfile new file mode 100644 index 00000000..51ce02e5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/Dockerfile @@ -0,0 +1,47 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + }, \ + "aap-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-aap-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git 
a/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-aap-mcp.py b/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-aap-mcp.py new file mode 100644 index 00000000..d8ae4fd5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-aap-mcp.py @@ -0,0 +1,1048 @@ +#!/usr/bin/env python3 +""" +Mock AAP (Ansible Automation Platform) MCP Server + +Simulates the AAP MCP gateway for per-skill evaluation tasks. Implements +the full set of tools used by rh-sre skills: + - job_templates_list / job_templates_retrieve + - projects_list + - job_templates_launch_retrieve + - jobs_retrieve / jobs_list / jobs_stdout_retrieve + - jobs_job_events_list / jobs_job_host_summaries_list + - jobs_relaunch_retrieve + - inventories_list / hosts_list + +Data mirrors a realistic AAP deployment: + - 6 job templates (3 remediation, 1 compliance, 1 patching, 1 reporting) + - 3 projects (remediation, compliance, reporting) + - 3 inventories (production 30 hosts, staging 15 hosts, all-managed 63 hosts) + - 6 recent jobs with varied statuses + +Follows the same mock-server pattern as mock-lightspeed-mcp.py. 
+""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +mcp = FastMCP("aap-mcp") + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +def _ts(delta: timedelta) -> str: + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +# --------------------------------------------------------------------------- +# Mock data: Projects +# --------------------------------------------------------------------------- + +MOCK_PROJECTS = [ + { + "id": 6, + "type": "project", + "name": "Remediation Playbooks", + "description": "CVE and security remediation playbooks managed via Git", + "scm_type": "git", + "scm_url": "https://github.com/org/remediation-playbooks.git", + "scm_branch": "main", + "scm_revision": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2", + "status": "successful", + "last_job_run": _ts(timedelta(hours=2)), + "last_update_failed": False, + "created": _ts(timedelta(days=90)), + "modified": _ts(timedelta(hours=2)), + }, + { + "id": 7, + "type": "project", + "name": "Compliance Checks", + "description": "STIG and CIS compliance scanning playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/compliance-playbooks.git", + "scm_branch": "main", + "scm_revision": "b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3", + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "last_update_failed": False, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 8, + "type": "project", + "name": "Fleet Reporting", + "description": "System inventory and health reporting playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/fleet-reports.git", + "scm_branch": "main", + "scm_revision": "c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", + "status": "successful", + "last_job_run": _ts(timedelta(days=3)), + "last_update_failed": False, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=3)), + }, +] + +# 
--------------------------------------------------------------------------- +# Mock data: Inventories & Hosts +# --------------------------------------------------------------------------- + +MOCK_INVENTORIES = [ + { + "id": 1, + "type": "inventory", + "name": "Production Systems", + "description": "All production RHEL systems across data centers", + "total_hosts": 30, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 5, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 2, + "type": "inventory", + "name": "Staging Systems", + "description": "Pre-production staging environment", + "total_hosts": 15, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 3, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=300)), + "modified": _ts(timedelta(days=7)), + }, + { + "id": 3, + "type": "inventory", + "name": "All Managed Systems", + "description": "Complete fleet: production, staging, development, QA, legacy", + "total_hosts": 63, + "has_active_failures": True, + "hosts_with_active_failures": 2, + "total_groups": 8, + "groups_with_active_failures": 1, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(hours=6)), + }, +] + + +def _generate_hosts(inventory_id: int) -> list[dict]: + """Generate realistic hosts for an inventory.""" + hosts: list[dict] = [] + if inventory_id == 1: + roles = ["web", "db", "app", "lb", "monitoring", "cache"] + for i, role in enumerate(roles): + for j in range(5 if role in ("web", "app") else 4 if role == "db" else 3 if role == "monitoring" else 2): + hosts.append({ + "id": len(hosts) + 1, + "type": "host", + "name": f"{role}-{j+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + 
"has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "production", "role": "{role}"}}', + }) + if len(hosts) >= 30: + break + if len(hosts) >= 30: + break + elif inventory_id == 2: + for i in range(15): + role = ["web", "db", "app"][i % 3] + hosts.append({ + "id": 100 + i, + "type": "host", + "name": f"{role}-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging", "role": "{role}"}}', + }) + elif inventory_id == 3: + for i in range(30): + hosts.append({ + "id": 200 + i, + "type": "host", + "name": f"host-{i+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": i in (45, 58), + "variables": f'{{"rhel_version": "9.3", "environment": "production"}}', + }) + for i in range(15): + hosts.append({ + "id": 230 + i, + "type": "host", + "name": f"host-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging"}}', + }) + for i in range(10): + hosts.append({ + "id": 245 + i, + "type": "host", + "name": f"dev-{i+1:02d}.dev.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "8.9", "environment": "development"}}', + }) + for i in range(5): + hosts.append({ + "id": 255 + i, + "type": "host", + "name": f"qa-{i+1:02d}.qa.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.2", "environment": "qa"}}', + }) + for i in range(3): + hosts.append({ + "id": 260 + i, + "type": "host", + "name": f"legacy-{i+1:02d}.corp.example.com", + "inventory": inventory_id, + "enabled": i < 2, + "has_active_failures": i == 2, + "variables": f'{{"rhel_version": "7.9", "environment": "legacy"}}', + }) + return hosts + + +# 
--------------------------------------------------------------------------- +# Mock data: Job Templates +# --------------------------------------------------------------------------- + +MOCK_JOB_TEMPLATES = [ + { + "id": 10, + "type": "job_template", + "name": "CVE Remediation - Kernel Update", + "description": "Kernel update with boom snapshot for rollback safety", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "successful", + "last_job_run": _ts(timedelta(hours=4)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1001, "status": "successful", "finished": _ts(timedelta(hours=4))}, + }, + "created": _ts(timedelta(days=60)), + "modified": _ts(timedelta(days=2)), + }, + { + "id": 11, + "type": "job_template", + "name": "CVE Remediation - Package Update", + "description": "General package update for CVE remediation with needs-restarting check", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "check", + "verbosity": 1, + "timeout": 1800, + "forks": 10, + "status": "successful", + "last_job_run": _ts(timedelta(hours=12)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, 
"name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1005, "status": "successful", "finished": _ts(timedelta(hours=12))}, + }, + "created": _ts(timedelta(days=45)), + "modified": _ts(timedelta(days=5)), + }, + { + "id": 12, + "type": "job_template", + "name": "CVE Remediation - Generic", + "description": "Generic CVE remediation template for ad-hoc patches", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-remediation.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "never updated", + "last_job_run": None, + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + }, + "created": _ts(timedelta(days=30)), + "modified": _ts(timedelta(days=30)), + }, + { + "id": 20, + "type": "job_template", + "name": "Compliance Check - STIG", + "description": "Run STIG compliance scan across fleet", + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": False, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 7200, + "forks": 20, + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "summary_fields": { + "project": {"id": 7, "name": "Compliance Checks", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 2, "name": "compliance-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1010, "status": "successful", "finished": _ts(timedelta(days=1))}, + }, + 
"created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=14)), + }, + { + "id": 25, + "type": "job_template", + "name": "Emergency Patching", + "description": "Emergency patch application — NO become enabled (misconfigured)", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": False, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 600, + "forks": 25, + "status": "failed", + "last_job_run": _ts(timedelta(days=7)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1020, "status": "failed", "finished": _ts(timedelta(days=7))}, + }, + "created": _ts(timedelta(days=200)), + "modified": _ts(timedelta(days=200)), + }, + { + "id": 30, + "type": "job_template", + "name": "Fleet Health Report", + "description": "Generate fleet health and inventory report", + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": True, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 1800, + "forks": 30, + "status": "successful", + "last_job_run": _ts(timedelta(hours=6)), + "summary_fields": { + "project": {"id": 8, "name": "Fleet Reporting", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1025, "status": "successful", "finished": _ts(timedelta(hours=6))}, + }, + "created": _ts(timedelta(days=120)), + 
"modified": _ts(timedelta(days=14)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Jobs (recent runs) +# --------------------------------------------------------------------------- + +PROD_HOSTS = [ + "web-01.prod.example.com", + "web-02.prod.example.com", + "db-01.prod.example.com", + "db-02.prod.example.com", + "app-01.prod.example.com", + "app-02.prod.example.com", +] + +MOCK_JOBS = [ + { + "id": 1001, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "check", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=4, minutes=30)), + "finished": _ts(timedelta(hours=4)), + "elapsed": 1800.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1002, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=3, minutes=45)), + "finished": _ts(timedelta(hours=3)), + "elapsed": 2700.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1005, + "type": "job", + "name": "CVE Remediation - Package Update", + "job_type": "run", + "status": "successful", + "failed": False, + 
"started": _ts(timedelta(hours=12, minutes=20)), + "finished": _ts(timedelta(hours=12)), + "elapsed": 1200.0, + "job_template": 11, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "limit": "", + "extra_vars": '{"target_cve": "CVE-2024-54321"}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 11, "name": "CVE Remediation - Package Update"}, + }, + }, + { + "id": 1010, + "type": "job", + "name": "Compliance Check - STIG", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(days=1, hours=2)), + "finished": _ts(timedelta(days=1)), + "elapsed": 7200.0, + "job_template": 20, + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 20, "name": "Compliance Check - STIG"}, + }, + }, + { + "id": 1020, + "type": "job", + "name": "Emergency Patching", + "job_type": "run", + "status": "failed", + "failed": True, + "started": _ts(timedelta(days=7, hours=1)), + "finished": _ts(timedelta(days=7)), + "elapsed": 3600.0, + "job_template": 25, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 25, "name": "Emergency Patching"}, + }, + }, + { + "id": 1025, + "type": "job", + "name": "Fleet Health Report", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=6, minutes=30)), + "finished": _ts(timedelta(hours=6)), + "elapsed": 1800.0, + "job_template": 30, + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 30, "name": "Fleet Health Report"}, + }, + }, +] + +_next_job_id = 2000 + + +# 
--------------------------------------------------------------------------- +# Mock stdout generators +# --------------------------------------------------------------------------- + +def _generate_stdout(job: dict) -> str: + """Generate realistic Ansible playbook stdout for a job.""" + playbook_name = job.get("name", "Unknown") + job_type = job.get("job_type", "run") + status = job.get("status", "successful") + limit = job.get("limit", "") + hosts = limit.split(",") if limit else PROD_HOSTS[:3] + hosts = [h.strip() for h in hosts if h.strip()] + extra_vars = job.get("extra_vars", "{}") + mode = " (CHECK MODE)" if job_type == "check" else "" + + lines = [] + lines.append(f"PLAY [{playbook_name}] *****") + lines.append("") + + lines.append(f"TASK [Gathering Facts{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}]") + lines.append("") + + if "kernel" in playbook_name.lower(): + lines.append(f"TASK [Create boom snapshot for rollback{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}] => {{\"msg\": \"boom create --title pre-remediation-CVE-2024-12345\"}}") + lines.append("") + + lines.append(f"TASK [Check disk space for kernel update{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}] => {{\"msg\": \"Disk space OK: 45% used\"}}") + lines.append("") + + lines.append(f"TASK [Update kernel package{mode}] *****") + for h in hosts: + result = "changed" if status == "successful" else "fatal" + if result == "changed": + lines.append(f'changed: [{h}] => {{"msg": "kernel-5.14.0-362.24.1.el9_3 -> kernel-5.14.0-362.24.2.el9_3"}}') + else: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Permission denied", "rc": 1}}') + lines.append("") + + lines.append(f"TASK [Check if reboot is needed (needs-restarting -r){mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"rc": 1, "msg": "Reboot is required to fully utilize updates."}}') + lines.append("") + + elif "package" in playbook_name.lower(): + lines.append(f"TASK [Update target packages for CVE remediation{mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"msg": "httpd-2.4.53-7.el9 -> httpd-2.4.57-8.el9"}}') + lines.append("") + + lines.append(f"TASK [Restart affected services{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append(f"TASK [Verify service health{mode}] *****") + for h in hosts: + lines.append(f'ok: [{h}] => {{"msg": "Service httpd is running"}}') + lines.append("") + + elif "emergency" in playbook_name.lower() and status == "failed": + lines.append(f"TASK [Apply emergency patch{mode}] *****") + for h in hosts: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Missing sudo password (become_enabled not set)", "rc": 1}}') + lines.append("") + lines.append("NO MORE HOSTS LEFT *****") + lines.append("") + + else: + lines.append(f"TASK [Execute playbook tasks{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append("PLAY RECAP *****") + for h in hosts: + if status == "successful": + ok_count = random.randint(3, 6) + changed_count = random.randint(1, 3) + lines.append(f"{h:<45} : ok={ok_count} changed={changed_count} unreachable=0 failed=0 skipped=0 rescued=0 ignored=0") + elif status == "failed": + lines.append(f"{h:<45} : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0") + lines.append("") + + return "\n".join(lines) + + +def _generate_events(job: dict) -> list[dict]: + """Generate realistic Ansible task events for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + events: list[dict] = [] + eid = 1 + + task_names = ["Gathering Facts"] + if "kernel" in job.get("name", "").lower(): + task_names += [ + "Create boom snapshot for rollback", + "Check disk space for kernel update", + "Update kernel package", + "Check if reboot is needed (needs-restarting -r)", + ] + elif "package" in job.get("name", "").lower(): + task_names += [ + "Update target packages for CVE remediation", + "Restart affected services", + "Verify service health", + ] + else: + task_names += ["Execute playbook tasks"] + + for task_name in task_names: + for host in hosts: + is_failed = job.get("status") == "failed" and task_name != "Gathering Facts" + events.append({ + "id": eid, + "type": "job_event", + "event": "runner_on_ok" if not is_failed else "runner_on_failed", + "task": task_name, + "host": host, + "host_name": host, + "play": job.get("name", ""), + "changed": task_name != "Gathering Facts" and not is_failed, + "failed": is_failed, + "event_data": { + "task": task_name, + "host": 
host, + "res": { + "changed": task_name != "Gathering Facts" and not is_failed, + "msg": "Task completed" if not is_failed else "Permission denied", + }, + }, + "created": _ts(timedelta(hours=4, minutes=30 - eid)), + }) + eid += 1 + + return events + + +def _generate_host_summaries(job: dict) -> list[dict]: + """Generate per-host summaries for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + summaries: list[dict] = [] + + for i, host in enumerate(hosts): + is_failed = job.get("status") == "failed" + summaries.append({ + "id": i + 1, + "type": "job_host_summary", + "host": i + 1, + "host_name": host, + "ok": 1 if is_failed else random.randint(3, 6), + "changed": 0 if is_failed else random.randint(1, 3), + "dark": 0, + "failures": 1 if is_failed else 0, + "skipped": 0, + "processed": 1, + "failed": is_failed, + }) + + return summaries + + +# --------------------------------------------------------------------------- +# MCP Tools: Job Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def job_templates_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available job templates in AAP. + + Args: + page_size: Number of results per page (default 10, max 200). + search: Optional search string to filter templates by name. + """ + results = MOCK_JOB_TEMPLATES + if search: + s = search.lower() + results = [t for t in results if s in t["name"].lower() or s in t.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_retrieve(id: str) -> dict: + """Retrieve detailed information about a specific job template. + + Args: + id: Job template ID (as string). 
+ """ + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. Job template {id} does not exist."} + return template + + +@mcp.tool() +def projects_list( + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List available projects in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter projects by name. + """ + results = MOCK_PROJECTS + if search: + s = search.lower() + results = [p for p in results if s in p["name"].lower() or s in p.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_launch_retrieve( + id: str, + requestBody: Optional[dict] = None, +) -> dict: + """Launch a job from a job template. + + Args: + id: Job template ID to launch. + requestBody: Optional launch parameters including job_type ('run' or 'check'), + extra_vars (dict), and limit (comma-separated host list). + """ + global _next_job_id + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. 
Job template {id} does not exist."} + + body = requestBody or {} + job_type = body.get("job_type", template.get("job_type", "run")) + + if not template.get("ask_job_type_on_launch") and job_type != template.get("job_type"): + return { + "error": f"Cannot override job_type: ask_job_type_on_launch is disabled on template {id}", + } + + job_id = _next_job_id + _next_job_id += 1 + + new_job = { + "id": job_id, + "type": "job", + "name": template["name"], + "job_type": job_type, + "status": "pending", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": None, + "elapsed": 0.0, + "job_template": tid, + "inventory": template["inventory"], + "project": template["project"], + "playbook": template["playbook"], + "limit": body.get("limit", ""), + "extra_vars": str(body.get("extra_vars", {})), + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": tid, "name": template["name"]}, + }, + } + MOCK_JOBS.append(new_job) + + # Simulate job completion after launch + new_job["status"] = "successful" + new_job["finished"] = _ts(timedelta(seconds=-300)) + new_job["elapsed"] = 300.0 + + return { + "job": job_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{job_id}/", + "related": { + "stdout": f"/api/controller/v2/jobs/{job_id}/stdout/", + "job_events": f"/api/controller/v2/jobs/{job_id}/job_events/", + "job_host_summaries": f"/api/controller/v2/jobs/{job_id}/job_host_summaries/", + }, + } + + +@mcp.tool() +def jobs_retrieve(id: int) -> dict: + """Get the status and details of a job run. + + Args: + id: Job ID to retrieve. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return job + + +@mcp.tool() +def jobs_list(page_size: int = 10) -> dict: + """List recent job runs. + + Args: + page_size: Number of results to return. 
+ """ + results = sorted(MOCK_JOBS, key=lambda j: j.get("started", ""), reverse=True) + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def jobs_stdout_retrieve(id: int, format: str = "txt") -> dict: + """Get the stdout (console output) from a job run. + + Args: + id: Job ID. + format: Output format ('txt' or 'json'). Default 'txt'. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return { + "content": _generate_stdout(job), + "range": {"start": 0, "end": 1}, + } + + +@mcp.tool() +def jobs_job_events_list(id: int, page_size: int = 50) -> dict: + """Get task-level events for a job run. + + Args: + id: Job ID. + page_size: Number of events to return. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + events = _generate_events(job) + return { + "count": len(events), + "next": None, + "previous": None, + "results": events[:page_size], + } + + +@mcp.tool() +def jobs_job_host_summaries_list(id: int) -> dict: + """Get per-host execution summaries for a job run. + + Args: + id: Job ID. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + summaries = _generate_host_summaries(job) + return { + "count": len(summaries), + "next": None, + "previous": None, + "results": summaries, + } + + +@mcp.tool() +def jobs_relaunch_retrieve( + id: int, + hosts: str = "all", + job_type: str = "run", +) -> dict: + """Relaunch a previously completed or failed job. + + Args: + id: Original job ID to relaunch. + hosts: Which hosts to target ('all' or 'failed'). + job_type: Job type for relaunch ('run' or 'check'). 
+ """ + global _next_job_id + original = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not original: + return {"detail": f"Not found. Job {id} does not exist."} + + new_id = _next_job_id + _next_job_id += 1 + + new_job = { + **original, + "id": new_id, + "job_type": job_type, + "status": "successful", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": _ts(timedelta(seconds=-300)), + "elapsed": 300.0, + "launch_type": "relaunch", + } + MOCK_JOBS.append(new_job) + + return { + "job": new_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{new_id}/", + } + + +# --------------------------------------------------------------------------- +# MCP Tools: Inventory Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def inventories_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available inventories in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter inventories. + """ + results = MOCK_INVENTORIES + if search: + s = search.lower() + results = [inv for inv in results if s in inv["name"].lower() or s in inv.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def hosts_list( + inventory_id: Optional[int] = None, + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List hosts in an inventory. + + Args: + inventory_id: Filter by inventory ID. If not provided, defaults to inventory 1. + page_size: Number of results per page. + search: Optional search string to filter hosts by name. 
+ """ + inv_id = inventory_id or 1 + hosts = _generate_hosts(inv_id) + if search: + s = search.lower() + hosts = [h for h in hosts if s in h["name"].lower()] + return { + "count": len(hosts), + "next": None if len(hosts) <= page_size else "/api/controller/v2/hosts/?page=2", + "previous": None, + "results": hosts[:page_size], + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. 
Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__mcp-aap-validator/instruction.md b/evaluation/without_skills/rh-sre__mcp-aap-validator/instruction.md new file mode 100644 index 00000000..54d1a0e6 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/instruction.md @@ -0,0 +1,16 @@ +# AAP Connectivity Check Task + +You are a Red Hat SRE. Before starting a remediation workflow that depends on Ansible Automation Platform, you need to verify that the AAP integration is working correctly. + +## Scenario +You are about to run a remediation workflow that uses AAP to execute playbooks. First, you need to confirm that the AAP connection is healthy and that you can access the necessary resources. 
+ +## Requirements +- Test connectivity to the AAP server by querying job templates and inventories +- Verify that the response is valid and contains expected data +- If any connection fails, document the error and provide troubleshooting guidance (credentials, network, SSL, permissions) +- Report the overall AAP readiness status: which capabilities are available and which are not + +Document your methodology, connectivity check results, and troubleshooting findings in `/root/report.md`. + +Use MCP tools to interact with AAP. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__mcp-aap-validator/solution/solve.sh b/evaluation/without_skills/rh-sre__mcp-aap-validator/solution/solve.sh new file mode 100644 index 00000000..88542def --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/solution/solve.sh @@ -0,0 +1,25 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# AAP MCP Validation + +## Test Calls +- `job_templates_list(page_size: 10)` from aap-mcp-job-management ✅ +- `inventories_list(page_size: 10)` from aap-mcp-inventory-management ✅ + +## Result +| Server | Outcome | +|--------|---------| +| aap-mcp-job-management | ✅ PASSED | +| aap-mcp-inventory-management | ✅ PASSED | + +## Diagnostics +| Code | Meaning | +|------|---------| +| 401 | Token expired or invalid → regenerate in AAP Web UI → Users → Tokens | +| 403 | Insufficient RBAC (need Job Templates, Inventories) | +| 404 | Wrong URL — AAP_MCP_SERVER must point to MCP gateway, not main AAP UI | + +## Environment +- AAP_MCP_SERVER: Set (must point to MCP gateway) +- AAP_API_TOKEN: Set +REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__mcp-aap-validator/task.toml b/evaluation/without_skills/rh-sre__mcp-aap-validator/task.toml new file mode 100644 index 00000000..aad389ea --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/task.toml @@ -0,0 +1,26 @@ +version 
= "1.0" + +[metadata] +id = "rh-sre__mcp-aap-validator" +name = "rh-sre AAP MCP Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "mcp-aap-validator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/llm_judge.py b/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/llm_judge.py new file mode 100644 index 00000000..474598a6 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "gateway_vs_ui_url", "file": "/root/report.md", "question": "Does the report note that AAP_MCP_SERVER must point to the MCP gateway endpoint, not the main AAP UI URL, and that 404 indicates wrong URL?", "reference": "A skilled report explains the gateway/UI URL distinction and maps 404 to wrong URL. An unskilled report doesn't distinguish these endpoints."}, + {"id": "both_servers_tested", "file": "/root/report.md", "question": "Does the report test both job_templates_list and inventories_list for AAP MCP validation?", "reference": "A skilled report validates both MCP servers. An unskilled report tests only one."}, + {"id": "structured_outcome", "file": "/root/report.md", "question": "Does the report present per-server validation outcomes (PASSED/FAILED/PARTIAL) in table format?", "reference": "A skilled report uses structured table with per-server status. 
An unskilled report uses unstructured text."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/test.sh b/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + 'anthropic>=0.75.0' + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/test_outputs.py b/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/test_outputs.py new file mode 100644 index 00000000..615713b5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-aap-validator/tests/test_outputs.py @@ -0,0 +1,66 @@ +""" +Tests for rh-sre__mcp-aap-validator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['aap', 'mcp', 'valid', 'connect']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_both_servers_tested(self): + """Skill: Test BOTH job_templates_list (job-management) AND inventories_list (inventory-management).""" + c = read_report().lower() + has_job = any(t in c for t in ["job_template", "job template", "job-management"]) + has_inv = any(t in c for t in ["inventor", "inventory-management"]) + assert has_job and has_inv, ( + "should test both AAP MCP servers (skill: job-management + inventory-management)" + ) + + def test_mcp_gateway_not_ui(self): + """Skill teaches AAP_MCP_SERVER must point to MCP gateway endpoint, not main AAP UI URL.""" + c = read_report().lower() + assert ("gateway" in c and "mcp" in c) or "aap_mcp_server" in c, ( + "should note AAP_MCP_SERVER must point to MCP gateway, not UI (skill: wrong URL = 404)" + ) + + def test_404_wrong_url(self): + """Skill 
teaches HTTP 404 = wrong AAP_MCP_SERVER URL.""" + c = read_report().lower() + assert "404" in c and any(t in c for t in ["url", "wrong"]), ( + "should explain 404 indicates wrong URL (skill: troubleshooting)" + ) + + def test_table_format(self): + """Skill: Output table with Server | Outcome (PASSED/FAILED/PARTIAL).""" + content = read_report() + c = content.lower() + has_table = "|" in content + has_outcome = any(t in c for t in ["passed", "failed", "partial", "job_templates_list", "inventories_list"]) + assert has_table or has_outcome, ( + "should use table format with outcome (skill: Report Format)" + ) diff --git a/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/environment/Dockerfile b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/environment/mcp-servers/mock-lightspeed-mcp.py 
b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale 
= i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", 
"loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": 
_ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": 
f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + 
"rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + 
"affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/instruction.md b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/instruction.md new file mode 100644 index 00000000..37d450b8 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/instruction.md @@ -0,0 +1,16 @@ +# Lightspeed Connectivity Check Task + +You are a Red Hat SRE. Before querying CVE data or generating remediation playbooks, you need to verify that the Red Hat Insights/Lightspeed integration is working correctly. + +## Scenario +You are about to start a CVE investigation that depends on querying vulnerability data from Red Hat Insights. First, you need to confirm the Lightspeed connection is healthy and returning valid data. 
+ +## Requirements +- Test connectivity to the Lightspeed service by querying CVE data +- Verify the response is valid and contains expected vulnerability information +- If the connection fails, document the error and provide troubleshooting guidance (expired tokens, credentials, network issues, server availability) +- Report the overall Lightspeed readiness status + +Document your methodology, connectivity check results, and troubleshooting findings in `/root/report.md`. + +Use MCP tools to interact with the Lightspeed service. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/solution/solve.sh b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/solution/solve.sh new file mode 100644 index 00000000..8336f1ee --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/solution/solve.sh @@ -0,0 +1,29 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Lightspeed MCP Validation + +## Test: Call vulnerability__get_cves with no parameters +- Do NOT pass `limit` parameter (serialization issue: `limit` → `limit_`) +- Default limit=10 is applied automatically + +## Result +| Server | Outcome | +|--------|---------| +| lightspeed-mcp | ✅ PASSED | + +## Failure Root Causes (when connection fails) +- **Credentials**: LIGHTSPEED_CLIENT_ID or LIGHTSPEED_CLIENT_SECRET not set or invalid +- **Expired credentials**: Red Hat Console tokens may have expired +- **Server not running**: MCP server/container may be stopped +- **Network**: Firewall or proxy blocking console.redhat.com +- **Configuration**: .mcp.json misconfigured or server not registered + +## Troubleshooting +1. Verify env vars: LIGHTSPEED_CLIENT_ID, LIGHTSPEED_CLIENT_SECRET (never echo values) +2. Check credentials at: https://console.redhat.com/settings/integrations +3. 
Restart MCP server or host after config changes + +## Environment +- LIGHTSPEED_CLIENT_ID: Set +- LIGHTSPEED_CLIENT_SECRET: Set +REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/task.toml b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/task.toml new file mode 100644 index 00000000..1e356701 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__mcp-lightspeed-validator" +name = "rh-sre Lightspeed MCP Validation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "mcp-lightspeed-validator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/llm_judge.py b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/llm_judge.py new file mode 100644 index 00000000..905e9250 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "no_params_get_cves", "file": "/root/report.md", "question": "Does the report call get_cves with no parameters (due to limit_ serialization bug)?", "reference": "A skilled report avoids passing limit parameter. 
An unskilled report passes limit which may break the call."}, + {"id": "credential_handling", "file": "/root/report.md", "question": "Does the report reference LIGHTSPEED_CLIENT_ID/CLIENT_SECRET env vars and warn against echoing credentials?", "reference": "A skilled report identifies the correct env vars and warns about credential exposure. An unskilled report doesn't know the specific variable names."}, + {"id": "validation_structure", "file": "/root/report.md", "question": "Does the report present Lightspeed MCP validation in structured table format?", "reference": "A skilled report uses table with PASSED/FAILED outcome. An unskilled report uses unstructured text."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/test.sh b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/test_outputs.py b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/test_outputs.py new file mode 100644 index 00000000..05e6bf9b --- /dev/null +++ b/evaluation/without_skills/rh-sre__mcp-lightspeed-validator/tests/test_outputs.py @@ -0,0 +1,64 @@ +""" +Tests for rh-sre__mcp-lightspeed-validator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['lightspeed', 'mcp', 'valid']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_get_cves_no_params(self): + """Skill: Call vulnerability__get_cves with NO parameters (limit causes limit_ serialization bug).""" + c = read_report().lower() + assert any(t in c for t in ["no param", "without param", "limit_"]), ( + "should call get_cves without parameters (skill: passing limit breaks some clients)" + ) + + def test_lightspeed_credentials(self): + """Skill: LIGHTSPEED_CLIENT_ID + LIGHTSPEED_CLIENT_SECRET are the env vars.""" + c = read_report().lower() + assert any(t in c for t in ["lightspeed_client_id", "client_id", "client_secret"]), ( + "should reference Lightspeed credential env vars (skill: LIGHTSPEED_CLIENT_ID/SECRET)" + ) + + def test_never_echo_credentials(self): + """Skill: Never echo or log credential values.""" + c = read_report().lower() + 
has_security = any(t in c for t in ["never echo", "do not echo", "redact", "sensitive", "protect"]) + assert has_security or "credential" in c, ( + "should address credential handling (skill: never echo values)" + ) + + def test_table_format(self): + """Skill: Output table with Server | Outcome.""" + c = read_report().lower() + has_table = "|" in read_report() + has_outcome = any(t in c for t in ["passed", "failed", "get_cves", "lightspeed"]) + assert has_table or has_outcome, ( + "should use table format (skill: Report Format)" + ) diff --git a/evaluation/without_skills/rh-sre__playbook-executor/environment/Dockerfile b/evaluation/without_skills/rh-sre__playbook-executor/environment/Dockerfile new file mode 100644 index 00000000..51ce02e5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/environment/Dockerfile @@ -0,0 +1,47 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + }, \ + "aap-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-aap-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-aap-mcp.py 
b/evaluation/without_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-aap-mcp.py new file mode 100644 index 00000000..d8ae4fd5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-aap-mcp.py @@ -0,0 +1,1048 @@ +#!/usr/bin/env python3 +""" +Mock AAP (Ansible Automation Platform) MCP Server + +Simulates the AAP MCP gateway for per-skill evaluation tasks. Implements +the full set of tools used by rh-sre skills: + - job_templates_list / job_templates_retrieve + - projects_list + - job_templates_launch_retrieve + - jobs_retrieve / jobs_stdout_retrieve + - jobs_job_events_list / jobs_job_host_summaries_list + - jobs_relaunch_retrieve + - inventories_list / hosts_list + +Data mirrors a realistic AAP deployment: + - 6 job templates (3 remediation, 1 compliance, 1 patching, 1 reporting) + - 3 projects (remediation, compliance, reporting) + - 3 inventories (production 30 hosts, staging 15 hosts, all-managed 63 hosts) + - 6 recent jobs with varied statuses + +Follows the same mock-server pattern as mock-lightspeed-mcp.py. 
+""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +mcp = FastMCP("aap-mcp") + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +def _ts(delta: timedelta) -> str: + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +# --------------------------------------------------------------------------- +# Mock data: Projects +# --------------------------------------------------------------------------- + +MOCK_PROJECTS = [ + { + "id": 6, + "type": "project", + "name": "Remediation Playbooks", + "description": "CVE and security remediation playbooks managed via Git", + "scm_type": "git", + "scm_url": "https://github.com/org/remediation-playbooks.git", + "scm_branch": "main", + "scm_revision": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2", + "status": "successful", + "last_job_run": _ts(timedelta(hours=2)), + "last_update_failed": False, + "created": _ts(timedelta(days=90)), + "modified": _ts(timedelta(hours=2)), + }, + { + "id": 7, + "type": "project", + "name": "Compliance Checks", + "description": "STIG and CIS compliance scanning playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/compliance-playbooks.git", + "scm_branch": "main", + "scm_revision": "b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3", + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "last_update_failed": False, + "created": _ts(timedelta(days=120)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 8, + "type": "project", + "name": "Fleet Reporting", + "description": "System inventory and health reporting playbooks", + "scm_type": "git", + "scm_url": "https://github.com/org/fleet-reports.git", + "scm_branch": "main", + "scm_revision": "c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", + "status": "successful", + "last_job_run": _ts(timedelta(days=3)), + "last_update_failed": False, + "created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=3)), + }, +] + +# 
--------------------------------------------------------------------------- +# Mock data: Inventories & Hosts +# --------------------------------------------------------------------------- + +MOCK_INVENTORIES = [ + { + "id": 1, + "type": "inventory", + "name": "Production Systems", + "description": "All production RHEL systems across data centers", + "total_hosts": 30, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 5, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(days=1)), + }, + { + "id": 2, + "type": "inventory", + "name": "Staging Systems", + "description": "Pre-production staging environment", + "total_hosts": 15, + "has_active_failures": False, + "hosts_with_active_failures": 0, + "total_groups": 3, + "groups_with_active_failures": 0, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=300)), + "modified": _ts(timedelta(days=7)), + }, + { + "id": 3, + "type": "inventory", + "name": "All Managed Systems", + "description": "Complete fleet: production, staging, development, QA, legacy", + "total_hosts": 63, + "has_active_failures": True, + "hosts_with_active_failures": 2, + "total_groups": 8, + "groups_with_active_failures": 1, + "has_inventory_sources": True, + "organization": 1, + "created": _ts(timedelta(days=365)), + "modified": _ts(timedelta(hours=6)), + }, +] + + +def _generate_hosts(inventory_id: int) -> list[dict]: + """Generate realistic hosts for an inventory.""" + hosts: list[dict] = [] + if inventory_id == 1: + roles = ["web", "db", "app", "lb", "monitoring", "cache"] + for i, role in enumerate(roles): + for j in range(5 if role in ("web", "app") else 4 if role == "db" else 3 if role == "monitoring" else 2): + hosts.append({ + "id": len(hosts) + 1, + "type": "host", + "name": f"{role}-{j+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + 
"has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "production", "role": "{role}"}}', + }) + if len(hosts) >= 30: + break + if len(hosts) >= 30: + break + elif inventory_id == 2: + for i in range(15): + role = ["web", "db", "app"][i % 3] + hosts.append({ + "id": 100 + i, + "type": "host", + "name": f"{role}-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging", "role": "{role}"}}', + }) + elif inventory_id == 3: + for i in range(30): + hosts.append({ + "id": 200 + i, + "type": "host", + "name": f"host-{i+1:02d}.prod.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": i in (45, 58), + "variables": f'{{"rhel_version": "9.3", "environment": "production"}}', + }) + for i in range(15): + hosts.append({ + "id": 230 + i, + "type": "host", + "name": f"host-{i+1:02d}.staging.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.3", "environment": "staging"}}', + }) + for i in range(10): + hosts.append({ + "id": 245 + i, + "type": "host", + "name": f"dev-{i+1:02d}.dev.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "8.9", "environment": "development"}}', + }) + for i in range(5): + hosts.append({ + "id": 255 + i, + "type": "host", + "name": f"qa-{i+1:02d}.qa.example.com", + "inventory": inventory_id, + "enabled": True, + "has_active_failures": False, + "variables": f'{{"rhel_version": "9.2", "environment": "qa"}}', + }) + for i in range(3): + hosts.append({ + "id": 260 + i, + "type": "host", + "name": f"legacy-{i+1:02d}.corp.example.com", + "inventory": inventory_id, + "enabled": i < 2, + "has_active_failures": i == 2, + "variables": f'{{"rhel_version": "7.9", "environment": "legacy"}}', + }) + return hosts + + +# 
--------------------------------------------------------------------------- +# Mock data: Job Templates +# --------------------------------------------------------------------------- + +MOCK_JOB_TEMPLATES = [ + { + "id": 10, + "type": "job_template", + "name": "CVE Remediation - Kernel Update", + "description": "Kernel update with boom snapshot for rollback safety", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "successful", + "last_job_run": _ts(timedelta(hours=4)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1001, "status": "successful", "finished": _ts(timedelta(hours=4))}, + }, + "created": _ts(timedelta(days=60)), + "modified": _ts(timedelta(days=2)), + }, + { + "id": 11, + "type": "job_template", + "name": "CVE Remediation - Package Update", + "description": "General package update for CVE remediation with needs-restarting check", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "check", + "verbosity": 1, + "timeout": 1800, + "forks": 10, + "status": "successful", + "last_job_run": _ts(timedelta(hours=12)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, 
"name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1005, "status": "successful", "finished": _ts(timedelta(hours=12))}, + }, + "created": _ts(timedelta(days=45)), + "modified": _ts(timedelta(days=5)), + }, + { + "id": 12, + "type": "job_template", + "name": "CVE Remediation - Generic", + "description": "Generic CVE remediation template for ad-hoc patches", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-remediation.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": True, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": True, + "job_type": "check", + "verbosity": 1, + "timeout": 3600, + "forks": 5, + "status": "never updated", + "last_job_run": None, + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + }, + "created": _ts(timedelta(days=30)), + "modified": _ts(timedelta(days=30)), + }, + { + "id": 20, + "type": "job_template", + "name": "Compliance Check - STIG", + "description": "Run STIG compliance scan across fleet", + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "become_enabled": True, + "ask_job_type_on_launch": True, + "ask_variables_on_launch": False, + "ask_limit_on_launch": True, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 7200, + "forks": 20, + "status": "successful", + "last_job_run": _ts(timedelta(days=1)), + "summary_fields": { + "project": {"id": 7, "name": "Compliance Checks", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 2, "name": "compliance-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1010, "status": "successful", "finished": _ts(timedelta(days=1))}, + }, + 
"created": _ts(timedelta(days=180)), + "modified": _ts(timedelta(days=14)), + }, + { + "id": 25, + "type": "job_template", + "name": "Emergency Patching", + "description": "Emergency patch application — NO become enabled (misconfigured)", + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": False, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 600, + "forks": 25, + "status": "failed", + "last_job_run": _ts(timedelta(days=7)), + "summary_fields": { + "project": {"id": 6, "name": "Remediation Playbooks", "status": "successful"}, + "inventory": {"id": 1, "name": "Production Systems", "total_hosts": 30}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1020, "status": "failed", "finished": _ts(timedelta(days=7))}, + }, + "created": _ts(timedelta(days=200)), + "modified": _ts(timedelta(days=200)), + }, + { + "id": 30, + "type": "job_template", + "name": "Fleet Health Report", + "description": "Generate fleet health and inventory report", + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "become_enabled": False, + "ask_job_type_on_launch": False, + "ask_variables_on_launch": True, + "ask_limit_on_launch": False, + "ask_inventory_on_launch": False, + "job_type": "run", + "verbosity": 0, + "timeout": 1800, + "forks": 30, + "status": "successful", + "last_job_run": _ts(timedelta(hours=6)), + "summary_fields": { + "project": {"id": 8, "name": "Fleet Reporting", "status": "successful"}, + "inventory": {"id": 3, "name": "All Managed Systems", "total_hosts": 63}, + "credentials": [ + {"id": 1, "name": "machine-credential", "kind": "ssh"}, + ], + "last_job": {"id": 1025, "status": "successful", "finished": _ts(timedelta(hours=6))}, + }, + "created": _ts(timedelta(days=120)), + 
"modified": _ts(timedelta(days=14)), + }, +] + +# --------------------------------------------------------------------------- +# Mock data: Jobs (recent runs) +# --------------------------------------------------------------------------- + +PROD_HOSTS = [ + "web-01.prod.example.com", + "web-02.prod.example.com", + "db-01.prod.example.com", + "db-02.prod.example.com", + "app-01.prod.example.com", + "app-02.prod.example.com", +] + +MOCK_JOBS = [ + { + "id": 1001, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "check", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=4, minutes=30)), + "finished": _ts(timedelta(hours=4)), + "elapsed": 1800.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1002, + "type": "job", + "name": "CVE Remediation - Kernel Update", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=3, minutes=45)), + "finished": _ts(timedelta(hours=3)), + "elapsed": 2700.0, + "job_template": 10, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-kernel-update.yml", + "limit": "web-01.prod.example.com,web-02.prod.example.com,db-01.prod.example.com", + "extra_vars": '{"target_cve": "CVE-2024-12345", "remediation_mode": "automated", "verify_after": true}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 10, "name": "CVE Remediation - Kernel Update"}, + }, + }, + { + "id": 1005, + "type": "job", + "name": "CVE Remediation - Package Update", + "job_type": "run", + "status": "successful", + "failed": False, + 
"started": _ts(timedelta(hours=12, minutes=20)), + "finished": _ts(timedelta(hours=12)), + "elapsed": 1200.0, + "job_template": 11, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/cve-package-update.yml", + "limit": "", + "extra_vars": '{"target_cve": "CVE-2024-54321"}', + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 11, "name": "CVE Remediation - Package Update"}, + }, + }, + { + "id": 1010, + "type": "job", + "name": "Compliance Check - STIG", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(days=1, hours=2)), + "finished": _ts(timedelta(days=1)), + "elapsed": 7200.0, + "job_template": 20, + "inventory": 3, + "project": 7, + "playbook": "playbooks/compliance/check-all.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 20, "name": "Compliance Check - STIG"}, + }, + }, + { + "id": 1020, + "type": "job", + "name": "Emergency Patching", + "job_type": "run", + "status": "failed", + "failed": True, + "started": _ts(timedelta(days=7, hours=1)), + "finished": _ts(timedelta(days=7)), + "elapsed": 3600.0, + "job_template": 25, + "inventory": 1, + "project": 6, + "playbook": "playbooks/remediation/emergency-patch.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": 25, "name": "Emergency Patching"}, + }, + }, + { + "id": 1025, + "type": "job", + "name": "Fleet Health Report", + "job_type": "run", + "status": "successful", + "failed": False, + "started": _ts(timedelta(hours=6, minutes=30)), + "finished": _ts(timedelta(hours=6)), + "elapsed": 1800.0, + "job_template": 30, + "inventory": 3, + "project": 8, + "playbook": "playbooks/reporting/fleet-health.yml", + "limit": "", + "extra_vars": "{}", + "launch_type": "scheduled", + "summary_fields": { + "job_template": {"id": 30, "name": "Fleet Health Report"}, + }, + }, +] + +_next_job_id = 2000 + + +# 
--------------------------------------------------------------------------- +# Mock stdout generators +# --------------------------------------------------------------------------- + +def _generate_stdout(job: dict) -> str: + """Generate realistic Ansible playbook stdout for a job.""" + playbook_name = job.get("name", "Unknown") + job_type = job.get("job_type", "run") + status = job.get("status", "successful") + limit = job.get("limit", "") + hosts = limit.split(",") if limit else PROD_HOSTS[:3] + hosts = [h.strip() for h in hosts if h.strip()] + extra_vars = job.get("extra_vars", "{}") + mode = " (CHECK MODE)" if job_type == "check" else "" + + lines = [] + lines.append(f"PLAY [{playbook_name}] *****") + lines.append("") + + lines.append(f"TASK [Gathering Facts{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}]") + lines.append("") + + if "kernel" in playbook_name.lower(): + lines.append(f"TASK [Create boom snapshot for rollback{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}] => {{\"msg\": \"boom create --title pre-remediation-CVE-2024-12345\"}}") + lines.append("") + + lines.append(f"TASK [Check disk space for kernel update{mode}] *****") + for h in hosts: + lines.append(f"ok: [{h}] => {{\"msg\": \"Disk space OK: 45% used\"}}") + lines.append("") + + lines.append(f"TASK [Update kernel package{mode}] *****") + for h in hosts: + result = "changed" if status == "successful" else "fatal" + if result == "changed": + lines.append(f'changed: [{h}] => {{"msg": "kernel-5.14.0-362.24.1.el9_3 -> kernel-5.14.0-362.24.2.el9_3"}}') + else: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Permission denied", "rc": 1}}') + lines.append("") + + lines.append(f"TASK [Check if reboot is needed (needs-restarting -r){mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"rc": 1, "msg": "Reboot is required to fully utilize updates."}}') + lines.append("") + + elif "package" in playbook_name.lower(): + lines.append(f"TASK [Update target packages for CVE remediation{mode}] *****") + for h in hosts: + lines.append(f'changed: [{h}] => {{"msg": "httpd-2.4.53-7.el9 -> httpd-2.4.57-8.el9"}}') + lines.append("") + + lines.append(f"TASK [Restart affected services{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append(f"TASK [Verify service health{mode}] *****") + for h in hosts: + lines.append(f'ok: [{h}] => {{"msg": "Service httpd is running"}}') + lines.append("") + + elif "emergency" in playbook_name.lower() and status == "failed": + lines.append(f"TASK [Apply emergency patch{mode}] *****") + for h in hosts: + lines.append(f'fatal: [{h}]: FAILED! 
=> {{"msg": "Missing sudo password (become_enabled not set)", "rc": 1}}') + lines.append("") + lines.append("NO MORE HOSTS LEFT *****") + lines.append("") + + else: + lines.append(f"TASK [Execute playbook tasks{mode}] *****") + for h in hosts: + lines.append(f"changed: [{h}]") + lines.append("") + + lines.append("PLAY RECAP *****") + for h in hosts: + if status == "successful": + ok_count = random.randint(3, 6) + changed_count = random.randint(1, 3) + lines.append(f"{h:<45} : ok={ok_count} changed={changed_count} unreachable=0 failed=0 skipped=0 rescued=0 ignored=0") + elif status == "failed": + lines.append(f"{h:<45} : ok=1 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0") + lines.append("") + + return "\n".join(lines) + + +def _generate_events(job: dict) -> list[dict]: + """Generate realistic Ansible task events for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + events: list[dict] = [] + eid = 1 + + task_names = ["Gathering Facts"] + if "kernel" in job.get("name", "").lower(): + task_names += [ + "Create boom snapshot for rollback", + "Check disk space for kernel update", + "Update kernel package", + "Check if reboot is needed (needs-restarting -r)", + ] + elif "package" in job.get("name", "").lower(): + task_names += [ + "Update target packages for CVE remediation", + "Restart affected services", + "Verify service health", + ] + else: + task_names += ["Execute playbook tasks"] + + for task_name in task_names: + for host in hosts: + is_failed = job.get("status") == "failed" and task_name != "Gathering Facts" + events.append({ + "id": eid, + "type": "job_event", + "event": "runner_on_ok" if not is_failed else "runner_on_failed", + "task": task_name, + "host": host, + "host_name": host, + "play": job.get("name", ""), + "changed": task_name != "Gathering Facts" and not is_failed, + "failed": is_failed, + "event_data": { + "task": task_name, + "host": 
host, + "res": { + "changed": task_name != "Gathering Facts" and not is_failed, + "msg": "Task completed" if not is_failed else "Permission denied", + }, + }, + "created": _ts(timedelta(hours=4, minutes=30 - eid)), + }) + eid += 1 + + return events + + +def _generate_host_summaries(job: dict) -> list[dict]: + """Generate per-host summaries for a job.""" + hosts = (job.get("limit", "").split(",") if job.get("limit") else PROD_HOSTS[:3]) + hosts = [h.strip() for h in hosts if h.strip()] + summaries: list[dict] = [] + + for i, host in enumerate(hosts): + is_failed = job.get("status") == "failed" + summaries.append({ + "id": i + 1, + "type": "job_host_summary", + "host": i + 1, + "host_name": host, + "ok": 1 if is_failed else random.randint(3, 6), + "changed": 0 if is_failed else random.randint(1, 3), + "dark": 0, + "failures": 1 if is_failed else 0, + "skipped": 0, + "processed": 1, + "failed": is_failed, + }) + + return summaries + + +# --------------------------------------------------------------------------- +# MCP Tools: Job Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def job_templates_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available job templates in AAP. + + Args: + page_size: Number of results per page (default 10, max 200). + search: Optional search string to filter templates by name. + """ + results = MOCK_JOB_TEMPLATES + if search: + s = search.lower() + results = [t for t in results if s in t["name"].lower() or s in t.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_retrieve(id: str) -> dict: + """Retrieve detailed information about a specific job template. + + Args: + id: Job template ID (as string). 
+ """ + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. Job template {id} does not exist."} + return template + + +@mcp.tool() +def projects_list( + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List available projects in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter projects by name. + """ + results = MOCK_PROJECTS + if search: + s = search.lower() + results = [p for p in results if s in p["name"].lower() or s in p.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def job_templates_launch_retrieve( + id: str, + requestBody: Optional[dict] = None, +) -> dict: + """Launch a job from a job template. + + Args: + id: Job template ID to launch. + requestBody: Optional launch parameters including job_type ('run' or 'check'), + extra_vars (dict), and limit (comma-separated host list). + """ + global _next_job_id + tid = int(id) + template = next((t for t in MOCK_JOB_TEMPLATES if t["id"] == tid), None) + if not template: + return {"detail": f"Not found. 
Job template {id} does not exist."} + + body = requestBody or {} + job_type = body.get("job_type", template.get("job_type", "run")) + + if not template.get("ask_job_type_on_launch") and job_type != template.get("job_type"): + return { + "error": f"Cannot override job_type: ask_job_type_on_launch is disabled on template {id}", + } + + job_id = _next_job_id + _next_job_id += 1 + + new_job = { + "id": job_id, + "type": "job", + "name": template["name"], + "job_type": job_type, + "status": "pending", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": None, + "elapsed": 0.0, + "job_template": tid, + "inventory": template["inventory"], + "project": template["project"], + "playbook": template["playbook"], + "limit": body.get("limit", ""), + "extra_vars": str(body.get("extra_vars", {})), + "launch_type": "manual", + "summary_fields": { + "job_template": {"id": tid, "name": template["name"]}, + }, + } + MOCK_JOBS.append(new_job) + + # Simulate job completion after launch + new_job["status"] = "successful" + new_job["finished"] = _ts(timedelta(seconds=-300)) + new_job["elapsed"] = 300.0 + + return { + "job": job_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{job_id}/", + "related": { + "stdout": f"/api/controller/v2/jobs/{job_id}/stdout/", + "job_events": f"/api/controller/v2/jobs/{job_id}/job_events/", + "job_host_summaries": f"/api/controller/v2/jobs/{job_id}/job_host_summaries/", + }, + } + + +@mcp.tool() +def jobs_retrieve(id: int) -> dict: + """Get the status and details of a job run. + + Args: + id: Job ID to retrieve. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return job + + +@mcp.tool() +def jobs_list(page_size: int = 10) -> dict: + """List recent job runs. + + Args: + page_size: Number of results to return. 
+ """ + results = sorted(MOCK_JOBS, key=lambda j: j.get("started", ""), reverse=True) + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def jobs_stdout_retrieve(id: int, format: str = "txt") -> dict: + """Get the stdout (console output) from a job run. + + Args: + id: Job ID. + format: Output format ('txt' or 'json'). Default 'txt'. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + return { + "content": _generate_stdout(job), + "range": {"start": 0, "end": 1}, + } + + +@mcp.tool() +def jobs_job_events_list(id: int, page_size: int = 50) -> dict: + """Get task-level events for a job run. + + Args: + id: Job ID. + page_size: Number of events to return. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + events = _generate_events(job) + return { + "count": len(events), + "next": None, + "previous": None, + "results": events[:page_size], + } + + +@mcp.tool() +def jobs_job_host_summaries_list(id: int) -> dict: + """Get per-host execution summaries for a job run. + + Args: + id: Job ID. + """ + job = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not job: + return {"detail": f"Not found. Job {id} does not exist."} + summaries = _generate_host_summaries(job) + return { + "count": len(summaries), + "next": None, + "previous": None, + "results": summaries, + } + + +@mcp.tool() +def jobs_relaunch_retrieve( + id: int, + hosts: str = "all", + job_type: str = "run", +) -> dict: + """Relaunch a previously completed or failed job. + + Args: + id: Original job ID to relaunch. + hosts: Which hosts to target ('all' or 'failed'). + job_type: Job type for relaunch ('run' or 'check'). 
+ """ + global _next_job_id + original = next((j for j in MOCK_JOBS if j["id"] == id), None) + if not original: + return {"detail": f"Not found. Job {id} does not exist."} + + new_id = _next_job_id + _next_job_id += 1 + + new_job = { + **original, + "id": new_id, + "job_type": job_type, + "status": "successful", + "failed": False, + "started": _ts(timedelta(seconds=0)), + "finished": _ts(timedelta(seconds=-300)), + "elapsed": 300.0, + "launch_type": "relaunch", + } + MOCK_JOBS.append(new_job) + + return { + "job": new_id, + "status": "pending", + "type": "job", + "url": f"/api/controller/v2/jobs/{new_id}/", + } + + +# --------------------------------------------------------------------------- +# MCP Tools: Inventory Management +# --------------------------------------------------------------------------- + +@mcp.tool() +def inventories_list( + page_size: int = 10, + search: Optional[str] = None, +) -> dict: + """List available inventories in AAP. + + Args: + page_size: Number of results per page. + search: Optional search string to filter inventories. + """ + results = MOCK_INVENTORIES + if search: + s = search.lower() + results = [inv for inv in results if s in inv["name"].lower() or s in inv.get("description", "").lower()] + return { + "count": len(results), + "next": None, + "previous": None, + "results": results[:page_size], + } + + +@mcp.tool() +def hosts_list( + inventory_id: Optional[int] = None, + page_size: int = 50, + search: Optional[str] = None, +) -> dict: + """List hosts in an inventory. + + Args: + inventory_id: Filter by inventory ID. If not provided, defaults to inventory 1. + page_size: Number of results per page. + search: Optional search string to filter hosts by name. 
+ """ + inv_id = inventory_id or 1 + hosts = _generate_hosts(inv_id) + if search: + s = search.lower() + hosts = [h for h in hosts if s in h["name"].lower()] + return { + "count": len(hosts), + "next": None if len(hosts) <= page_size else "/api/controller/v2/hosts/?page=2", + "previous": None, + "results": hosts[:page_size], + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..fe5d549c --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,695 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", 
"display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__playbook-executor/instruction.md b/evaluation/without_skills/rh-sre__playbook-executor/instruction.md new file mode 100644 index 00000000..5cced969 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/instruction.md @@ -0,0 +1,18 @@ +# Playbook Execution Task + +You are a Red Hat SRE. A remediation playbook needs to be executed against production systems through Ansible Automation Platform. You are responsible for the safe execution and monitoring of this job. + +## Scenario +A CVE remediation playbook has been prepared and a job template exists in AAP. 
You need to execute it safely: validate the template first, consider running a dry-run, launch the production job, monitor its progress, and report the results. + +## Requirements +- Locate and validate the job template in AAP (check it has the right inventory, project, credentials, and privilege escalation) +- Document a pre-flight checklist: template readiness, target hosts, and any prerequisites +- Plan the execution: whether to run a dry-run (check mode) first, how to monitor job progress, and what to do if it fails +- Launch the job (or document the launch procedure) and monitor its status +- Report per-host results: which hosts succeeded, which failed, and any error details +- Include guidance for handling failures (retry, rollback, escalation) + +Document your methodology, execution plan, and results in `/root/report.md`. + +Use MCP tools to interact with AAP. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__playbook-executor/solution/solve.sh b/evaluation/without_skills/rh-sre__playbook-executor/solution/solve.sh new file mode 100644 index 00000000..090c2294 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Playbook Execution Report + +## Execution Steps +1. Dry-run: job_type='check' (Ansible check mode) +2. Review results +3. Execute: job_type='run' + +## Git Flow +Playbook stored at playbooks/remediation/cve-2024-12345.yml. Commit, push, wait for sync complete before launch. No override at launch—AAP runs from synced project. + +## Job Template Validation +Invoke job-template-remediation-validator for each candidate template. 
+ +## Execution Report +- Status: Success +- Systems patched: 4/4 +- Validate job log (jobs_stdout_retrieve) for CVE handling +- Suggest remediation-verifier after success +REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__playbook-executor/task.toml b/evaluation/without_skills/rh-sre__playbook-executor/task.toml new file mode 100644 index 00000000..eaa9b790 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__playbook-executor" +name = "rh-sre Playbook Execution Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "playbook-executor", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__playbook-executor/tests/llm_judge.py b/evaluation/without_skills/rh-sre__playbook-executor/tests/llm_judge.py new file mode 100644 index 00000000..15da24ed --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "launch_config_and_git_flow", "file": "/root/report.md", "question": "Does the report configure launch-time prompts for flexibility (variables, host limits, job type) and require Git synchronization before execution?", "reference": "A skilled report configures launch-time prompts and requires Git sync. 
An unskilled report hardcodes execution settings and skips synchronization requirements."}, + {"id": "relaunch_failed_hosts", "file": "/root/report.md", "question": "Does the report mention relaunching with hosts: failed to retry only failed hosts?", "reference": "A skilled report uses jobs_relaunch_retrieve with hosts: failed. An unskilled report suggests full re-execution."}, + {"id": "dry_run_and_monitoring", "file": "/root/report.md", "question": "Does the report recommend dry-run first and include per-host execution monitoring?", "reference": "A skilled report follows check mode before run and monitors per-host. An unskilled report runs directly without dry-run."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__playbook-executor/tests/test.sh b/evaluation/without_skills/rh-sre__playbook-executor/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__playbook-executor/tests/test_outputs.py b/evaluation/without_skills/rh-sre__playbook-executor/tests/test_outputs.py new file mode 100644 index 00000000..dab37078 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-executor/tests/test_outputs.py @@ -0,0 +1,89 @@ +""" +Tests for rh-sre__playbook-executor per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['playbook', 'execut', 'job']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_git_flow_mandatory(self): + """Skill: When template playbook path differs from generated playbook, Git Flow (commit, push, sync) is MANDATORY before launch.""" + c = read_report().lower() + has_git = any(t in c for t in ["git", "commit", "push", "sync"]) + has_block = any(t in c for t in ["before launch", "mandatory", "must", "block", "sync complete"]) + assert has_git or has_block, ( + "should require Git Flow when path differs (skill: no override at launch)" + ) + + def test_launch_configuration(self): + """Skill teaches configuring launch-time prompts for execution flexibility + (job type, variables, host limiting). 
Without skill, agents run playbooks + with hardcoded settings.""" + c = read_report().lower() + has_launch = any(t in c for t in ["launch", "prompt", "on launch"]) + has_config = any(t in c for t in [ + "variable", "limit", "job type", "configur", + ]) + assert has_launch and has_config, ( + "should configure launch-time prompts for execution flexibility" + ) + + def test_relaunch_failed_hosts(self): + """Skill: jobs_relaunch_retrieve with hosts: 'failed' to retry only failed hosts.""" + c = read_report().lower() + assert any(t in c for t in ["relaunch", "failed hosts", "retry failed"]), ( + "should mention relaunch for failed hosts (skill: jobs_relaunch_retrieve)" + ) + + def test_dry_run_first(self): + """Skill: Recommend dry-run (check mode) before production execution.""" + c = read_report().lower() + assert any(t in c for t in ["dry", "check mode", "check_mode", "preview", "before launch"]), ( + "should recommend dry-run first (skill: Phase 3)" + ) + + def test_per_host_results(self): + """Skill: Report per-host results (succeeded, failed, error details).""" + c = read_report().lower() + has_per_host = any(t in c for t in ["per host", "each host", "host result", "stdout", "host summary"]) + has_ansible_outcome = any(t in c for t in ["succeeded", "failed", "unreachable", "skipped", "changed"]) + assert has_per_host or has_ansible_outcome, ( + "should report per-host execution results (skill: host summaries)" + ) + + def test_error_taxonomy(self): + """Docs teach error taxonomy: connection/permissions/package/service/disk + failure categories with specific recovery paths. 
+ Without docs, agents treat all errors generically.""" + c = read_report().lower() + categories = ["connection", "permission", "package", "service", "disk"] + mentioned = sum(1 for cat in categories if cat in c) + assert mentioned >= 2, ( + "should categorize errors by type (connection/permissions/package/service/disk)" + ) diff --git a/evaluation/without_skills/rh-sre__playbook-generator/environment/Dockerfile b/evaluation/without_skills/rh-sre__playbook-generator/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-generator/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__playbook-generator/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__playbook-generator/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..2269a235 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-generator/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,722 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP 
server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags, no explicit environment) + +CVE data (6 CVEs; the five below plus CVE-2026-1234, Critical 9.8, kernel RCE): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": 
f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": 
_ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + 
"satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 
else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": 
f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + 
"rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2026-1234": { + "cve_id": "CVE-2026-1234", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Critical kernel vulnerability: remote code execution in kernel network stack allows unauthenticated attackers to execute arbitrary code via crafted packets", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2026-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": 
"sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 8, + "total_remediated": 2, + "total_vulnerable": 6, + }, + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": 
"sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. 
" + "PCI-DSS allows longer timelines for moderate risks." + ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{15+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-022-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-023-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{40+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-045-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__playbook-generator/instruction.md b/evaluation/without_skills/rh-sre__playbook-generator/instruction.md new file mode 100644 index 00000000..585c6f73 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-generator/instruction.md @@ -0,0 +1,17 @@ +# Playbook Generation Task + +You are a Red Hat SRE. A critical CVE has been identified affecting systems in your fleet. You need to generate a remediation playbook that can be used to patch the affected hosts. + +## Scenario +CVE-2026-1234 has been confirmed as a critical kernel vulnerability affecting multiple production RHEL systems. You need to generate an Ansible playbook that remediates this CVE on the affected hosts. 
+ +## Requirements +- Use available tools to generate a remediation playbook for the CVE +- Review the generated playbook content: what packages it updates, whether it requires a reboot, and any risk factors +- Document the playbook metadata: target CVE, affected systems, reboot requirements, and delegation safety notes +- If playbook generation fails, document the failure and describe alternative approaches +- The playbook should be ready to hand off for execution (do not execute it yourself) + +Document the generated playbook and your analysis in `/root/report.md`. You MUST write the report file — do not just display the content. + +Use MCP tools to interact with the environment. If reference documentation or skills are available in this environment, consult them before beginning work. Complete the entire analysis autonomously — do not stop after preliminary steps. Proceed through playbook generation, review, and report writing without waiting for user input. diff --git a/evaluation/without_skills/rh-sre__playbook-generator/solution/solve.sh b/evaluation/without_skills/rh-sre__playbook-generator/solution/solve.sh new file mode 100644 index 00000000..2543cf93 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-generator/solution/solve.sh @@ -0,0 +1,38 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Playbook Generation Report + +## Methodology +Used `create_vulnerability_playbook` MCP tool (remediations endpoint via Lightspeed) to generate the remediation playbook for CVE-2026-1234. The playbook is returned AS IS — unmodified from the generation tool output. No pre-flight, backup, or restart steps were added. 
+ +## Generated Playbook (returned AS IS, unmodified) +```yaml +- hosts: affected_systems + tasks: + - block: + - name: Create boom snapshot + command: boom create --title "pre-cve-{{ cve_id }}" + - name: Apply patch + dnf: + name: '*' + state: latest + rescue: + - name: Rollback + command: boom rollback + always: + - name: Check reboot needed + command: needs-restarting -r + register: needs_restarting +``` + +## Key Patterns +- block/rescue/always for error handling +- needs-restarting -r for reboot detection (RHEL 8/9) +- boom create for kernel/snapshot before remediation + +## Failure Handling +If the create_vulnerability_playbook tool fails, do not auto-generate a playbook from general knowledge. Present the user with options: (A) Retry the tool, (B) Generate from knowledge with explicit user approval, or (C) Exit and escalate. + +## Execution +Do NOT run ansible-playbook directly. Delegate execution to the playbook-executor skill/workflow. +REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__playbook-generator/task.toml b/evaluation/without_skills/rh-sre__playbook-generator/task.toml new file mode 100644 index 00000000..1ea4f1ac --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-generator/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__playbook-generator" +name = "rh-sre Playbook Generation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "playbook-generator", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__playbook-generator/tests/llm_judge.py b/evaluation/without_skills/rh-sre__playbook-generator/tests/llm_judge.py new file mode 100644 index 
00000000..05cd660f --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-generator/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "mcp_tool_for_generation", "file": "/root/report.md", "question": "Does the report describe using an MCP tool (such as create_vulnerability_playbook or a remediations/Lightspeed endpoint) to generate the playbook, rather than writing it manually from scratch?", "reference": "A skilled report uses the Lightspeed MCP create_vulnerability_playbook tool. An unskilled report writes the playbook manually from general Ansible knowledge without using an MCP generation tool."}, + {"id": "return_as_is", "file": "/root/report.md", "question": "Does the report explicitly state that the generated playbook should be returned AS IS or unmodified, without adding extra steps like pre-flight checks, backup tasks, or restart handlers?", "reference": "A skilled report emphasizes returning the tool output unmodified. An unskilled report adds pre-flight checks, backup steps, restart handlers, or other enhancements to the generated playbook."}, + {"id": "delegation_not_execution", "file": "/root/report.md", "question": "Does the report explicitly state that playbook execution should be delegated to a separate execution workflow and NOT run directly via ansible-playbook?", "reference": "A skilled report delegates execution to a dedicated execution workflow rather than running ansible-playbook directly. An unskilled report runs ansible-playbook directly or doesn't address the execution boundary."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__playbook-generator/tests/test.sh b/evaluation/without_skills/rh-sre__playbook-generator/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-generator/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 +
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-sre__playbook-generator/tests/test_outputs.py b/evaluation/without_skills/rh-sre__playbook-generator/tests/test_outputs.py new file mode 100644 index 00000000..00518d36 --- /dev/null +++ b/evaluation/without_skills/rh-sre__playbook-generator/tests/test_outputs.py @@ -0,0 +1,74 @@ +""" +Tests for rh-sre__playbook-generator per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['playbook', 'generat', 'cve']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_mcp_tool_for_generation(self): + """Skill: Use create_vulnerability_playbook MCP tool, not manual playbook writing.""" + c = read_report().lower() + assert any(t in c for t in [ + "create_vulnerability_playbook", "create_vuln_playbook", + "remediations", "lightspeed", + ]) and any(t in c for t in ["tool", "mcp", "generat"]), ( + "should reference MCP tool usage for playbook generation (not manual writing)" + ) + + def test_no_modifications_to_playbook(self): + """Skill: Return playbook AS IS, no modifications—never add pre-flight, backup, restart.""" + c = read_report().lower() + assert any(t in c for t in [ + "as is", "as-is", "unmodified", "do not modify", "no modification", + "unchanged", "without modification", "returned unchanged", + "original output", "generated 
output", + ]), "should return playbook unmodified (skill: no enhancements without user approval)" + + def test_no_auto_generate_on_failure(self): + """Skill: Never auto-generate playbooks from general knowledge without approval.""" + c = read_report().lower() + has_constraint = any(t in c for t in [ + "do not auto", "never auto", "not auto-generat", + "without approval", "explicit approval", "user approval", + "do not generat", "never generat", + ]) + has_options = any(t in c for t in ["retry", "option", "escalat"]) + assert has_constraint or has_options, ( + "should state not to auto-generate playbooks without user approval" + ) + + def test_delegation_to_executor(self): + """Skill: This skill ONLY generates; execution delegated to playbook-executor.""" + c = read_report().lower() + assert any(t in c for t in [ + 'delegat', 'executor', 'playbook-executor', 'hand off', + 'not execute', 'do not run', 'do not execute', + 'not run ansible-playbook', 'not ansible-playbook', + ]), "should delegate execution (not run ansible-playbook directly)" diff --git a/evaluation/without_skills/rh-sre__remediation-verifier/environment/Dockerfile b/evaluation/without_skills/rh-sre__remediation-verifier/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation-verifier/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs 
+ +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__remediation-verifier/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__remediation-verifier/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..e826c96e --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation-verifier/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,759 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() +
"Z" + + +def _system_profile_for_host(host_type: str, rhel_version: str, sid: int) -> dict: + """Generate system_profile fields for a host based on type and RHEL version.""" + el = "el9" if rhel_version.startswith("9") else "el8" + kernel = f"5.14.0-362.24.1.{el}_3.x86_64" if rhel_version.startswith("9") else f"4.18.0-477.27.1.{el}.x86_64" + base_pkgs = [ + {"name": "kernel-core", "version": f"5.14.0-362.24.1.{el}.x86_64"}, + {"name": "httpd", "version": f"2.4.57-5.{el}"}, + {"name": "sshd", "version": f"8.9p1-23.{el}"}, + {"name": "firewalld", "version": f"1.2.5-4.{el}"}, + {"name": "systemd", "version": f"250-19.{el}"}, + ] + if "web" in host_type or "lb" in host_type: + base_pkgs.extend([ + {"name": "nginx", "version": f"1.24.1-3.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + elif "db" in host_type: + base_pkgs.extend([ + {"name": "postgresql", "version": f"15.4-1.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + elif "mon" in host_type: + base_pkgs.extend([ + {"name": "prometheus", "version": f"2.45.0-1.{el}"}, + {"name": "node_exporter", "version": f"1.6.1-2.{el}"}, + ]) + else: + base_pkgs.extend([ + {"name": "java-17-openjdk", "version": f"17.0.8-4.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + services = ["sshd.service", "firewalld.service", "chronyd.service"] + if "web" in host_type or "lb" in host_type: + services.append("httpd.service") + elif "db" in host_type: + services.extend(["postgresql.service", "postgresql-15.service"]) + elif "mon" in host_type: + services.extend(["prometheus.service", "node_exporter.service"]) + else: + services.append("httpd.service") + ip_octet = 10 + (sid % 245) + mac_hex = f"{(sid % 256):02x}" + return { + "installed_packages": base_pkgs[:8], + "running_services": services, + "network_interfaces": [ + {"name": "eth0", "ipv4": [f"10.0.1.{ip_octet}"], "mac": f"52:54:00:a1:b2:{mac_hex}"}, + {"name": "lo", "ipv4": ["127.0.0.1"], "mac": "00:00:00:00:00:00"}, + ], +
"kernel_version": kernel, + } + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + 
"display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else 
timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + 
"rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + 
"rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + # Add system_profile to each host + for idx, s in enumerate(systems): + host_type = "app" # default + for ht in ["web", "db", "app", "lb", "mon", "cache"]: + if ht in s["id"]: + host_type = ht + break + s["system_profile"] = _system_profile_for_host( + host_type, s["rhel_version"], idx + 1 + ) + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + 
+# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. 
Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers (sys-015 through sys-020 in the generated fleet) + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers (sys-038 through sys-040 in the generated fleet) + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__remediation-verifier/instruction.md b/evaluation/without_skills/rh-sre__remediation-verifier/instruction.md new file mode 100644 index 00000000..845397fa --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation-verifier/instruction.md @@ -0,0 +1,18 @@ +# Post-Remediation Verification Task + +You are a Red Hat SRE. A CVE remediation playbook was recently executed against production systems. You need to verify that the fix was successfully applied and that the systems are healthy. + +## Scenario +The operations team ran a remediation playbook for CVE-2026-1234 against affected RHEL systems earlier today. You need to confirm the remediation actually worked and that no systems were left in a broken state. 
+ +## Requirements +- Check whether the affected systems are still listed as vulnerable to the CVE +- Verify that the relevant packages have been updated to the fixed versions +- Confirm that critical services on each system are running and healthy after the patch +- Identify any systems where remediation failed or is incomplete +- For any failures, provide troubleshooting guidance (package conflicts, services not restarting, partial patches) +- Produce a verification summary: how many systems passed, how many failed, and any remaining action items + +Document your methodology, verification results, and assessment in `/root/report.md`. + +Use MCP tools to query system data. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__remediation-verifier/solution/solve.sh b/evaluation/without_skills/rh-sre__remediation-verifier/solution/solve.sh new file mode 100644 index 00000000..49714723 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation-verifier/solution/solve.sh @@ -0,0 +1,20 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Remediation Verification + +## CVE-2024-12345 Status +| System | RPM Status | CVE Status | Service Health | +|--------|-----------|------------|----------------| +| web-01 | installed >= fixed | Patched | Healthy | + +## Checks Performed +- get_cve_systems: System removed from affected list or status=patched +- get_host_details: system_profile.installed_packages >= expected fixed version +- systemd_failed_units: No service disruptions +- enabled_services, running_processes: verified + +## Notes +- Lightspeed inventory lag: up to 24 hours +- Recommend: insights-client --check-results to update inventory +- RPM comparison: installed version >= expected fixed version +REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__remediation-verifier/task.toml b/evaluation/without_skills/rh-sre__remediation-verifier/task.toml new 
file mode 100644 index 00000000..23f81673 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation-verifier/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__remediation-verifier" +name = "rh-sre Remediation Verification Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "remediation-verifier", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__remediation-verifier/tests/llm_judge.py b/evaluation/without_skills/rh-sre__remediation-verifier/tests/llm_judge.py new file mode 100644 index 00000000..15b8919b --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation-verifier/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "inventory_24h_lag", "file": "/root/report.md", "question": "Does the report note that Lightspeed inventory can take up to 24 hours to update and recommend insights-client --check-results for re-sync?", "reference": "A skilled report warns about inventory lag. An unskilled report expects immediate updates."}, + {"id": "system_profile_checks", "file": "/root/report.md", "question": "Does the report use get_host_details with include_system_profile for installed packages and service health verification?", "reference": "A skilled report uses system profile data. 
An unskilled report only checks CVE status."}, + {"id": "three_verification_layers", "file": "/root/report.md", "question": "Does the report verify at least 2 of: CVE status, package version, service health?", "reference": "A skilled report performs defense-in-depth verification. An unskilled report only checks one layer."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__remediation-verifier/tests/test.sh b/evaluation/without_skills/rh-sre__remediation-verifier/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation-verifier/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__remediation-verifier/tests/test_outputs.py b/evaluation/without_skills/rh-sre__remediation-verifier/tests/test_outputs.py new file mode 100644 index 00000000..00ddada6 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation-verifier/tests/test_outputs.py @@ -0,0 +1,75 @@ +""" +Tests for rh-sre__remediation-verifier per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['verif', 'remediation', 'confirm']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_three_checks(self): + """Skill: Verify CVE status + package version + service health (defense in depth).""" + c = read_report().lower() + has_cve = any(t in c for t in ["cve", "vulnerab", "patched", "affected"]) + has_pkg = any(t in c for t in ["package", "version", "installed", "fixed"]) + has_svc = any(t in c for t in ["service", "running", "health", "enabled"]) + assert (has_cve and has_pkg) or (has_cve and has_svc) or (has_pkg and has_svc), ( + "should perform at least 2 of 3 checks (skill: CVE status, package, service)" + ) + + def test_package_version_comparison(self): + """Skill: Compare installed package version to expected fixed version (RPM-style).""" + c = read_report().lower() + has_compare = any(t in c for t in ["compare", "version", "expected", "installed"]) + 
has_fixed = any(t in c for t in ["fixed", "updated", "el8", "el9"]) + assert has_compare or has_fixed, ( + "should compare package versions (skill: verify_package_version)" + ) + + def test_inventory_24h_lag(self): + """Skill: Lightspeed inventory can take up to 24 hours to reflect updated package versions.""" + c = read_report().lower() + has_24 = "24" in c + has_timing = any(t in c for t in ["hour", "propagat", "delay"]) + assert has_24 and has_timing, ( + "should note inventory 24h lag (skill: Best Practices)" + ) + + def test_include_system_profile(self): + """Skill: get_host_details with include_system_profile: true returns installed_packages, enabled_services.""" + c = read_report().lower() + assert any(t in c for t in ["include_system_profile", "system_profile", "installed_packages"]), ( + "should reference include_system_profile for packages/services (skill)" + ) + + def test_insights_client_resync(self): + """Skill: insights-client --check-results triggers inventory re-sync.""" + c = read_report().lower() + assert any(t in c for t in ["insights-client", "check-results"]), ( + "should mention insights-client for inventory resync (skill)" + ) diff --git a/evaluation/without_skills/rh-sre__remediation/environment/Dockerfile b/evaluation/without_skills/rh-sre__remediation/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY 
docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__remediation/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__remediation/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..2269a235 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,722 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. 
+ +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" 
+ stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": 
f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": 
f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": 
False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + "rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": 
f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2026-1234": { + "cve_id": "CVE-2026-1234", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Critical kernel vulnerability: remote code execution in kernel network stack allows unauthenticated attackers to execute arbitrary code via crafted packets", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2026-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": 
"Patched", "remediation_available": True}, + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 8, + "total_remediated": 2, + "total_vulnerable": 6, + }, + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": 
"Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." + ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently."
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed.
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__remediation/instruction.md b/evaluation/without_skills/rh-sre__remediation/instruction.md new file mode 100644 index 00000000..ffd80028 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation/instruction.md @@ -0,0 +1,19 @@ +# CVE Remediation Workflow Task + +You are a Red Hat SRE. A critical CVE has been reported and you need to plan and document a complete end-to-end remediation workflow, from initial validation through execution and verification. + +## Scenario +CVE-2026-1234 (Critical, CVSS 9.8) has been identified as affecting production RHEL systems in your fleet. Management wants a comprehensive remediation plan that covers every phase of the response. 
+ +## Requirements +- Validate the CVE: confirm it is real, assess its severity, and determine if a remediation is available +- Assess the impact: identify which systems are affected and their criticality +- Gather system context: understand each affected system's role, dependencies, and constraints before patching +- Plan playbook generation: how the remediation playbook will be created +- Plan execution: how the playbook will be run (dry-run first, then production), including approval gates and rollback strategy +- Plan verification: how you will confirm remediation was successful after execution +- Present a phased workflow with clear decision points and user confirmation steps at each gate + +Document the complete workflow plan in `/root/report.md`. + +Use MCP tools to query data. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__remediation/solution/solve.sh b/evaluation/without_skills/rh-sre__remediation/solution/solve.sh new file mode 100644 index 00000000..2721e5ff --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation/solution/solve.sh @@ -0,0 +1,21 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# Remediation Plan + +## Orchestration Order +1. Validate MCP connectivity +2. CVE impact analysis +3. Validate CVE remediation availability +4. Gather system context +5. Generate playbook +6. Execute playbook +7. Verify remediation + +## CVE-2024-12345 +- Remediatable: Yes +- Systems: 4 production +- Template: Kernel update with boom snapshot + +## Execution +Wait for user confirmation (yes/proceed) before Step 6 (Execute playbook). Dry-run first, then production run.
+REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__remediation/task.toml b/evaluation/without_skills/rh-sre__remediation/task.toml new file mode 100644 index 00000000..1922d4d5 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__remediation" +name = "rh-sre CVE Remediation Planning Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "remediation", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__remediation/tests/llm_judge.py b/evaluation/without_skills/rh-sre__remediation/tests/llm_judge.py new file mode 100644 index 00000000..c5278840 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "remediation_gate", "file": "/root/report.md", "question": "Does the report gate on remediation availability (checking whether automated remediation is possible for a CVE) before proceeding with playbook generation?", "reference": "A skilled report checks whether automated remediation is available as a prerequisite gate before attempting playbook generation. 
An unskilled report proceeds to generate playbooks without first verifying that remediation is available for the target CVEs."}, + {"id": "plan_before_execution", "file": "/root/report.md", "question": "Does the report present a Remediation Plan with summary/table/checklist for user confirmation before execution?", "reference": "A skilled report requires plan validation before execution. An unskilled report executes without plan review."}, + {"id": "two_part_confirmation", "file": "/root/report.md", "question": "Does the report describe two distinct confirmation checkpoints: one BEFORE starting (upfront planned tasks / Part A) and one AFTER playbook generation but BEFORE execution (execution plan / Part B)?", "reference": "A skilled report has Part A (upfront planned tasks before any remediation step) and Part B (execution plan confirmation after playbook is generated but before running it). An unskilled report has at most one confirmation checkpoint or no structured confirmation phases."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__remediation/tests/test.sh b/evaluation/without_skills/rh-sre__remediation/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__remediation/tests/test_outputs.py b/evaluation/without_skills/rh-sre__remediation/tests/test_outputs.py new file mode 100644 index 00000000..bad4f7c8 --- /dev/null +++ b/evaluation/without_skills/rh-sre__remediation/tests/test_outputs.py @@ -0,0 +1,78 @@ +""" +Tests for rh-sre__remediation per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['remediation', 'orchestrat', 'workflow']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_seven_step_sequence(self): + """Skill: Orchestrate in order: validate MCP → impact → validate CVE → context → playbook → execute → verify.""" + c = read_report().lower() + has_sequence = any(t in c for t in ["validate", "impact", "context", "playbook", "execute", "verify"]) + has_order = any(t in c for t in ["step", "phase", "before", "workflow order", "sequence"]) + assert has_sequence and has_order, ( + "should define 7-step orchestration sequence (skill: workflow order)" + ) + + def test_remediatable_gate(self): + """Skill: Gate on cve-validation: if not remediatable, stop or warn before playbook generation.""" + c = read_report().lower() + has_gate = any(t in c for t in ["remediat", "gate", "remediation_available", "advisory"]) + has_stop = any(t in c for t in ["stop", "cannot proceed", "no automated", 
"manual"]) + assert has_gate or has_stop, ( + "should gate on remediation availability (skill: Remediatable Gate)" + ) + + def test_plan_validation_before_execute(self): + """Skill: Present Remediation Plan (summary, table, checklist); wait for user yes/proceed before Step 5.""" + c = read_report().lower() + has_plan = any(t in c for t in ["plan", "checklist", "summary", "table"]) + has_confirm = any(t in c for t in ["confirm", "proceed", "approval", "yes", "abort"]) + assert has_plan and has_confirm, ( + "should require plan validation before execution (skill: Remediation Plan)" + ) + + def test_dry_run_recommendation(self): + """Skill: Recommend dry-run first; wait for explicit approval before actual execution.""" + c = read_report().lower() + assert any(t in c for t in ["dry-run", "dry run", "check mode", "preview"]), ( + "should recommend dry-run first (skill: before Step 5)" + ) + + def test_two_part_confirmation(self): + """Docs teach Part A (pre-Step-0) and Part B (post-Step-4) confirmations + with ordered step completion marking. 
Without docs, agents use single confirmation.""" + c = read_report().lower() + assert any(t in c for t in [ + "part a", "part b", "pre-step", "post-step", "two-part", + "before step 0", "after step 4", + ]) or ("confirm" in c and "step" in c), ( + "should use two-part confirmation (Part A pre-Step-0, Part B post-Step-4)" + ) diff --git a/evaluation/without_skills/rh-sre__system-context/environment/Dockerfile b/evaluation/without_skills/rh-sre__system-context/environment/Dockerfile new file mode 100644 index 00000000..5a2cfdee --- /dev/null +++ b/evaluation/without_skills/rh-sre__system-context/environment/Dockerfile @@ -0,0 +1,43 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "lightspeed-mcp": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-lightspeed-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-sre__system-context/environment/mcp-servers/mock-lightspeed-mcp.py b/evaluation/without_skills/rh-sre__system-context/environment/mcp-servers/mock-lightspeed-mcp.py new file mode 100644 index 00000000..e826c96e --- /dev/null +++ b/evaluation/without_skills/rh-sre__system-context/environment/mcp-servers/mock-lightspeed-mcp.py @@ -0,0 +1,759 @@ +#!/usr/bin/env python3 +""" +Mock Lightspeed MCP Server + +Simulates the Red Hat Lightspeed MCP server for the 
rh-sre-fleet-inventory +benchmark task. Implements the MCP protocol via FastMCP so that agents can +call get_host_details, get_cve_systems, get_cves, get_cve, and +create_vulnerability_playbook as real MCP tools. + +Fleet composition (63 systems total): + - 30 production (web, db, app, lb, monitoring, cache) + - 15 staging + - 10 development + - 5 QA + - 3 legacy (ambiguous tags — no explicit environment) + +CVE data (5 CVEs): + - CVE-2024-12345 Critical 9.8 RCE in HTTP processing + - CVE-2024-54321 Important 7.5 SQL injection in DB parser + - CVE-2024-11111 Moderate 5.3 Info disclosure in logging + - CVE-2024-98765 Important 8.1 DoS in load balancer + - CVE-2024-22222 Low 3.1 Info disclosure in monitoring +""" + +import os +import random +from datetime import datetime, timedelta +from typing import Optional + +from fastmcp import FastMCP + +random.seed(42) + +REFERENCE_TIME = datetime(2026, 2, 15, 12, 0, 0) + + +# --------------------------------------------------------------------------- +# Mock fleet data +# --------------------------------------------------------------------------- + +def _ts(delta: timedelta) -> str: + """Return an ISO timestamp offset from REFERENCE_TIME.""" + return (REFERENCE_TIME - delta).isoformat() + "Z" + + +def _system_profile_for_host(host_type: str, rhel_version: str, sid: int) -> dict: + """Generate system_profile fields for a host based on type and RHEL version.""" + el = "el9" if rhel_version.startswith("9") else "el8" + kernel = f"5.14.0-362.24.1.{el}_3.x86_64" if rhel_version.startswith("9") else f"4.18.0-477.27.1.{el}.x86_64" + base_pkgs = [ + {"name": "kernel-core", "version": f"5.14.0-362.24.1.{el}.x86_64"}, + {"name": "httpd", "version": f"2.4.57-5.{el}"}, + {"name": "sshd", "version": f"8.9p1-23.{el}"}, + {"name": "firewalld", "version": f"1.2.5-4.{el}"}, + {"name": "systemd", "version": f"250-19.{el}"}, + ] + if "web" in host_type or "lb" in host_type: + base_pkgs.extend([ + {"name": "nginx", "version": f"1.24.1-3.{el}"}, + 
{"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + elif "db" in host_type: + base_pkgs.extend([ + {"name": "postgresql", "version": f"15.4-1.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + elif "mon" in host_type: + base_pkgs.extend([ + {"name": "prometheus", "version": f"2.45.0-1.{el}"}, + {"name": "node_exporter", "version": f"1.6.1-2.{el}"}, + ]) + else: + base_pkgs.extend([ + {"name": "java-17-openjdk", "version": f"17.0.8-4.{el}"}, + {"name": "openssl", "version": f"3.0.7-24.{el}"}, + ]) + services = ["sshd.service", "firewalld.service", "chronyd.service"] + if "web" in host_type or "lb" in host_type: + services.append("httpd.service") + elif "db" in host_type: + services.extend(["postgresql.service", "postgresql-15.service"]) + elif "mon" in host_type: + services.extend(["prometheus.service", "node_exporter.service"]) + else: + services.append("httpd.service") + ip_octet = 10 + (sid % 245) + mac_hex = f"{(sid % 256):02x}" + return { + "installed_packages": base_pkgs[:8], + "running_services": services, + "network_interfaces": [ + {"name": "eth0", "ipv4": [f"10.0.1.{ip_octet}"], "mac": f"52:54:00:a1:b2:{mac_hex}"}, + {"name": "lo", "ipv4": ["127.0.0.1"], "mac": "00:00:00:00:00:00"}, + ], + "kernel_version": kernel, + } + + +def generate_mock_systems() -> list[dict]: + """Generate 63 mock systems with realistic distribution.""" + systems: list[dict] = [] + sid = 1 + + # --- Production (30) --------------------------------------------------- + + # Web servers (8) + for i in range(1, 9): + rhel = "9.3" if i <= 5 else ("9.2" if i <= 7 else "8.9") + stale = i == 7 + tags = ["production", "web-tier"] + if i <= 4: + tags.extend(["customer-facing", "pci-compliant", "high-availability"]) + if i <= 2: + tags.append("critical") + systems.append({ + "id": f"sys-{sid:03d}-web-prod-{i:02d}", + "display_name": f"web-server-{i:02d}.prod.example.com", + "fqdn": f"web-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": 
_ts(timedelta(days=9) if stale else timedelta(hours=random.randint(1, 20))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Database servers (6) + for i in range(1, 7): + rhel = "9.3" if i <= 4 else "9.2" + stale = i == 5 + tags = ["production", "database-tier", "critical"] + if i <= 4: + tags.extend(["pci-compliant", "soc2-compliant", "high-availability"]) + if i <= 2: + tags.append("customer-data") + systems.append({ + "id": f"sys-{sid:03d}-db-prod-{i:02d}", + "display_name": f"db-server-{i:02d}.prod.example.com", + "fqdn": f"db-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=10) if stale else timedelta(hours=random.randint(1, 18))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Application servers (10) + for i in range(1, 11): + rhel = "8.9" if i <= 6 else ("9.2" if i <= 8 else "9.3") + stale = i == 9 + tags = ["production", "app-tier"] + if i <= 3: + tags.extend(["customer-facing", "pci-compliant", "soc2-compliant"]) + elif i <= 6: + tags.extend(["hipaa-compliant", "soc2-compliant"]) + if i <= 5: + tags.append("high-availability") + systems.append({ + "id": f"sys-{sid:03d}-app-prod-{i:02d}", + "display_name": f"app-server-{i:02d}.prod.example.com", + "fqdn": f"app-server-{i:02d}.prod.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=8) if stale else timedelta(hours=random.randint(1, 22))), + "tags": tags, + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + # Load balancers (3) + for i in range(1, 4): + tags = ["production", "loadbalancer", "critical", "high-availability"] + if i <= 2: + tags.append("customer-facing") + systems.append({ + "id": f"sys-{sid:03d}-lb-prod-{i:02d}", + "display_name": f"lb-server-{i:02d}.prod.example.com", + "fqdn": f"lb-server-{i:02d}.prod.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(1, 12))), + "tags": tags, + "stale": False, + 
"satellite_managed": False, + }) + sid += 1 + + # Monitoring (2) + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-mon-prod-{i:02d}", + "display_name": f"monitor-server-{i:02d}.prod.example.com", + "fqdn": f"monitor-server-{i:02d}.prod.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=random.randint(1, 6))), + "tags": ["production", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # Cache (1) — stale + systems.append({ + "id": f"sys-{sid:03d}-cache-prod-01", + "display_name": "cache-server-01.prod.example.com", + "fqdn": "cache-server-01.prod.example.com", + "rhel_version": "8.9", + "last_seen": _ts(timedelta(days=11)), + "tags": ["production", "cache-tier"], + "stale": True, + "satellite_managed": False, + }) + sid += 1 + + # --- Staging (15) ------------------------------------------------------ + + for i in range(1, 5): + rhel = "9.3" if i <= 2 else "9.2" + stale = i == 3 + systems.append({ + "id": f"sys-{sid:03d}-web-stg-{i:02d}", + "display_name": f"web-server-{i:02d}.staging.example.com", + "fqdn": f"web-server-{i:02d}.staging.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=12) if stale else timedelta(hours=random.randint(2, 20))), + "tags": ["staging", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 4): + systems.append({ + "id": f"sys-{sid:03d}-db-stg-{i:02d}", + "display_name": f"db-server-{i:02d}.staging.example.com", + "fqdn": f"db-server-{i:02d}.staging.example.com", + "rhel_version": "9.3" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(3, 18))), + "tags": ["staging", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 6): + systems.append({ + "id": f"sys-{sid:03d}-app-stg-{i:02d}", + "display_name": f"app-server-{i:02d}.staging.example.com", + "fqdn": f"app-server-{i:02d}.staging.example.com", + "rhel_version": "8.9" if i <= 3 
else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(4, 22))), + "tags": ["staging", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-lb-stg-{i:02d}", + "display_name": f"lb-server-{i:02d}.staging.example.com", + "fqdn": f"lb-server-{i:02d}.staging.example.com", + "rhel_version": "8.8", + "last_seen": _ts(timedelta(hours=random.randint(2, 16))), + "tags": ["staging", "loadbalancer"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-stg-01", + "display_name": "monitor-server-01.staging.example.com", + "fqdn": "monitor-server-01.staging.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(hours=8)), + "tags": ["staging", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Development (10) -------------------------------------------------- + + for i in range(1, 4): + rhel = "9.2" if i == 1 else "8.9" + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-dev-{i:02d}", + "display_name": f"web-server-{i:02d}.dev.example.com", + "fqdn": f"web-server-{i:02d}.dev.example.com", + "rhel_version": rhel, + "last_seen": _ts(timedelta(days=15) if stale else timedelta(hours=random.randint(5, 23))), + "tags": ["development", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-db-dev-{i:02d}", + "display_name": f"db-server-{i:02d}.dev.example.com", + "fqdn": f"db-server-{i:02d}.dev.example.com", + "rhel_version": "9.3" if i == 1 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(6, 20))), + "tags": ["development", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 5): + systems.append({ + "id": f"sys-{sid:03d}-app-dev-{i:02d}", + "display_name": f"app-server-{i:02d}.dev.example.com", + "fqdn": 
f"app-server-{i:02d}.dev.example.com", + "rhel_version": "8.9" if i <= 2 else "9.2", + "last_seen": _ts(timedelta(hours=random.randint(8, 22))), + "tags": ["development", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-mon-dev-01", + "display_name": "monitor-server-01.dev.example.com", + "fqdn": "monitor-server-01.dev.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=10)), + "tags": ["development", "monitoring"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- QA (5) ------------------------------------------------------------ + + for i in range(1, 3): + stale = i == 2 + systems.append({ + "id": f"sys-{sid:03d}-web-qa-{i:02d}", + "display_name": f"web-server-{i:02d}.qa.example.com", + "fqdn": f"web-server-{i:02d}.qa.example.com", + "rhel_version": "9.3", + "last_seen": _ts(timedelta(days=14) if stale else timedelta(hours=random.randint(4, 18))), + "tags": ["qa", "web-tier"], + "stale": stale, + "satellite_managed": False, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-db-qa-01", + "display_name": "db-server-01.qa.example.com", + "fqdn": "db-server-01.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=12)), + "tags": ["qa", "database-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + for i in range(1, 3): + systems.append({ + "id": f"sys-{sid:03d}-app-qa-{i:02d}", + "display_name": f"app-server-{i:02d}.qa.example.com", + "fqdn": f"app-server-{i:02d}.qa.example.com", + "rhel_version": "9.2", + "last_seen": _ts(timedelta(hours=random.randint(5, 19))), + "tags": ["qa", "app-tier"], + "stale": False, + "satellite_managed": False, + }) + sid += 1 + + # --- Legacy (3) — ambiguous tags, no explicit environment -------------- + + systems.append({ + "id": f"sys-{sid:03d}-legacy-payment-01", + "display_name": "legacy-payment-gw.example.com", + "fqdn": "legacy-payment-gw.example.com", + 
"rhel_version": "8.7", + "last_seen": _ts(timedelta(hours=3)), + "tags": ["legacy-system", "payment-gateway", "critical"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-reports-01", + "display_name": "reports-legacy.example.com", + "fqdn": "reports-legacy.example.com", + "rhel_version": "8.6", + "last_seen": _ts(timedelta(days=6)), + "tags": ["legacy-system", "reporting", "financial-data"], + "stale": False, + "satellite_managed": True, + }) + sid += 1 + + systems.append({ + "id": f"sys-{sid:03d}-legacy-archive-01", + "display_name": "archive-01.legacy.example.com", + "fqdn": "archive-01.legacy.example.com", + "rhel_version": "8.5", + "last_seen": _ts(timedelta(days=30)), + "tags": ["legacy-system", "archive", "read-only"], + "stale": True, + "satellite_managed": True, + }) + sid += 1 + + # Add system_profile to each host + for idx, s in enumerate(systems): + host_type = "app" # default + for ht in ["web", "db", "app", "lb", "mon", "cache"]: + if ht in s["id"]: + host_type = ht + break + s["system_profile"] = _system_profile_for_host( + host_type, s["rhel_version"], idx + 1 + ) + + return systems + + +MOCK_SYSTEMS = generate_mock_systems() + +# --------------------------------------------------------------------------- +# Mock CVE data +# --------------------------------------------------------------------------- + +MOCK_CVE_DATA = { + "CVE-2024-12345": { + "cve_id": "CVE-2024-12345", + "severity": "Critical", + "cvss_score": 9.8, + "description": "Remote code execution vulnerability in HTTP request processing", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-03-15", + "compliance_notes": ( + "PCI-DSS 6.2 requires critical vulnerabilities be patched within " + "30 days for systems handling cardholder data" + ), + "affected_systems": [ + {"system_id": "sys-001-web-prod-01", "display_name": "web-server-01.prod.example.com", "status": "Vulnerable", 
"remediation_available": True}, + {"system_id": "sys-002-web-prod-02", "display_name": "web-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-003-web-prod-03", "display_name": "web-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-004-web-prod-04", "display_name": "web-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-031-web-stg-01", "display_name": "web-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-032-web-stg-02", "display_name": "web-server-02.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 6, + "total_remediated": 2, + "total_vulnerable": 4, + }, + "CVE-2024-54321": { + "cve_id": "CVE-2024-54321", + "severity": "Important", + "cvss_score": 7.5, + "description": "SQL injection vulnerability in database query parser", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-04-30", + "compliance_notes": ( + "PCI-DSS 6.2 requires high-risk vulnerabilities be patched within " + "90 days. Affects systems storing cardholder data." 
+ ), + "affected_systems": [ + {"system_id": "sys-009-db-prod-01", "display_name": "db-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-010-db-prod-02", "display_name": "db-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-011-db-prod-03", "display_name": "db-server-03.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-012-db-prod-04", "display_name": "db-server-04.prod.example.com", "status": "Patched", "remediation_available": True}, + {"system_id": "sys-035-db-stg-01", "display_name": "db-server-01.staging.example.com", "status": "Vulnerable", "remediation_available": True}, + ], + "total_affected": 5, + "total_remediated": 2, + "total_vulnerable": 3, + }, + "CVE-2024-11111": { + "cve_id": "CVE-2024-11111", + "severity": "Moderate", + "cvss_score": 5.3, + "description": "Information disclosure in application logging", + "pci_impact": True, + "soc2_impact": True, + "hipaa_impact": True, + "compliance_deadline": "2024-06-30", + "compliance_notes": ( + "HIPAA requires remediation of vulnerabilities exposing PHI. " + "PCI-DSS allows longer timelines for moderate risks." 
+ ), + "affected_systems": [ + # 6 vulnerable production app servers + {"system_id": f"sys-{14+i:03d}-app-prod-{i:02d}", "display_name": f"app-server-{i:02d}.prod.example.com", + "status": "Vulnerable", "remediation_available": True} + for i in range(1, 7) + ] + [ + # 2 affected-but-not-vulnerable production app servers + {"system_id": "sys-021-app-prod-07", "display_name": "app-server-07.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "SELinux policy prevents exploitation of logging vulnerability"}, + {"system_id": "sys-022-app-prod-08", "display_name": "app-server-08.prod.example.com", + "status": "Affected but not vulnerable", "remediation_available": True, + "mitigation_reason": "Application logging feature is disabled in configuration"}, + ] + [ + # 3 patched staging app servers + {"system_id": f"sys-{37+i:03d}-app-stg-{i:02d}", "display_name": f"app-server-{i:02d}.staging.example.com", + "status": "Patched", "remediation_available": True} + for i in range(1, 4) + ], + "total_affected": 11, + "total_remediated": 3, + "total_vulnerable": 6, + }, + "CVE-2024-98765": { + "cve_id": "CVE-2024-98765", + "severity": "Important", + "cvss_score": 8.1, + "description": "Denial of service vulnerability in load balancer traffic handling", + "pci_impact": False, + "soc2_impact": True, + "hipaa_impact": False, + "compliance_deadline": "2024-05-15", + "compliance_notes": ( + "SOC2 CC7.1 requires protection of system availability. " + "Critical infrastructure should be patched urgently." 
+ ), + "affected_systems": [ + {"system_id": "sys-025-lb-prod-01", "display_name": "lb-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-026-lb-prod-02", "display_name": "lb-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-027-lb-prod-03", "display_name": "lb-server-03.prod.example.com", "status": "Vulnerable", "remediation_available": True}, + {"system_id": "sys-043-lb-stg-01", "display_name": "lb-server-01.staging.example.com", "status": "Patched", "remediation_available": True}, + ], + "total_affected": 4, + "total_remediated": 1, + "total_vulnerable": 3, + }, + "CVE-2024-22222": { + "cve_id": "CVE-2024-22222", + "severity": "Low", + "cvss_score": 3.1, + "description": "Minor information disclosure in monitoring agent error messages", + "pci_impact": False, + "soc2_impact": False, + "hipaa_impact": False, + "compliance_deadline": None, + "compliance_notes": "Low severity, no immediate compliance impact. Patch during regular maintenance window.", + "affected_systems": [ + {"system_id": "sys-028-mon-prod-01", "display_name": "monitor-server-01.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + {"system_id": "sys-029-mon-prod-02", "display_name": "monitor-server-02.prod.example.com", "status": "Vulnerable", "remediation_available": False}, + ], + "total_affected": 2, + "total_remediated": 0, + "total_vulnerable": 2, + }, +} + + +# --------------------------------------------------------------------------- +# MCP server +# --------------------------------------------------------------------------- + +mcp = FastMCP("lightspeed-mcp") + + +@mcp.tool +def get_host_details( + system_id: Optional[str] = None, + hostname_pattern: Optional[str] = None, + tags: Optional[list[str]] = None, + rhel_version_prefix: Optional[str] = None, +) -> dict: + """Retrieve registered system inventory from Red Hat Lightspeed. 
+ + Returns all systems when called with no arguments. Supports filtering + by system ID, hostname pattern, tags, and RHEL version prefix. + + Args: + system_id: Return only the system matching this ID. + hostname_pattern: Filter by hostname (supports * wildcards). + tags: Filter to systems having at least one of these tags. + rhel_version_prefix: Filter by RHEL version prefix (e.g. "8" or "9.3"). + """ + filtered = list(MOCK_SYSTEMS) + + if system_id: + filtered = [s for s in filtered if s["id"] == system_id] + + if hostname_pattern: + pattern = hostname_pattern.replace("*", "") + filtered = [s for s in filtered if pattern in s["fqdn"]] + + if tags: + filtered = [ + s for s in filtered + if any(t in s.get("tags", []) for t in tags) + ] + + if rhel_version_prefix: + filtered = [ + s for s in filtered + if s["rhel_version"].startswith(rhel_version_prefix) + ] + + return { + "systems": filtered, + "total": len(MOCK_SYSTEMS), + "count": len(filtered), + } + + +@mcp.tool +def get_cve_systems(cve_id: str) -> dict: + """Find systems affected by a specific CVE. + + Returns affected systems with their vulnerability status + (Vulnerable, Patched, or Affected but not vulnerable) and + whether automated remediation is available. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id in MOCK_CVE_DATA: + return MOCK_CVE_DATA[cve_id] + + return { + "cve_id": cve_id, + "affected_systems": [], + "total_affected": 0, + "total_remediated": 0, + } + + +@mcp.tool +def get_cves() -> dict: + """List all known CVEs affecting the fleet. + + Returns summary information for every CVE including severity, + CVSS score, affected/vulnerable counts, and compliance impact. 
+ """ + summaries = [] + for cve in MOCK_CVE_DATA.values(): + entry = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + "remediation_available": any( + s.get("remediation_available", False) + for s in cve["affected_systems"] + ), + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + } + if "total_vulnerable" in cve: + entry["total_vulnerable"] = cve["total_vulnerable"] + summaries.append(entry) + return {"cves": summaries, "total": len(summaries)} + + +@mcp.tool +def get_cve(cve_id: str) -> dict: + """Get detailed information about a specific CVE. + + Returns full CVE metadata including severity, CVSS score, description, + compliance impact, and deadline — but not the per-system breakdown. + Use get_cve_systems for that. + + Args: + cve_id: CVE identifier in CVE-YYYY-NNNNN format. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + result = { + "cve_id": cve["cve_id"], + "severity": cve["severity"], + "cvss_score": cve["cvss_score"], + "description": cve["description"], + "pci_impact": cve["pci_impact"], + "soc2_impact": cve["soc2_impact"], + "hipaa_impact": cve["hipaa_impact"], + "compliance_deadline": cve["compliance_deadline"], + "compliance_notes": cve["compliance_notes"], + "total_affected": cve["total_affected"], + "total_remediated": cve["total_remediated"], + } + if "total_vulnerable" in cve: + result["total_vulnerable"] = cve["total_vulnerable"] + return result + + +@mcp.tool +def create_vulnerability_playbook( + cve_id: str, + system_ids: Optional[list[str]] = None, +) -> dict: + """Generate an Ansible remediation playbook for a CVE. 
+ + Creates a playbook targeting the specified systems (or all vulnerable + systems if none specified). Returns the playbook content and metadata. + + Args: + cve_id: CVE identifier to remediate. + system_ids: Specific system IDs to target. Omit for all vulnerable. + """ + if cve_id not in MOCK_CVE_DATA: + return {"error": f"CVE {cve_id} not found"} + + cve = MOCK_CVE_DATA[cve_id] + if not any(s.get("remediation_available") for s in cve["affected_systems"]): + return { + "error": "No automated remediation available for this CVE", + "cve_id": cve_id, + } + + targets = system_ids or [ + s["system_id"] + for s in cve["affected_systems"] + if s["status"] == "Vulnerable" + ] + + return { + "cve_id": cve_id, + "playbook_id": f"playbook-{cve_id.lower()}-mock", + "target_systems": targets, + "target_count": len(targets), + "status": "generated", + "playbook_content": ( + f"# Auto-generated remediation playbook for {cve_id}\n" + f"# Targets: {len(targets)} systems\n" + f"---\n" + f"- hosts: targeted_systems\n" + f" become: true\n" + f" tasks:\n" + f" - name: Apply patch for {cve_id}\n" + f" dnf:\n" + f" name: '*'\n" + f" state: latest\n" + f" security: true\n" + ), + } + + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-sre__system-context/instruction.md b/evaluation/without_skills/rh-sre__system-context/instruction.md new file mode 100644 index 00000000..95d0540e --- /dev/null +++ b/evaluation/without_skills/rh-sre__system-context/instruction.md @@ -0,0 +1,16 @@ +# System Context Task + +You are a Red Hat SRE. Before rolling out a remediation for a critical vulnerability, you need to gather comprehensive context about the affected systems to make safe remediation decisions. + +## Scenario +A high-severity advisory has been identified that affects multiple systems in your fleet. 
Before applying any patches, you need to understand each affected system's role, current health, installed packages, running services, and any special constraints (maintenance windows, compliance requirements, dependencies). + +## Requirements +- Use MCP tools to query systems in the fleet and identify those affected by the advisory +- For each affected system, gather: system role and criticality, current health and uptime, installed package versions relevant to the advisory, running services that may be impacted, and any compliance or scheduling constraints +- Assess which systems can be patched immediately vs. which need coordination +- Identify dependencies between systems that affect remediation ordering + +Document your system context analysis and remediation readiness assessment in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-sre__system-context/solution/solve.sh b/evaluation/without_skills/rh-sre__system-context/solution/solve.sh new file mode 100644 index 00000000..94c4eb6d --- /dev/null +++ b/evaluation/without_skills/rh-sre__system-context/solution/solve.sh @@ -0,0 +1,19 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# System Context Report + +## Affected Systems +| System | RHEL | Environment | Infrastructure | Tags | +|--------|------|-------------|----------------|------| +| web-01 | 9.3 | Production | bare_metal | pci-compliant | +| db-01 | 8.9 | Staging | virtualized | - | + +## Data Source +get_cve_systems + get_host_details with include_system_profile=true. system_profile: rhel_version, infrastructure_type, installed_packages. 
+ +## Remediation Strategy (Decision Matrix) +- Deployment type: Batch (multiple systems) +- Infrastructure: Bare metal, virtualized +- Maintenance window: Required for production +- Kubernetes: Rolling update with pod eviction if K8s nodes +REPORT_EOF diff --git a/evaluation/without_skills/rh-sre__system-context/task.toml b/evaluation/without_skills/rh-sre__system-context/task.toml new file mode 100644 index 00000000..d060c445 --- /dev/null +++ b/evaluation/without_skills/rh-sre__system-context/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-sre__system-context" +name = "rh-sre System Context Gathering Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-sre", "system-context", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-sre__system-context/tests/llm_judge.py b/evaluation/without_skills/rh-sre__system-context/tests/llm_judge.py new file mode 100644 index 00000000..c2970b3d --- /dev/null +++ b/evaluation/without_skills/rh-sre__system-context/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "infrastructure_classification", "file": "/root/report.md", "question": "Does the report classify systems by infrastructure_type (bare_metal/virtualized/container) and infrastructure_vendor?", "reference": "A skilled report uses infrastructure classification fields. 
An unskilled report doesn't distinguish infrastructure types."}, + {"id": "kubernetes_safety_context", "file": "/root/report.md", "question": "Does the report consider Kubernetes context (PDBs, daemonsets) for safe remediation planning?", "reference": "A skilled report checks hasPdbs and daemonsets for safety. An unskilled report ignores K8s workload context."}, + {"id": "staged_rollout", "file": "/root/report.md", "question": "Does the report recommend staged rollout (staging first, then production batches) based on environment criticality?", "reference": "A skilled report follows staged rollout pattern. An unskilled report patches all systems simultaneously."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-sre__system-context/tests/test.sh b/evaluation/without_skills/rh-sre__system-context/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-sre__system-context/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-sre__system-context/tests/test_outputs.py b/evaluation/without_skills/rh-sre__system-context/tests/test_outputs.py new file mode 100644 index 00000000..ff39869d --- /dev/null +++ b/evaluation/without_skills/rh-sre__system-context/tests/test_outputs.py @@ -0,0 +1,84 @@ +""" +Tests for rh-sre__system-context per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_topic(self): + content = read_report().lower() + assert any(t in content for t in ['system', 'context', 'environment']), ( + "report should mention key topic" + ) + + def test_report_has_structure(self): + content = read_report() + assert len(content) > 150, "report should have substantial content" + + +class TestSkillDependent: + def test_remediation_strategy_by_context(self): + """Skill: Determine strategy from context: batch vs rolling, maintenance window, pod eviction for K8s.""" + c = read_report().lower() + has_strategy = any(t in c for t in ["strategy", "approach", "rolling", "batch"]) + has_context = any(t in c for t in ["maintenance", "pod eviction", "kubernetes", "staging first"]) + assert has_strategy and has_context, ( + "should derive strategy from context (skill: Decision Matrix)" + ) + + def test_rhel_version_distribution(self): + """Skill: Report RHEL version distribution (playbook must support multiple versions).""" + c = read_report().lower() + assert any(t in c for t in ['rhel', 'version', 'distribution', 'el7', 'el8', 'el9']), ( + "Should report RHEL version distribution (skill: conditional dnf/yum)" + ) + + def 
test_environment_and_criticality(self): + """Skill: Classify by environment (prod/staging/dev) and criticality for rollout order.""" + c = read_report().lower() + has_env = any(t in c for t in ["staging", "development", "rollout_order", "rollout order"]) + has_crit = any(t in c for t in ["critical", "criticality", "priority", "high", "rollout"]) + assert has_env and has_crit, ( + "should classify by environment and criticality (skill: rollout_order)" + ) + + def test_infrastructure_classification(self): + """Skill: infrastructure_type (bare_metal/virtualized/container) and infrastructure_vendor (kvm) fields.""" + c = read_report().lower() + has_type = any(t in c for t in ["infrastructure_type", "infrastructure_vendor", "virtualized"]) + has_bare = "bare_metal" in c or "bare metal" in c + assert has_type or has_bare, ( + "should reference infrastructure classification (skill: bare_metal/virtualized/container)" + ) + + def test_kubernetes_context_fields(self): + """Skill: hasPdbs and daemonsets_present for safety planning in K8s context.""" + c = read_report().lower() + has_k8s = any(t in c for t in ["pdb", "daemonset"]) + has_safety = any(t in c for t in ["safety", "eviction"]) + assert has_k8s and has_safety, ( + "should reference PDB/daemonset for K8s safety (skill)" + ) + + def test_needs_restarting_check(self): + """Docs teach needs-restarting -r (exit code 0=no reboot, 1=reboot needed) + and -s for services needing restart. 
Without docs, agents skip this check.""" + c = read_report().lower() + assert any(t in c for t in [ + "needs-restarting", "needs_restarting", "reboot", "restart service", + ]), "should use needs-restarting for reboot/service restart assessment" diff --git a/evaluation/without_skills/rh-virt__vm-clone/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-clone/environment/Dockerfile new file mode 100644 index 00000000..a76f03e8 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-clone/environment/Dockerfile @@ -0,0 +1,50 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-virt__vm-clone/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-clone/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..70ce07d7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-clone/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1465 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. 
+ +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default, prod-vms, test-env + Nodes: 8 workers (hypervisor-class) + VMs: 33 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("prod-vms", {"env": "production"}), + ("test-env", {"env": "testing"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 
14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + 
_vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", 
"rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── prod-vms (instruction-specific) ────────────────────────────────── + _vm("production-db", "prod-vms", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true"}, + 8, 16, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", 
"hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", 
"Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other 
user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + 
"VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + 
annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): 
+ return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + 
"uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + 
for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": 
vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": 
"restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build 
a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + 
return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + 
resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, 
headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + 
"apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources 
+ if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + # Core-group resources (apiVersion "v1") carry no API group suffix in the error. + group = apiVersion.split("/")[0] if "/" in apiVersion else "" + resource_name = f"{kind_lower}.{group}" if group else kind_lower + return f'Error from server (NotFound): {resource_name} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def 
resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." 
+ headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." + + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + 
}) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + """Return virt-launcher pods filtered by namespace and label selector. + + fieldSelector is accepted for signature compatibility but is not applied. + """ + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." 
+ headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' + headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is 
running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or "run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." 
+ + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-clone/instruction.md b/evaluation/without_skills/rh-virt__vm-clone/instruction.md new file mode 100644 index 00000000..922cf2fb --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-clone/instruction.md @@ -0,0 +1,13 @@ +# VM Cloning Task + +You are an OpenShift Virtualization administrator. The QA team needs an exact copy of the production database VM to test a schema migration. Plan the clone so it can run safely in the test environment without affecting production. 
+ +## Requirements +- Inspect the source VM (`production-db` in `prod-vms`) to understand its current state, storage configuration, and whether it needs to be stopped for cloning +- Determine the right cloning approach: whether the storage backend supports efficient cloning, and whether the VM needs to be offline +- Plan the clone target (`test-db-clone` in `test-env`) ensuring it has no network or storage conflicts with the source +- Verify the clone will be fully independent: separate disks, no shared PVCs, and safe to modify without affecting production + +Document your cloning plan in `/root/report.md`. + +Use MCP tools to examine the cluster. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-virt__vm-clone/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-clone/solution/solve.sh new file mode 100644 index 00000000..17534b7f --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-clone/solution/solve.sh @@ -0,0 +1,34 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Cloning Plan + +## Source: production-db (prod-vms) → Target: test-db-clone (test-env) + +### Cloning Strategy +Using **Clone Storage** (full copy) strategy. Three strategies available: +1. **Clone Storage** (selected) — Full copy of all DataVolumes/PVCs. Independent clone. +2. **Reference Existing** — Shared disk. NOT safe for database workloads. +3. **New Empty Storage** — Fresh disk. Loses data. + +Full copy ensures test-db-clone is completely independent from production-db. 
+ +### Spec Modifications for Clone +- Set `runStrategy: Halted` (don't auto-start the clone) +- Regenerate `domain.firmware.uuid` and `domain.firmware.serial` to avoid conflicts +- Update metadata.name to `test-db-clone` +- Update metadata.namespace to `test-env` +- Update DataVolume names to avoid collision + +### Storage Cloning +- Discover source DataVolumes via label: `vm.kubevirt.io/name=production-db` +- CSI volume cloning support required on the StorageClass +- Create new DataVolume with `source.pvc` referencing the original +- **Reference Existing** = shared disk — data corruption risk if both VMs run +- Use `resources_create_or_update` to create cloned VM and DataVolume + +### Verification +- Check target name `test-db-clone` doesn't exist in `test-env` +- Verify CSI driver supports volume cloning +- Monitor DataVolume clone progress + +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-clone/task.toml b/evaluation/without_skills/rh-virt__vm-clone/task.toml new file mode 100644 index 00000000..46cd3410 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-clone/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-clone" +name = "rh-virt VM Cloning Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-clone", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-clone/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-clone/tests/llm_judge.py new file mode 100644 index 00000000..749897f2 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-clone/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib 
import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "firmware_uuid_regeneration", "file": "/root/report.md", "question": "Does the report address firmware UUID/serial regeneration to avoid identity conflicts between source and clone?", "reference": "A skilled report regenerates domain.firmware.uuid and serial in the clone spec. An unskilled report clones without addressing firmware identity."}, + {"id": "storage_clone_strategy", "file": "/root/report.md", "question": "Does the report discuss DataVolume clone strategy using source.pvc and StorageClass considerations?", "reference": "A skilled report uses DataVolume with source.pvc and considers CSI clone support. An unskilled report copies data manually."}, + {"id": "halted_run_strategy", "file": "/root/report.md", "question": "Does the report set runStrategy: Halted for the cloned VM to start in Stopped state?", "reference": "A skilled report ensures the clone starts halted. An unskilled report starts the clone immediately, risking conflicts."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-clone/tests/test.sh b/evaluation/without_skills/rh-virt__vm-clone/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-clone/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$?
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-clone/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-clone/tests/test_outputs.py new file mode 100644 index 00000000..1638de54 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-clone/tests/test_outputs.py @@ -0,0 +1,90 @@ +""" +Tests for rh-virt__vm-clone per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_source_and_target(self): + content = read_report().lower() + has_source = any(t in content for t in ["source", "original", "production"]) + has_target = any(t in content for t in ["clone", "target", "copy", "destination"]) + assert has_source and has_target, "report should identify both a source VM and a clone target" + + +class TestSkillDependent: + def test_storage_class_cloning(self): + """Skill: StorageClass/CSI for PVC cloning strategy.""" + c = read_report().lower() + assert any(t in c for t in ["storageclass", "storage class", "csi", "volume cloning", "pvc clone", "clone support"]), ( + "should mention StorageClass or CSI cloning for clone strategy" + ) + + def test_identity_conflict(self): + """Skill: hostname, cloud-init, SSH key, firmware UUID conflicts between source and clone.""" + c = read_report().lower() + assert any(t in c for t in ["hostname", "cloud-init", "cloud init", "ssh key", "firmware", "uuid", "mac address", "identity conflict"]), ( + "should address identity conflicts (hostname, cloud-init, UUID) between source and clone" + ) + + def test_cross_namespace_rbac(self): + """Skill: RBAC/permissions for cross-namespace cloning.""" + c = 
read_report().lower() + assert any(t in c for t in ["rbac", "permission", "cross-namespace", "cross namespace", "target namespace", "create virtualmachine"]), ( + "should address RBAC or permissions for cross-namespace cloning" + ) + + def test_data_volume_cloning(self): + """Skill: DataVolume with source PVC for clone provisioning.""" + c = read_report().lower() + assert any(t in c for t in ["datavolume", "data volume", "source.pvc", "source pvc", "pvc datasource", "clone storage"]), ( + "should discuss DataVolume or PVC cloning for clone storage" + ) + + def test_datavolume_progress(self): + """Skill: Monitor DataVolume phase (Pending/Succeeded) during clone.""" + c = read_report().lower() + assert any(t in c for t in ["datavolume", "phase", "pending", "succeeded", "cloning progress", "status.phase"]), ( + "should mention monitoring DataVolume phase during clone" + ) + + def test_firmware_uuid_regeneration(self): + """Skill teaches domain.firmware.uuid and domain.firmware.serial must be + regenerated in clone spec to avoid identity conflicts. Without skill, + agents clone without regenerating firmware identifiers.""" + c = read_report().lower() + assert "firmware" in c and ("uuid" in c or "serial" in c), ( + "should address firmware UUID/serial regeneration for clone" + ) + + def test_run_strategy_halted_for_clone(self): + """Skill teaches runStrategy: Halted ensures cloned VM starts in Stopped state. + Without skill, agents start clone immediately.""" + c = read_report().lower() + assert any(t in c for t in ["halted", "runstrategy", "run strategy"]) and ( + "clone" in c or "stop" in c + ), "should set runStrategy: Halted for cloned VM" + + def test_source_pvc_bound(self): + """Docs teach CSI clone prerequisite: source PVC must be in Bound state. 
+ Without docs, agents attempt cloning from unbound PVCs.""" + c = read_report().lower() + assert any(t in c for t in [ + "bound", "pvc status", "source pvc", "prerequisite", + ]) and ("pvc" in c or "storage" in c), ( + "should verify source PVC is Bound before cloning" + ) diff --git a/evaluation/without_skills/rh-virt__vm-create/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-create/environment/Dockerfile new file mode 100644 index 00000000..f77e513d --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-create/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-virt__vm-create/environment/mcp-servers/mock-virt-mcp.py 
b/evaluation/without_skills/rh-virt__vm-create/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..7b17408d --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-create/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1518 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default, vm-testing + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("vm-testing", {"env": "testing"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, 
unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", 
"hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, 
"Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + 
{"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + 
("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information 
disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan 
completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = 
{"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 
VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = 
n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, 
"namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": 
f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + 
"volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + +STORAGE_CLASSES = [ + { + "name": "ocs-storagecluster-ceph-rbd", + "provisioner": "openshift-storage.rbd.csi.ceph.com", + "reclaimPolicy": "Delete", + "volumeBindingMode": "Immediate", + "allowVolumeExpansion": True, + }, + { + "name": "ocs-storagecluster-cephfs", + "provisioner": "openshift-storage.cephfs.csi.ceph.com", + "reclaimPolicy": "Delete", + "volumeBindingMode": "Immediate", + "allowVolumeExpansion": False, + }, +] + +VOLUME_SNAPSHOT_CLASSES = [ + { + "name": "ocs-storagecluster-rbdplugin-snapclass", + "driver": "openshift-storage.rbd.csi.ceph.com", + "deletionPolicy": "Delete", + }, +] + + +def _build_storage_class(sc): + """Build a storage.k8s.io/v1 StorageClass resource.""" + res = { + "apiVersion": "storage.k8s.io/v1", + "kind": "StorageClass", + "metadata": { + "name": sc["name"], + "uid": _uid(sc["name"]), + "creationTimestamp": CREATED, + }, + "provisioner": sc["provisioner"], + "reclaimPolicy": sc["reclaimPolicy"], + "volumeBindingMode": sc["volumeBindingMode"], + } + if sc.get("allowVolumeExpansion"): + res["allowVolumeExpansion"] = True + return res + + +def _build_volume_snapshot_class(vsc): + """Build a snapshot.storage.k8s.io/v1 VolumeSnapshotClass resource.""" + return { + "apiVersion": "snapshot.storage.k8s.io/v1", + "kind": "VolumeSnapshotClass", + "metadata": { + "name": vsc["name"], + "uid": _uid(vsc["name"]), + "creationTimestamp": CREATED, + 
}, + "driver": vsc["driver"], + "deletionPolicy": vsc["deletionPolicy"], + } + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { 
+ "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = 
sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = 
[_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == 
"snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() 
+def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." 
+ + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: 
NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-create/instruction.md b/evaluation/without_skills/rh-virt__vm-create/instruction.md new file mode 100644 index 00000000..f35ed63f --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-create/instruction.md @@ -0,0 +1,14 @@ +# VM Creation Task + +You are an OpenShift Virtualization administrator. The development team needs a new RHEL 9 VM for testing. Provision `test-vm` in the `vm-testing` namespace with appropriate resources. 
+ +## Requirements +- Examine the cluster to determine available node capacity, storage classes, and existing VM templates +- Define the VM specification: 2 CPUs, 4Gi memory, 30Gi root disk, RHEL 9 operating system +- Choose the storage provisioning strategy (which storage class, access mode, volume mode) based on what the cluster offers +- Document what could go wrong during provisioning (e.g., insufficient capacity, storage class not available, image pull failure) and how to handle each case +- Provide the complete VM resource definition + +Document your provisioning plan and VM specification in `/root/report.md`. + +Use MCP tools to examine the cluster. If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-virt__vm-create/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-create/solution/solve.sh new file mode 100644 index 00000000..311af1b5 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-create/solution/solve.sh @@ -0,0 +1,71 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Creation Plan + +## Target: test-vm in vm-testing + +### VirtualMachine Specification + +```yaml +apiVersion: kubevirt.io/v1 +kind: VirtualMachine +metadata: + name: test-vm + namespace: vm-testing +spec: + runStrategy: Always + template: + spec: + domain: + cpu: + cores: 2 + resources: + requests: + memory: 4Gi + devices: + disks: + - name: rootdisk + disk: + bus: virtio + volumes: + - name: rootdisk + dataVolume: + name: test-vm-rootdisk + dataVolumeTemplates: + - metadata: + name: test-vm-rootdisk + spec: + source: + registry: + url: docker://registry.redhat.io/rhel9/rhel-guest-image:latest + storage: + resources: + requests: + storage: 30Gi +``` + +### Storage Configuration +- Using DataVolume with registry source for RHEL 9 guest image +- DataVolume automatically provisions PVC via CDI +- Default StorageClass used (annotated with 
storageclass.kubernetes.io/is-default-class) + +### VM Lifecycle +- `runStrategy: Always` ensures the VM starts automatically and restarts on failure +- Alternative: the older `running: true` field (mutually exclusive with `runStrategy`) for a simple start; `runStrategy` provides more control +- Instance type/size: small (2 vCPU, 4Gi) for testing purposes + +### Default Credentials +- RHEL 9 guest image: requires cloud-init or SSH key for access + +### Prerequisite Checks +- Verify namespace vm-testing exists +- Check default StorageClass is configured (annotation storageclass.kubernetes.io/is-default-class) +- Verify KubeVirt operator is running +- Ensure sufficient node resources (2 CPU, 4Gi memory) + +### Error Handling (from vm-create skill) +- **ErrorUnschedulable**: Consult scheduling-errors.md; add tolerations via oc patch if node taints block scheduling +- **ErrorDataVolumeNotReady**: Storage provisioning has not completed; verify StorageClass, check CDI/DataVolume status +- Access VM: `virtctl console test-vm -n vm-testing` or VNC via OpenShift Console + +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-create/task.toml b/evaluation/without_skills/rh-virt__vm-create/task.toml new file mode 100644 index 00000000..d6ab031e --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-create/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-create" +name = "rh-virt VM Creation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-create", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-create/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-create/tests/llm_judge.py new file mode 100644 index 00000000..8fb930ee --- /dev/null 
+++ b/evaluation/without_skills/rh-virt__vm-create/tests/llm_judge.py @@ -0,0 +1,92 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "unschedulable_handling", "file": "/root/report.md", "question": "Does the report address ErrorUnschedulable and taint/toleration handling for VM placement?", "reference": "A skilled report handles scheduling errors with tolerations. An unskilled report doesn't address placement failures."}, + {"id": "datavolume_provisioning", "file": "/root/report.md", "question": "Does the report describe using DataVolume resources (with CDI) for VM disk provisioning, specifying a source (registry, blank, or PVC)?", "reference": "A skilled report uses DataVolume with a source specification for disk provisioning. An unskilled report creates PVCs manually without CDI integration."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-create/tests/test.sh b/evaluation/without_skills/rh-virt__vm-create/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-create/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-create/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-create/tests/test_outputs.py new file mode 100644 index 00000000..5cf84d0d --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-create/tests/test_outputs.py @@ -0,0 +1,71 @@ +""" +Tests for rh-virt__vm-create per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_vm(self): + content = read_report().lower() + assert any(t in content for t in ["vm", "virtual machine", "virtualmachine"]), ( + "report should reference the target VM" + ) + + def test_mentions_namespace(self): + content = read_report().lower() + assert "namespace" in content, "report should mention the target namespace" + + +class TestSkillDependent: + def test_data_volume_provisioning(self): + """Skill: DataVolume for disk provisioning with image/blank source.""" + c = read_report().lower() + assert any(t in c for t in ["datavolume", "data volume", "cdi.kubevirt.io", "source.registry", "source.blank"]), ( + "should discuss DataVolume for disk provisioning" + ) + + def test_storage_class_provisioning(self): + """Skill: StorageClass for DataVolume/PVC provisioning.""" + c = read_report().lower() + assert any(t in c for t in ["storageclass", "storage class", "volumebindingmode", "provisioner"]) and ( + "storage" in c or "pvc" in c or "datavolume" in c + ), ( + "should mention StorageClass for disk provisioning" + ) + + def test_instance_type_or_workload(self): + """Skill: Instance type (u1.medium) or workload (fedora, rhel) resolution.""" + c = 
read_report().lower() + assert any(t in c for t in ["instancetype", "instance type", "u1.", "u1.medium", "workload", "fedora", "rhel", "ubuntu", "centos"]), ( + "should reference instance types or workload/OS selection" + ) + + def test_unschedulable_toleration(self): + """Skill: ErrorUnschedulable and toleration workaround.""" + c = read_report().lower() + assert any(t in c for t in ["errorunschedulable", "unschedulable", "taint", "toleration", "scheduling"]) and ( + "taint" in c or "toleration" in c or "unschedulable" in c + ), ( + "should address ErrorUnschedulable and taint/toleration handling" + ) + + def test_yaml_or_manifest(self): + """Should include a YAML manifest or structured spec.""" + content = read_report() + assert "apiVersion" in content or "kind:" in content or "spec:" in content or "```yaml" in content or "```yml" in content, ( + "should include a YAML manifest or structured specification" + ) diff --git a/evaluation/without_skills/rh-virt__vm-delete/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-delete/environment/Dockerfile new file mode 100644 index 00000000..a76f03e8 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-delete/environment/Dockerfile @@ -0,0 +1,50 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY mcp-servers 
/root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-virt__vm-delete/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-delete/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2aaace7d --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-delete/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1464 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default, decommission + Nodes: 8 workers (hypervisor-class) + VMs: 33 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", 
{"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("decommission", {"env": "decommission"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", 
"rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", 
"hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── decommission (instruction-specific) ────────────────────────────── + _vm("legacy-app", "decommission", "hv-prod-dc1-01", "rhel-8.6", None, + {"app": "legacy-app", "criticality": "low", "legacy": "true"}, + 2, 4, "Running", True, 30), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, 
"Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + 
remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + 
["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + 
("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return 
hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + 
"firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + 
"topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + 
"name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + 
"storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { 
+ "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def 
_build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": 
{"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + # "==" means the same as "="; drop the extra "=" left on the value + if labels.get(k.strip()) != v.lstrip("=").strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return 
[m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = 
r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and 
kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the 
server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." 
+ + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: 
NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-delete/instruction.md b/evaluation/without_skills/rh-virt__vm-delete/instruction.md new file mode 100644 index 00000000..5769196b --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-delete/instruction.md @@ -0,0 +1,12 @@ +# VM Deletion Task + +You are an OpenShift Virtualization administrator. Plan the safe deletion of VM `legacy-app` in namespace `decommission`. + +## Requirements +- Perform pre-deletion safety checks +- Define the deletion scope (VM only vs VM + storage) +- Include safeguards against accidental deletion + +Use MCP tools to examine the cluster. Document your methodology, findings, and deletion plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-virt__vm-delete/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-delete/solution/solve.sh new file mode 100644 index 00000000..6d87b29d --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-delete/solution/solve.sh @@ -0,0 +1,31 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Deletion Plan + +## Target: legacy-app in decommission + +### Pre-Deletion Safety Checks +1. **Protection label**: Check `metadata.labels.protected` — if `"true"`, deletion is blocked. Remove with `oc label vm legacy-app -n decommission protected-` +2. **Running state**: If VM is running, stop it first via `vm_lifecycle` action=stop +3. **Storage discovery**: List DataVolumes with label `vm.kubevirt.io/name=legacy-app` + +### Deletion Scope Options +- **VM Only** — Keep associated storage (DataVolumes/PVCs) for data recovery +- **VM + Storage** (selected) — Full cleanup of VM and all associated DataVolumes/PVCs + +### Deletion Procedure +1. Verify VM exists and is stopped (use vm_lifecycle action=stop if running) +2. List all associated DataVolumes (apiVersion: cdi.kubevirt.io/v1beta1, labelSelector: vm.kubevirt.io/name=legacy-app) +3. Present deletion scope and storage list +4. **Typed confirmation required**: User must type exact VM name `legacy-app` to proceed +5. Delete VM via resources_delete +6. Delete associated DataVolumes and PVCs via resources_delete +7. Verify deletion completed (resource no longer exists) +8. 
If VM stuck Terminating: consult lifecycle-errors.md, check finalizers + +### Post-Deletion Verification +- Confirm VM resource is gone +- Confirm DataVolumes and PVCs are cleaned up +- Check for any orphaned resources (finalizers) + +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-delete/task.toml b/evaluation/without_skills/rh-virt__vm-delete/task.toml new file mode 100644 index 00000000..063c79fd --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-delete/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-delete" +name = "rh-virt VM Deletion Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-delete", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-delete/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-delete/tests/llm_judge.py new file mode 100644 index 00000000..e1bed079 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-delete/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "protected_label", "file": "/root/report.md", "question": "Does the report check for protected: true label that blocks deletion?", "reference": "A skilled report checks protection labels. 
An unskilled report attempts deletion without safety checks."}, + {"id": "storage_scope", "file": "/root/report.md", "question": "Does the report distinguish VM-only vs VM+storage deletion and warn about orphaned PVCs?", "reference": "A skilled report offers storage scope choice. An unskilled report deletes everything without distinction."}, + {"id": "typed_confirmation", "file": "/root/report.md", "question": "Does the report require typed VM name confirmation (exact, case-sensitive) before deletion?", "reference": "A skilled report requires exact name match confirmation. An unskilled report uses yes/no confirmation."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-delete/tests/test.sh b/evaluation/without_skills/rh-virt__vm-delete/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-delete/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-delete/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-delete/tests/test_outputs.py new file mode 100644 index 00000000..a1c73806 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-delete/tests/test_outputs.py @@ -0,0 +1,82 @@ +""" +Tests for rh-virt__vm-delete per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_vm(self): + content = read_report().lower() + assert any(t in content for t in ["vm", "virtual machine", "virtualmachine"]), ( + "report should reference the target VM" + ) + + def test_mentions_deletion(self): + content = read_report().lower() + assert "delet" in content, "report should discuss deletion" + + +class TestSkillDependent: + def test_stop_before_delete(self): + """Skill: Must stop VM before deletion; vm_lifecycle stop.""" + c = read_report().lower() + assert any(t in c for t in ["stop before delet", "stop and delete", "vm_lifecycle", "halt", "must stop", "running"]) and ( + "stop" in c or "halt" in c + ), ( + "should require stopping VM before deletion" + ) + + def test_orphan_storage(self): + """Skill: VM-only vs VM+storage; orphan PVCs; delete DataVolume/PVC.""" + c = read_report().lower() + assert any(t in c for t in ["vm only", "vm+storage", "datavolume", "orphan", "preserve storage", "delete storage", "pvc"]) and ( + "storage" in c or "pvc" in c or "datavolume" in c + ), ( + "should address storage scope (VM-only vs VM+storage, orphan PVCs)" + ) + + def test_finalizer_handling(self): + """Skill: Finalizer blocking deletion; stuck 
Terminating.""" + c = read_report().lower() + assert any(t in c for t in ["finalizer", "terminating", "stuck", "resources_create_or_update", "remove finalizer"]), ( + "should address finalizer handling for stuck deletion" + ) + + def test_typed_confirmation(self): + """Skill: Typed VM name confirmation (exact match) before delete.""" + c = read_report().lower() + assert any(t in c for t in ["type", "typed", "exact name", "confirm", "to confirm"]) and ( + "name" in c or "vm" in c + ), ( + "should require typed VM name confirmation" + ) + + def test_protected_label(self): + """Skill: protected: true label blocks deletion.""" + c = read_report().lower() + assert any(t in c for t in ["protected", "protected label", "metadata.labels", "refuse delet"]), ( + "should address protected label blocking deletion" + ) + + def test_reclaim_policy_retain(self): + """Docs teach PV reclaim policy Retain blocks PVC deletion; must patch PV + to Delete first. Without docs, agents don't handle stuck PVC cleanup.""" + c = read_report().lower() + assert any(t in c for t in [ + "retain", "reclaim", "reclaimpolicy", "reclaim policy", + "patch pv", "delete policy", + ]), "should address PV reclaim policy Retain blocking cleanup" diff --git a/evaluation/without_skills/rh-virt__vm-inventory/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-inventory/environment/Dockerfile new file mode 100644 index 00000000..a76f03e8 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-inventory/environment/Dockerfile @@ -0,0 +1,50 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + 
cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-virt__vm-inventory/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-inventory/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2e083d72 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-inventory/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1458 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + 
+NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", 
"hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── 
virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + 
{"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, 
affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", 
"Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds 
threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return 
hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + 
"firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + 
"topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + 
"name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + 
"storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { 
+ "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def 
_build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": 
{"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + """Serialize a resource dict to block-style YAML, preserving key order.""" + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + """Return True if labels satisfy a comma-separated equality-based selector (=, ==, !=, key, !key).""" + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + # Accept both "key=value" and "key==value". + if labels.get(k.strip()) != v.lstrip("=").strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + """Keep only resources in the given namespace; None means all namespaces.""" + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return 
[m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = 
r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and 
kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the 
server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + # Only equality selectors on a few known fields are honored; anything else is ignored. + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + # Accept both "field=value" and "field==value". + k, v = k.strip(), v.lstrip("=").strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." 
+ + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + # Core-group resources (apiVersion "v1") carry no API group suffix in the error. + group = apiVersion.split("/")[0] if "/" in apiVersion else "" + qualified = f"{kind_lower}.{group}" if group else kind_lower + return f'Error from server (NotFound): {qualified} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: 
NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + """Return virt-launcher pods filtered by namespace and labels. + + fieldSelector is accepted for signature compatibility but is not applied. + """ + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-inventory/instruction.md b/evaluation/without_skills/rh-virt__vm-inventory/instruction.md new file mode 100644 index 00000000..28107e57 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-inventory/instruction.md @@ -0,0 +1,14 @@ +# VM Inventory Task + +You are an OpenShift Virtualization administrator. Your team needs a complete picture of every VM in the cluster for capacity planning and compliance reporting. 
+
+## Requirements
+- List every VM across all namespaces, grouped by namespace
+- For each VM report: name, status (Running/Stopped/Paused), CPU and memory allocation, operating system, and IP address if running
+- Identify any VMs with issues: stopped unexpectedly, guest agent not responding, degraded conditions, or running end-of-life operating systems
+- Summarize totals: how many VMs per namespace, how many running vs stopped, total resource allocation
+- Sort results by namespace, then by VM name
+
+Write the inventory report in `/root/report.md`.
+
+Use MCP tools to gather VM data. If reference documentation or skills are available in this environment, consult them before beginning work.
diff --git a/evaluation/without_skills/rh-virt__vm-inventory/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-inventory/solution/solve.sh
new file mode 100644
index 00000000..3473c6d5
--- /dev/null
+++ b/evaluation/without_skills/rh-virt__vm-inventory/solution/solve.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+cat > /root/report.md << 'REPORT_EOF'
+# VM Inventory Report
+
+## Cluster VM Summary
+
+| Namespace | VM Name | Status | Resources | Storage | OS | IP | Node |
+|-----------|---------|--------|-----------|---------|----|----|------|
+| prod-vms | production-db | Running | 4 vCPU, 16Gi | 100Gi | RHEL 9.3 | 10.128.2.15 | worker-01 |
+| prod-vms | web-frontend | Running | 2 vCPU, 4Gi | 50Gi | Fedora 39 | 10.128.2.16 | worker-02 |
+| dev-vms | dev-test | Stopped | 2 vCPU, 8Gi | 50Gi | Ubuntu 22.04 | — | — |
+
+### Status Summary
+- Running: 2
+- Stopped: 1
+- Total: 3
+
+### Data Sources
+- VM status: `status.printableStatus` from VirtualMachine resource
+- Resource details: Extracted from VirtualMachineInstance (VMI) when running via resources_list (apiVersion kubevirt.io/v1, allNamespaces=true for cluster-wide)
+- CPU: `.spec.domain.cpu.sockets` (displayed as vCPU)
+- Memory: `.spec.domain.memory.guest`
+- Storage:
`.status.volumeStatus[].persistentVolumeClaimInfo.capacity.storage` +- OS: `.status.guestOSInfo.prettyName` +- IP: `.status.interfaces[0].ipAddress` +- Node: `.status.nodeName` +- Conditions: Ready, AgentConnected, LiveMigratable + +### Sort Order +Sorted by: Namespace → Status (Running → Pending → Stopped → Failed) → VM Name + +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-inventory/task.toml b/evaluation/without_skills/rh-virt__vm-inventory/task.toml new file mode 100644 index 00000000..6a756f27 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-inventory/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-inventory" +name = "rh-virt VM Inventory Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-inventory", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-inventory/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-inventory/tests/llm_judge.py new file mode 100644 index 00000000..aabb1dab --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-inventory/tests/llm_judge.py @@ -0,0 +1,92 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "vmi_vs_vm_distinction", "file": "/root/report.md", "question": "Does the report distinguish between VirtualMachine (spec/desired state) and VirtualMachineInstance (runtime state) as separate resources to query?", "reference": "A skilled report queries both VM and VMI, understanding VM defines the spec while VMI reflects the running 
state. An unskilled report only queries VirtualMachine without VMI runtime data."}, + {"id": "status_ordering", "file": "/root/report.md", "question": "Does the report organize or sort VMs by operational status (e.g., Running first, then Pending, Stopped, Failed) rather than just listing alphabetically?", "reference": "A skilled report groups or sorts VMs by status priority. An unskilled report lists VMs in arbitrary order without status-based organization."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/evaluation/without_skills/rh-virt__vm-inventory/tests/test.sh b/evaluation/without_skills/rh-virt__vm-inventory/tests/test.sh
new file mode 100644
index 00000000..fb1242b7
--- /dev/null
+++ b/evaluation/without_skills/rh-virt__vm-inventory/tests/test.sh
@@ -0,0 +1,85 @@
+#!/bin/bash
+
+pip3 install --break-system-packages \
+    pytest==8.4.1 \
+    pytest-json-ctrf==0.3.5 \
+    "anthropic>=0.75.0"
+
+TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1)
+JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1)
+
+if [ -z "$TEST_FILE" ]; then
+    echo "ERROR: Could not find test_outputs.py"
+    echo "0" > /logs/verifier/reward.txt
+    exit 1
+fi
+
+echo "=== Files created by agent in /root ==="
+ls -la /root/*.md 2>/dev/null || echo "No markdown files found"
+
+echo ""
+echo "════════════════════════════════════════════"
+echo " Phase 1: Deterministic Tests (pytest)"
+echo "════════════════════════════════════════════"
+
+pytest "$TEST_FILE" \
+    --ctrf=/logs/verifier/ctrf.json \
+    -v 2>&1
+
+pytest_exit=$?
+
+pytest_passed=0
+pytest_total=0
+if [ -f /logs/verifier/ctrf.json ]; then
+    pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null)
+    pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null)
+fi
+echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ==="
+
+echo ""
+echo "════════════════════════════════════════════"
+echo " Phase 2: LLM Judge (skill evaluation)"
+echo "════════════════════════════════════════════"
+
+llm_passed=0
+llm_total=0
+
+if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then
+    timeout 180 python3 "$JUDGE_FILE"
+
+    if [ -f /logs/verifier/llm_judge.json ]; then
+        llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null)
+        llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null)
+    fi
+    echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ==="
+else
+    if [ -z "$JUDGE_FILE" ]; then
+        echo "WARNING: llm_judge.py not found, skipping LLM evaluation"
+    fi
+    if [ -z "$ANTHROPIC_API_KEY" ]; then
+        echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation"
+    fi
+fi
+
+echo ""
+echo "════════════════════════════════════════════"
+echo " Combined Score"
+echo "════════════════════════════════════════════"
+
+reward=$(python3 -c "
+pytest_p = int('${pytest_passed}' or 0)
+pytest_t = int('${pytest_total}' or 0)
+llm_p = int('${llm_passed}' or 0)
+llm_t = int('${llm_total}' or 0)
+total_p = pytest_p + llm_p
+total_t = pytest_t + llm_t
+reward = round(total_p / total_t, 4) if total_t > 0 else 0.0
+print(reward)
+" 2>/dev/null)
+
+echo "$reward" > /logs/verifier/reward.txt
+echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ==="
+
+cp /root/*.md /logs/verifier/ 2>/dev/null || true
+
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-inventory/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-inventory/tests/test_outputs.py new file mode 100644 index 00000000..16ded70a --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-inventory/tests/test_outputs.py @@ -0,0 +1,67 @@ +""" +Tests for rh-virt__vm-inventory per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_has_structured_data(self): + content = read_report() + has_table = "|" in content and content.count("|") >= 4 + has_list = content.count("- ") >= 5 + assert has_table or has_list, "report should present VM inventory in a structured format (table or list)" + + def test_mentions_namespace(self): + content = read_report().lower() + assert "namespace" in content, "report should organize by namespace" + + +class TestSkillDependent: + def test_vmi_runtime_data(self): + """Skill: Query VirtualMachineInstance (VMI) for running VM runtime data.""" + c = read_report().lower() + assert any(t in c for t in ["virtualmachineinstance", "vmi", "virtual machine instance"]), ( + "should reference VMI for runtime data, not just VirtualMachine" + ) + + def test_resource_format(self): + """Skill: Resources as 'X vCPU, YGi' format, not instance type names like u1.medium.""" + c = read_report().lower() + assert any(t in c for t in ["vcpu", "vcpus"]) and any(t in c for t in ["gi", "gib"]), ( + "should use vCPU/Gi resource format, not instance type names" + ) + + def test_status_based_grouping(self): + """Skill: Sort by namespace, then status (Running > Pending > Stopped > 
Failed), then name.""" + c = read_report().lower() + status_terms = sum(1 for t in ["running", "stopped", "pending", "failed"] if t in c) + has_organization = any(t in c for t in [ + "group", "sort", "order", "organiz", "by namespace", + "by status", "running first", "namespace", + ]) + assert status_terms >= 2 and has_organization, ( + "should organize VMs with status awareness (Running/Stopped/etc) by namespace" + ) + + def test_conditions_awareness(self): + """Skill: KubeVirt-specific conditions — AgentConnected, LiveMigratable.""" + c = read_report().lower() + assert any(t in c for t in [ + "agentconnected", "agent connected", "agent_connected", + "livemigratable", "live migratable", "live_migratable", + "guest agent", + ]), "should mention KubeVirt-specific conditions (AgentConnected, LiveMigratable)" diff --git a/evaluation/without_skills/rh-virt__vm-lifecycle-manager/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/environment/Dockerfile new file mode 100644 index 00000000..a76f03e8 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/environment/Dockerfile @@ -0,0 +1,50 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ 
+        "command": "python3", \
+        "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \
+      } \
+    } \
+}' > /root/.mcp.json
+
+WORKDIR /root
diff --git a/evaluation/without_skills/rh-virt__vm-lifecycle-manager/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/environment/mcp-servers/mock-virt-mcp.py
new file mode 100644
index 00000000..31b95dd3
--- /dev/null
+++ b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/environment/mcp-servers/mock-virt-mcp.py
@@ -0,0 +1,1467 @@
+#!/usr/bin/env python3
+"""
+Mock OpenShift MCP Server for OpenShift Virtualization.
+
+Faithfully implements the tool interface of:
+  https://github.com/openshift/openshift-mcp-server
+Enabled toolsets: config, core, kubevirt
+
+Simulated OpenShift cluster:
+  Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28)
+  Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev,
+    openshift-cnv, openshift-compliance, openshift-monitoring, default, prod-vms
+  Nodes: 8 workers (hypervisor-class)
+  VMs: 34 KubeVirt VirtualMachines
+  Security: 5 VulnerabilityReports in openshift-compliance
+"""
+
+import hashlib
+import json
+from typing import Optional
+
+import yaml
+from fastmcp import FastMCP
+
+mcp = FastMCP("openshift-virtualization")
+
+CLUSTER = "ocp-virt-prod"
+API_URL = "https://api.ocp-virt-prod.example.com:6443"
+K8S_VER = "v1.28.12+f26e58e"
+OCP_VER = "4.15.8"
+NOW = "2026-03-02T12:00:00Z"
+CREATED = "2025-11-15T10:00:00Z"
+
+# ═══════════════════════════════════════════════════════════════════════════
+# COMPACT DATA
+# ═══════════════════════════════════════════════════════════════════════════
+
+NAMESPACES = [
+    ("virt-prod-dc1", {"env": "production", "dc": "dc1"}),
+    ("virt-prod-dc2", {"env": "production", "dc": "dc2"}),
+    ("virt-staging", {"env": "staging"}),
+    ("virt-dev", {"env": "development"}),
+    ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}),
+    ("openshift-compliance", {"operator": "compliance"}),
+    ("openshift-monitoring", {}),
("default", {}), + ("prod-vms", {"env": "production"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": 
"true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", 
"criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── prod-vms (instruction-specific) ────────────────────────────────── + _vm("web-frontend", "prod-vms", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "customer-facing": "true", "criticality": "high"}, + 4, 8, "Running", True, 1), + _vm("production-db", "prod-vms", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true"}, + 8, 16, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, 
"Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, 
severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", 
"Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", 
"HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + 
hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + 
"requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": 
n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": 
"security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { 
+ "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": 
"rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + 
res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + 
"containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + """Match equality-based label selectors (=, ==, !=, key, !key).""" + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if not sel: + # Skip empty terms such as the one left by a trailing comma. + continue + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + # Accept both "key=value" and "key==value". + if labels.get(k.strip()) != v.lstrip("=").strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY",
"AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", 
"READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, 
headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if 
not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." 
+ + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: 
NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + # Apply the equality field selectors this mock can answer; other keys are ignored. + for sel in (fieldSelector or "").split(","): + key, _, val = sel.partition("=") + key, val = key.strip(), val.lstrip("=").strip() + if key == "spec.nodeName": + pods = [p for p in pods if p["spec"]["nodeName"] == val] + elif key == "metadata.name": + pods = [p for p in pods if p["metadata"]["name"] == val] + elif key == "status.phase": + pods = [p for p in pods if p["status"]["phase"] == val] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".'
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-lifecycle-manager/instruction.md b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/instruction.md new file mode 100644 index 00000000..622a3d38 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/instruction.md @@ -0,0 +1,12 @@ +# VM Lifecycle Operations Task + +You are an OpenShift Virtualization administrator. Plan lifecycle operations for VMs in the cluster: stop `web-frontend` and restart `production-db`, both in namespace `prod-vms`. + +## Requirements +- Define the procedure for each operation +- Address the correct sequencing for restart (not a single atomic operation) +- Include verification steps + +Use MCP tools to examine the cluster. Document your methodology, procedures, and verification steps in `/root/report.md`. 
+ +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-virt__vm-lifecycle-manager/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/solution/solve.sh new file mode 100644 index 00000000..851a4668 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/solution/solve.sh @@ -0,0 +1,29 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Lifecycle Operations Plan + +## Operation 1: Stop web-frontend +- Tool: `vm_lifecycle(namespace="prod-vms", name="web-frontend", action="stop")` +- Effect: Sets runStrategy to Halted +- Verify: `status.printableStatus` changes to "Stopped" + +## Operation 2: Restart production-db +Restart requires TWO separate calls to avoid resourceVersion conflicts: +1. `vm_lifecycle(namespace="prod-vms", name="production-db", action="stop")` +2. Wait for `status.printableStatus == "Stopped"` (poll every 5 seconds) +3. `vm_lifecycle(namespace="prod-vms", name="production-db", action="start")` + +### RunStrategy Mapping +| Action | RunStrategy Set | +|--------|----------------| +| start | Always | +| stop | Halted | +| restart | Always (after stop completes) | + +### Caveats +- Restart is NOT a single atomic operation — it's stop + wait + start +- Graceful shutdown: VM guest agent handles ACPI shutdown signal +- If VM doesn't stop within timeout, force stop may be needed +- Always verify stopped status before issuing start to avoid conflicts + +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-lifecycle-manager/task.toml b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/task.toml new file mode 100644 index 00000000..29808afd --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-lifecycle-manager" +name = "rh-virt VM Lifecycle Management Skill Evaluation" +difficulty = "medium" 
+category = "per-skill-eval" +tags = ["rh-virt", "vm-lifecycle-manager", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/llm_judge.py new file mode 100644 index 00000000..1e8ef2e1 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "two_step_restart", "file": "/root/report.md", "question": "Does the report implement restart as stop→verify stopped→start rather than a single atomic operation?", "reference": "A skilled report separates stop and start to avoid resourceVersion conflicts. An unskilled report uses a single restart command."}, + {"id": "run_strategy_mapping", "file": "/root/report.md", "question": "Does the report map start to RunStrategy: Always and stop to RunStrategy: Halted?", "reference": "A skilled report uses RunStrategy for lifecycle control. An unskilled report uses power state concepts."}, + {"id": "state_verification", "file": "/root/report.md", "question": "Does the report verify VM reached expected state (Stopped/Running) before proceeding to the next operation?", "reference": "A skilled report verifies printableStatus between operations. An unskilled report assumes instant state changes."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. 
You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": 
[], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/test.sh b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + 
+pytest_exit=$? + +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 
2>/dev/null || true + +exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/test_outputs.py new file mode 100644 index 00000000..98907dad --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-lifecycle-manager/tests/test_outputs.py @@ -0,0 +1,75 @@ +""" +Tests for rh-virt__vm-lifecycle-manager per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_operations(self): + c = read_report().lower() + assert ("stop" in c or "halt" in c) and ("restart" in c or "start" in c), ( + "report should discuss stop and restart operations" + ) + + def test_mentions_vms(self): + c = read_report().lower() + assert any(t in c for t in ["vm", "virtual machine", "virtualmachine"]), ( + "report should reference the target VMs" + ) + + +class TestSkillDependent: + def test_two_step_restart(self): + """Skill: Restart = stop then start (not single atomic); resourceVersion conflict.""" + c = read_report().lower() + assert ("stop" in c and "start" in c) and any(t in c for t in ["two", "separate", "sequence", "then", "first", "resourceversion", "conflict"]), ( + "should explain restart as stop-then-start, not single operation" + ) + + def test_run_strategy_control(self): + """Skill: RunStrategy Always/Halted for start/stop; not generic power state.""" + c = read_report().lower() + assert any(t in c for t in ["runstrategy", "run strategy", "always", "halted"]) and ( + "start" in c or "stop" in c + ), ( + "should map start/stop to RunStrategy (Always/Halted)" + ) + 
+ def test_ready_verification(self): + """Skill: Verify status.printableStatus Stopped/Running after each step.""" + c = read_report().lower() + assert any(t in c for t in ["printablestatus", "printable status", "status", "stopped", "running"]) and ( + any(t in c for t in ["verify", "check", "poll", "wait", "before start"]) + ), ( + "should verify VM reached expected state before proceeding" + ) + + def test_vm_lifecycle_tool(self): + """Skill: vm_lifecycle MCP tool for start/stop/restart.""" + c = read_report().lower() + assert any(t in c for t in ["vm_lifecycle", "vm lifecycle", "lifecycle tool", "mcp"]), ( + "should reference vm_lifecycle or MCP lifecycle tool" + ) + + def test_restart_composite(self): + """Skill: Restart implemented as stop → verify stopped → wait → start.""" + c = read_report().lower() + has_stop_start = "stop" in c and "start" in c + has_wait = any(t in c for t in ["wait", "5 second", "poll", "verify stopped"]) + assert has_stop_start and has_wait, ( + "should include wait/verify between stop and start for restart" + ) diff --git a/evaluation/without_skills/rh-virt__vm-rebalance/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-rebalance/environment/Dockerfile new file mode 100644 index 00000000..a76f03e8 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-rebalance/environment/Dockerfile @@ -0,0 +1,50 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: 
default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-virt__vm-rebalance/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-rebalance/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2e083d72 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-rebalance/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1458 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + 
("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 
8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", 
"compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + 
_vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", 
"Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + 
[("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further 
security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return 
f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": 
"rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + 
labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": 
adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + 
"capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": 
"Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": 
"snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# 
FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == 
"kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" 
and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", 
"AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: 
+ resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + group = f'.{apiVersion.split("/")[0]}' if "/" in apiVersion else "" # core kinds ("v1") have no group suffix + return f'Error from server (NotFound): {kind.lower()}s{group} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource 
(YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." 
+ headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." + + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + 
}) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." 
+ headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' + headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is 
running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or "run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." 
+ + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-rebalance/instruction.md b/evaluation/without_skills/rh-virt__vm-rebalance/instruction.md new file mode 100644 index 00000000..b4e5c640 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-rebalance/instruction.md @@ -0,0 +1,13 @@ +# VM Rebalancing Task + +You are an OpenShift Virtualization administrator. Node `hv-prod-dc1-02` is critically overloaded (88% CPU, 82% memory). Plan how to rebalance its workloads by migrating one or more VMs to less utilized nodes. 
+ +## Requirements +- Examine current node utilization and identify which VMs on `hv-prod-dc1-02` are candidates for migration +- Evaluate migration feasibility for each candidate and determine the appropriate migration method +- Select appropriate target nodes based on available capacity and schedulability +- Identify risks and safety considerations that could affect the migration + +Use MCP tools to examine the cluster. Document your methodology, findings, and migration plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-virt__vm-rebalance/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-rebalance/solution/solve.sh new file mode 100644 index 00000000..1f48a04e --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-rebalance/solution/solve.sh @@ -0,0 +1,41 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Rebalancing Plan + +## Current State +Node hv-prod-dc1-02 is critically overloaded: 88% CPU (14080m/16000m), 82% memory (53739Mi/65536Mi). +VMs on this node: vm-web-prod-03, vm-api-prod-01, vm-cache-prod-01, vm-etl-prod-01. + +## Migration Candidates +- vm-web-prod-03 (4 CPU, 8Gi): good candidate, RWX storage supports live migration +- vm-cache-prod-01 (2 CPU, 4Gi): good candidate, small footprint +- vm-etl-prod-01 (4 CPU, 8Gi): degraded (high I/O latency), could benefit from migration but risky during active I/O + +## Live Migration Prerequisites +1. **Storage access mode**: Must be ReadWriteMany (RWX) for live migration. ReadWriteOnce (RWO) requires cold migration (VM must be stopped first). +2. **Node schedulability**: Target node must be schedulable (not cordoned or in maintenance). +3. **CPU model compatibility**: Source and target nodes must support the same CPU model. +4. **Available capacity**: Use allocated vCPU/memory from VM spec, not runtime usage metrics. 
+ +## Target Node Selection +- hv-prod-dc1-01: 74% CPU, 68% memory — can accept one small VM +- hv-prod-dc1-03: cordoned for maintenance — NOT schedulable +- hv-prod-dc2-01/02: different datacenter zone, only for cross-zone rebalancing + +Recommendation: Migrate vm-cache-prod-01 (2 CPU, 4Gi) to hv-prod-dc1-01. + +## Anti-Patterns to Avoid +- **No ping-pong**: Don't migrate VMs back and forth between nodes repeatedly +- **Avoid resource overcommit**: Calculate post-migration allocated resources to ensure target stays below 85% +- **Don't migrate during peak hours**: Schedule during maintenance windows +- **Cold migration caution**: Re-read VM before updating nodeAffinity to avoid resourceVersion conflict +- **Overcommit warning**: If any node exceeds 85% after rebalance, escalate + +## Migration Procedure +1. Verify vm-cache-prod-01 storage is RWX (live migration supported) +2. Verify hv-prod-dc1-01 has capacity for 2 CPU + 4Gi after migration +3. Create VirtualMachineInstanceMigration resource +4. Monitor migration progress for convergence +5. 
Verify VM is healthy on target node post-migration + +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-rebalance/task.toml b/evaluation/without_skills/rh-virt__vm-rebalance/task.toml new file mode 100644 index 00000000..d79dfbba --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-rebalance/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-rebalance" +name = "rh-virt VM Rebalancing Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-rebalance", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-rebalance/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-rebalance/tests/llm_judge.py new file mode 100644 index 00000000..76052f1f --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-rebalance/tests/llm_judge.py @@ -0,0 +1,92 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "cpu_compatibility_check", "file": "/root/report.md", "question": "Does the report check CPU model or feature compatibility between source and target nodes before recommending migration?", "reference": "A skilled report verifies CPU compatibility (model, features) to ensure live migration success. 
An unskilled report migrates VMs without CPU compatibility checks."}, + {"id": "overcommit_awareness", "file": "/root/report.md", "question": "Does the report assess overcommit risk (whether the target node will exceed capacity after receiving migrated VMs)?", "reference": "A skilled report calculates whether the target node can handle the additional load without overcommitting. An unskilled report moves VMs without capacity verification."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-rebalance/tests/test.sh b/evaluation/without_skills/rh-virt__vm-rebalance/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-rebalance/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-rebalance/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-rebalance/tests/test_outputs.py new file mode 100644 index 00000000..ea445584 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-rebalance/tests/test_outputs.py @@ -0,0 +1,57 @@ +""" +Tests for rh-virt__vm-rebalance per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_migration(self): + content = read_report().lower() + assert "migrat" in content, "report should discuss migration" + + def test_mentions_node(self): + content = read_report().lower() + assert any(t in content for t in ["node", "overload", "imbalance", "utilization"]), ( + "report should reference cluster nodes or load imbalance" + ) + + +class TestSkillDependent: + def test_cpu_compatibility(self): + """Skill: CPU model/feature compatibility between source and target nodes.""" + c = read_report().lower() + assert any(t in c for t in ["cpu model", "cpu compatible", "cpu feature", "cpu architecture", "migration compatibility"]) or ( + "cpu" in c and ("compatib" in c or "model" in c) + ), ( + "should address CPU compatibility for migration" + ) + + def test_virtualmachineinstancemigration(self): + """Skill: VirtualMachineInstanceMigration for live migration.""" + c = read_report().lower() + assert any(t in c for t in ["virtualmachineinstancemigration", "vmi migration", "migration cr", "migration resource"]), ( + "should reference VirtualMachineInstanceMigration API" + ) + + def test_overcommit_warning(self): + """Skill: Overcommit detection; warn 
if node exceeds 100% after rebalance.""" + c = read_report().lower() + assert any(t in c for t in ["overcommit", "over commit", "exceed 100", "capacity"]) and ( + "overcommit" in c or "100" in c or "exceed" in c + ), ( + "should address overcommit risk when rebalancing" + ) diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-create/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-snapshot-create/environment/Dockerfile new file mode 100644 index 00000000..f77e513d --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-create/environment/Dockerfile @@ -0,0 +1,63 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY docs /root/docs + +RUN mkdir -p /logs/agent/sessions && \ + ln -s /root/docs /logs/agent/sessions/docs + +COPY docs /root/.claude/docs +COPY docs /root/.codex/docs +COPY docs /root/.opencode/docs +COPY docs /root/.goose/docs +COPY docs /root/.factory/docs +COPY docs /root/.agents/docs +COPY docs /root/.gemini/docs + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git 
a/evaluation/without_skills/rh-virt__vm-snapshot-create/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-snapshot-create/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..912fb2d6 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-create/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1539 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default, prod-vms + Nodes: 8 workers (hypervisor-class) + VMs: 33 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + ("prod-vms", {"env": "production"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, 
taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": 
"high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", 
"hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── prod-vms (instruction-specific) ────────────────────────────────── + _vm("production-db", "prod-vms", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true"}, + 8, 16, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", 
"rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code 
execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + 
("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", 
"GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = 
hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + 
}, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] 
= n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + 
"synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 
DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + 
{"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + +STORAGE_CLASSES = [ + { + "name": "ocs-storagecluster-ceph-rbd", + "provisioner": "openshift-storage.rbd.csi.ceph.com", + "reclaimPolicy": "Delete", + "volumeBindingMode": "Immediate", + "allowVolumeExpansion": True, + }, + { + "name": "ocs-storagecluster-cephfs", + "provisioner": "openshift-storage.cephfs.csi.ceph.com", + "reclaimPolicy": "Delete", + "volumeBindingMode": "Immediate", + "allowVolumeExpansion": False, + }, +] + +VOLUME_SNAPSHOT_CLASSES = [ + { + "name": "ocs-storagecluster-rbdplugin-snapclass", + "driver": "openshift-storage.rbd.csi.ceph.com", + "deletionPolicy": "Delete", + }, +] + + +def _build_storage_class(sc): + """Build a storage.k8s.io/v1 StorageClass resource.""" + res = { + "apiVersion": "storage.k8s.io/v1", + "kind": "StorageClass", + "metadata": { + "name": sc["name"], + "uid": _uid(sc["name"]), + "creationTimestamp": CREATED, + }, + "provisioner": sc["provisioner"], + "reclaimPolicy": sc["reclaimPolicy"], + "volumeBindingMode": sc["volumeBindingMode"], + } + if sc.get("allowVolumeExpansion"): + res["allowVolumeExpansion"] = True + return res + + +def 
_build_volume_snapshot_class(vsc): + """Build a snapshot.storage.k8s.io/v1 VolumeSnapshotClass resource.""" + return { + "apiVersion": "snapshot.storage.k8s.io/v1", + "kind": "VolumeSnapshotClass", + "metadata": { + "name": vsc["name"], + "uid": _uid(vsc["name"]), + "creationTimestamp": CREATED, + }, + "driver": vsc["driver"], + "deletionPolicy": vsc["deletionPolicy"], + } + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": 
"kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, 
default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", 
"AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == 
"DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "storage.k8s.io/v1" and kind == "StorageClass": + resources = [_build_storage_class(sc) for sc in STORAGE_CLASSES] + headers = ["NAME", "PROVISIONER", "RECLAIMPOLICY", "VOLUMEBINDINGMODE", "ALLOWVOLUMEEXPANSION", "AGE"] + def row(r): + return [r["metadata"]["name"], r["provisioner"], + r["reclaimPolicy"], r["volumeBindingMode"], + str(r.get("allowVolumeExpansion", False)), "90d"] + return resources, headers, row, False + + if api_version == "snapshot.storage.k8s.io/v1" and kind == "VolumeSnapshotClass": + resources = [_build_volume_snapshot_class(vsc) for vsc in VOLUME_SNAPSHOT_CLASSES] + headers = ["NAME", "DRIVER", "DELETIONPOLICY", "AGE"] + def row(r): + return [r["metadata"]["name"], r["driver"], r["deletionPolicy"], "90d"] + return resources, headers, row, False + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == 
"kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + 
return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." 
+ + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: 
NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-create/instruction.md b/evaluation/without_skills/rh-virt__vm-snapshot-create/instruction.md new file mode 100644 index 00000000..34f38f23 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-create/instruction.md @@ -0,0 +1,12 @@ +# VM Snapshot Creation Task + +You are an OpenShift Virtualization administrator. Create a snapshot of VM `production-db` in namespace `prod-vms`. + +## Requirements +- Verify snapshot prerequisites (storage support, guest agent) +- Define the snapshot specification +- Address snapshot consistency levels and monitoring + +Use MCP tools to examine the cluster. Work autonomously — do not wait for user confirmation at any step. Document your methodology, findings, and snapshot plan in `/root/report.md`. 
+ +If reference documentation or skills are available in this environment, consult them before beginning work. diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-create/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-snapshot-create/solution/solve.sh new file mode 100644 index 00000000..22659dde --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-create/solution/solve.sh @@ -0,0 +1,39 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Snapshot Plan + +## Target: production-db in prod-vms + +### Snapshot Prerequisite Checks and Creation Steps +1. Check VM `status.volumeSnapshotStatuses` for snapshot support +2. Verify there are no hot-plugged volumes (they block snapshots: stop the VM, then persist or remove them) +3. Check StorageClass has a VolumeSnapshotClass +4. Verify CSI driver supports snapshots +5. Check for guest agent (determines consistency level) +6. Create via resources_create_or_update; poll status.phase (InProgress/Succeeded/Failed) and status.readyToUse + +### Snapshot Type +- **With guest agent**: Application-consistent (freeze/thaw of filesystem) + - `status.indications` will show `GuestAgent` +- **Without guest agent**: Crash-consistent (point-in-time disk state) + - `status.indications` will show `Online` only + +### VirtualMachineSnapshot YAML +```yaml +apiVersion: snapshot.kubevirt.io/v1beta1 +kind: VirtualMachineSnapshot +metadata: + name: production-db-backup-20240301 + namespace: prod-vms +spec: + source: + apiGroup: kubevirt.io + kind: VirtualMachine + name: production-db +``` + +### Monitoring +- Poll `status.phase`: InProgress → Succeeded or Failed +- Check `status.readyToUse: true` before relying on snapshot + +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-create/task.toml b/evaluation/without_skills/rh-virt__vm-snapshot-create/task.toml new file mode 100644 index 00000000..c563a3ed --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-create/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + 
+[metadata] +id = "rh-virt__vm-snapshot-create" +name = "rh-virt VM Snapshot Creation Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-snapshot-create", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/llm_judge.py new file mode 100644 index 00000000..cf067a9c --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "volume_snapshot_class", "file": "/root/report.md", "question": "Does the report check for VolumeSnapshotClass as a prerequisite for CSI snapshot support?", "reference": "A skilled report verifies VolumeSnapshotClass exists. An unskilled report attempts snapshots without checking prerequisites."}, + {"id": "hot_plugged_blocker", "file": "/root/report.md", "question": "Does the report note that hot-plugged volumes block snapshot creation entirely?", "reference": "A skilled report checks for hot-plugged volumes. An unskilled report doesn't know about this blocker."}, + {"id": "consistency_levels", "file": "/root/report.md", "question": "Does the report distinguish application-consistent (GuestAgent) from crash-consistent (Online only) snapshots?", "reference": "A skilled report checks status.indications for GuestAgent presence. 
An unskilled report doesn't distinguish consistency levels."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/test.sh b/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/test_outputs.py new file mode 100644 index 00000000..c4189fb6 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-create/tests/test_outputs.py @@ -0,0 +1,77 @@ +""" +Tests for rh-virt__vm-snapshot-create per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_snapshot(self): + content = read_report().lower() + assert "snapshot" in content, "report should mention snapshots" + + def test_mentions_vm(self): + content = read_report().lower() + assert any(t in content for t in ["vm", "virtual machine", "virtualmachine"]), ( + "report should reference the target VM" + ) + + +class TestSkillDependent: + def test_volume_snapshot_class(self): + """Skill: VolumeSnapshotClass prerequisite for CSI snapshot support.""" + c = read_report().lower() + assert any(t in c for t in ["volumesnapshotclass", "volume snapshot class", "snapshot class", "csi driver"]), ( + "should mention VolumeSnapshotClass for snapshot prerequisites" + ) + + def test_quiesce_consistency(self): + """Skill: Quiesce/freeze for application-consistent snapshots; guest agent.""" + c = read_report().lower() + assert any(t in c for t in ["quiesce", "freeze", "thaw", "guest agent", "application-consistent", "qemu-guest-agent"]), ( + "should discuss quiesce/freeze for consistency" + ) + + def test_snapshot_cr_structure(self): + """Skill: VirtualMachineSnapshot CR with spec.source.""" + c = read_report().lower() + assert 
"virtualmachinesnapshot" in c and any(t in c for t in ["spec", "source", "snapshot.kubevirt", "apiversion"]), ( + "should define VirtualMachineSnapshot resource structure" + ) + + def test_hot_plugged_blocker(self): + """Skill: Hot-plugged volumes block snapshot creation.""" + c = read_report().lower() + assert any(t in c for t in ["hot-plug", "hotplug", "hot plug", "block snapshot", "cannot snapshot"]), ( + "should address hot-plugged volumes blocking snapshots" + ) + + def test_status_indications(self): + """Skill: status.indications (GuestAgent, Online) for consistency level.""" + c = read_report().lower() + assert any(t in c for t in ["indications", "guestagent", "online", "status.phase", "inprogress", "succeeded"]), ( + "should reference snapshot status/indications" + ) + + def test_guest_agent_connected_check(self): + """Docs teach checking AgentConnected condition to determine if + application-consistent (vs crash-consistent) snapshots are possible. + Without docs, agents don't check guest agent status before snapshot.""" + c = read_report().lower() + assert any(t in c for t in [ + "agentconnected", "agent connected", "guest agent", + "application-consistent", "crash-consistent", + ]), "should check AgentConnected for snapshot consistency level" diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-delete/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-snapshot-delete/environment/Dockerfile new file mode 100644 index 00000000..a76f03e8 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-delete/environment/Dockerfile @@ -0,0 +1,50 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ 
+current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-delete/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-snapshot-delete/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..2e083d72 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-delete/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1458 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. 
+ +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 
1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", 
"production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", 
"compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", 
"rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", 
"virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM 
advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + 
"VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == 
"AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in 
vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + 
"devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + 
"metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": 
"ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + 
"vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + 
"vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) 
+ if labels.get(k.strip()) == v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, 
headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], 
s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], 
+ } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No 
resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# 
═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-delete/instruction.md b/evaluation/without_skills/rh-virt__vm-snapshot-delete/instruction.md new file mode 100644 index 00000000..3058c144 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-delete/instruction.md @@ -0,0 +1,12 @@ +# VM Snapshot Deletion Task + +You are an OpenShift Virtualization administrator. Delete snapshot `production-db-backup-20240215` for VM `production-db` in namespace `prod-vms`. + +## Requirements +- Verify the snapshot is safe to delete (no active restores, not the last snapshot) +- Include user confirmation safeguards +- Verify deletion completed + +Use MCP tools to examine the cluster. Document your methodology, findings, and deletion plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-delete/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-snapshot-delete/solution/solve.sh new file mode 100644 index 00000000..11098bb3 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-delete/solution/solve.sh @@ -0,0 +1,26 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Snapshot Deletion Plan + +## Target: production-db-backup-20240215 + +### Safety Checks +1. **Restore conflict check**: Verify no active VirtualMachineRestore references this snapshot + - If snapshot is in use by a restore operation, deletion will fail +2. **Last snapshot warning**: List all snapshots for production-db + - Other snapshots exist (production-db-backup-20240301) — NOT the last snapshot + - If this were the only remaining snapshot, show explicit warning + +### Deletion Procedure +1. Verify snapshot exists (apiVersion: snapshot.kubevirt.io/v1beta1, kind: VirtualMachineSnapshot) +2. Check for active VirtualMachineRestore resources (snapshot in use blocks deletion) +3. List other snapshots for production-db via labelSelector vm.kubevirt.io/name +4. Request user confirmation (proceed yes/no) +5. Delete snapshot via resources_delete +6. Verify deletion completed +7. Impact: Storage freed, recovery point removed + +### Note +This is NOT the last snapshot — production-db-backup-20240301 remains available for restore. 
+ +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-delete/task.toml b/evaluation/without_skills/rh-virt__vm-snapshot-delete/task.toml new file mode 100644 index 00000000..7d13e981 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-delete/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-snapshot-delete" +name = "rh-virt VM Snapshot Deletion Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-snapshot-delete", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL = "${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/llm_judge.py new file mode 100644 index 00000000..92546360 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "restore_conflict", "file": "/root/report.md", "question": "Does the report check for active VirtualMachineRestore before deleting a snapshot?", "reference": "A skilled report checks for active restores. An unskilled report deletes without checking conflicts."}, + {"id": "last_snapshot_warning", "file": "/root/report.md", "question": "Does the report warn when deleting the only remaining snapshot for a VM?", "reference": "A skilled report warns about loss of last recovery point. 
An unskilled report deletes without warning."}, + {"id": "label_selector_filter", "file": "/root/report.md", "question": "Does the report use spec.source.name or vm.kubevirt.io/name label to list other snapshots for the same VM?", "reference": "A skilled report uses proper filtering to find related snapshots. An unskilled report lists all snapshots without VM filtering."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
 (truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": [], "passed": 0, "total": 0, "score": 0.0}, indent=2)) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/test.sh b/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/test_outputs.py new file mode 100644 index 00000000..f7220d55 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-delete/tests/test_outputs.py @@ -0,0 +1,71 @@ +""" +Tests for rh-virt__vm-snapshot-delete per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_snapshot(self): + content = read_report().lower() + assert "snapshot" in content, "report should mention snapshots" + + def test_mentions_deletion(self): + content = read_report().lower() + assert "delet" in content, "report should discuss deletion" + + +class TestSkillDependent: + def test_restore_conflict_check(self): + """Skill: Active VirtualMachineRestore blocks snapshot deletion.""" + c = read_report().lower() + assert any(t in c for t in ["virtualmachinerestore", "restore", "in use", "active restore", "block delet"]) and ( + "restore" in c or "conflict" in c + ), ( + "should check for active restore blocking deletion" + ) + + def test_last_snapshot_warning(self): + """Skill: Warn when deleting the only snapshot for a VM.""" + c = read_report().lower() + assert any(t in c for t in ["last snapshot", "only snapshot", "no recovery", "only remaining", "no other snapshot"]) or ( + "last" in c and "snapshot" in c and ("warn" in c or "only" in c) + ), ( + "should warn when deleting the last snapshot for a VM" + ) + + def test_storage_reclaim(self): + """Skill: Storage freed by deletion; recovery point lost.""" + c = 
read_report().lower() + assert any(t in c for t in ["storage freed", "storage reclaim", "freed", "recovery point"]), ( + "should mention storage reclamation or recovery point loss" + ) + + def test_virtualmachinesnapshot_delete(self): + """Skill: Delete VirtualMachineSnapshot resource.""" + c = read_report().lower() + assert any(t in c for t in ["virtualmachinesnapshot", "resources_delete", "delete snapshot"]) and ( + "snapshot" in c + ), ( + "should reference VirtualMachineSnapshot deletion" + ) + + def test_list_other_snapshots(self): + """Skill: List other snapshots for same VM before delete.""" + c = read_report().lower() + assert any(t in c for t in ["spec.source.name", "label selector", "vm.kubevirt.io/name", "other snapshot", "list snapshot", "same vm"]), ( + "should list other snapshots for the VM" + ) diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-list/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-snapshot-list/environment/Dockerfile new file mode 100644 index 00000000..a76f03e8 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-list/environment/Dockerfile @@ -0,0 +1,50 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + 
"command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-list/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-snapshot-list/environment/mcp-servers/mock-virt-mcp.py new file mode 100644 index 00000000..1d1132df --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-list/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1500 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), + 
("prod-vms", {"env": "production"}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, 
"Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", 
"compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", "legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── prod-vms (instruction-specific) ────────────────────────────────── + _vm("production-db", "prod-vms", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true"}, + 8, 16, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", 
"hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = 
[ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows 
attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + [("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + 
"VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = 
hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + agent_connected = True + for ct, cs, cm in vm["conds"]: + if ct == "AgentConnected": + agent_connected = False + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + else: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": 
_firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": 
n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": 
_uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": "Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": 
"Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": 
"virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, + # ── prod-vms / production-db (instruction-specific) ─────────────────── + { + "name": "production-db-backup-20260210", + "namespace": "prod-vms", + "vm_name": "production-db", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-10T08:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-proddb-root-20260210"}, + ], + }, + { + "name": "production-db-snap-20260218", + "namespace": "prod-vms", + "vm_name": "production-db", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-18T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-proddb-root-20260218"}, + ], + }, + { + "name": "production-db-snap-failed", + "namespace": "prod-vms", + "vm_name": "production-db", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-22T11:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + "target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": 
"Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + 
"status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == 
v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", "STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if 
api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == "cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + 
return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + "current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, 
default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") == v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." 
+ + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + qualified = kind.lower() + "s" + ("." + apiVersion.split("/")[0] if "/" in apiVersion else "") + return f'Error from server (NotFound): {qualified} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: 
NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-list/instruction.md b/evaluation/without_skills/rh-virt__vm-snapshot-list/instruction.md new file mode 100644 index 00000000..2c6ed187 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-list/instruction.md @@ -0,0 +1,12 @@ +# VM Snapshot Listing Task + +You are an OpenShift Virtualization administrator. List and inspect all snapshots for VM `production-db` in namespace `prod-vms`. + +## Requirements +- List all snapshots with their status and readiness +- Show creation timestamps +- Identify any failed or incomplete snapshots + +Use MCP tools to query snapshot data. Document your methodology and write the snapshot inventory in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-list/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-snapshot-list/solution/solve.sh new file mode 100644 index 00000000..2e33f350 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-list/solution/solve.sh @@ -0,0 +1,30 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Snapshot Inventory + +## Snapshots for production-db in prod-vms + +### Query Method +- API: `resources_list(apiVersion="snapshot.kubevirt.io/v1beta1", kind="VirtualMachineSnapshot", namespace="prod-vms")` +- Filter: `labelSelector: vm.kubevirt.io/name=production-db` +- Fallback: If label missing, filter by `spec.source.name == "production-db"` + +### Snapshot List +| Name | Status | Ready | Created | +|------|--------|-------|---------| +| production-db-backup-20240301 | Succeeded | true | 2024-03-01T10:00:00Z | +| production-db-backup-20240215 | Succeeded | true | 2024-02-15T08:30:00Z | + +### Status Fields +- `status.phase`: InProgress, Succeeded, Failed +- `status.readyToUse`: true/false — snapshot can be used for restore +- `spec.source.name`: Source VM name +- `metadata.creationTimestamp`: Creation time + +### Actions +- Restore: "Restore VM production-db from snapshot " +- Delete: "Delete snapshot " + +### No failed or incomplete snapshots found. 
+ +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-list/task.toml b/evaluation/without_skills/rh-virt__vm-snapshot-list/task.toml new file mode 100644 index 00000000..3e9cc1cd --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-list/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-snapshot-list" +name = "rh-virt VM Snapshot Listing Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-snapshot-list", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/llm_judge.py new file mode 100644 index 00000000..aa42d89d --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "ready_to_use_status", "file": "/root/report.md", "question": "Does the report show readyToUse status indicating which snapshots are safe to restore?", "reference": "A skilled report includes readyToUse for each snapshot. An unskilled report only shows names and dates."}, + {"id": "phase_and_creation", "file": "/root/report.md", "question": "Does the report show status.phase (Succeeded/Failed/InProgress) and creation timestamp for each snapshot?", "reference": "A skilled report includes phase and timestamp. 
An unskilled report shows minimal snapshot metadata."}, + {"id": "label_selector_filtering", "file": "/root/report.md", "question": "Does the report mention using the vm.kubevirt.io/name label or label selector to filter or identify snapshots belonging to a specific VM?", "reference": "A skilled report references the vm.kubevirt.io/name label for filtering snapshots by source VM, or shows label selector parameters in API calls. An unskilled report lists snapshots without mentioning the KubeVirt label-based filtering mechanism."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/test.sh b/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/test_outputs.py new file mode 100644 index 00000000..06ac48d3 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-list/tests/test_outputs.py @@ -0,0 +1,62 @@ +""" +Tests for rh-virt__vm-snapshot-list per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_snapshots(self): + content = read_report().lower() + assert "snapshot" in content, "report should mention snapshots" + + def test_has_structured_output(self): + content = read_report() + assert "|" in content or "- " in content, "report should have structured output (table or list)" + + +class TestSkillDependent: + def test_ready_to_use_status(self): + """Skill: readyToUse status for restore readiness.""" + c = read_report().lower() + assert any(t in c for t in ["readytouse", "ready to use", "ready for restore"]), ( + "should reference readyToUse status for snapshot readiness" + ) + + def test_creation_timestamp(self): + """Skill: metadata.creationTimestamp or creation time.""" + c = read_report().lower() + assert any(t in c for t in ["creationtimestamp", "creation timestamp", "created", "when"]), ( + "should show creation timestamp for each snapshot" + ) + + def test_phase_status(self): + """Skill: status.phase (Succeeded, Failed, InProgress).""" + c = read_report().lower() + assert any(t in c for t in ["succeeded", "failed", "inprogress", "status.phase", "phase"]) and ( + "succeeded" in c or "failed" in c or "phase" in c + ), 
( + "should show phase (Succeeded/Failed/InProgress)" + ) + + def test_label_selector_for_vm_filtering(self): + """Skill teaches using vm.kubevirt.io/name label selector to + filter snapshots by source VM. Without skill, agents list all + snapshots without label-based filtering.""" + c = read_report() + assert "vm.kubevirt.io" in c or "labelSelector" in c or "label selector" in c.lower(), ( + "should reference vm.kubevirt.io/name label for snapshot filtering" + ) diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-restore/environment/Dockerfile b/evaluation/without_skills/rh-virt__vm-snapshot-restore/environment/Dockerfile new file mode 100644 index 00000000..a76f03e8 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-restore/environment/Dockerfile @@ -0,0 +1,50 @@ +FROM ubuntu:24.04 + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y \ + python3 \ + python3-pip \ + python3-venv \ + curl \ + jq \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /root + +RUN pip3 install --break-system-packages \ + pyyaml==6.0.1 \ + fastmcp + +ENV KUBECONFIG=/root/.kube/config +RUN mkdir -p /root/.kube && echo '\ +apiVersion: v1\n\ +kind: Config\n\ +current-context: ocp-prod\n\ +clusters:\n\ +- name: ocp-prod\n\ + cluster:\n\ + server: https://api.ocp-prod.example.com:6443\n\ +contexts:\n\ +- name: ocp-prod\n\ + context:\n\ + cluster: ocp-prod\n\ + user: admin\n\ + namespace: default\n\ +users:\n\ +- name: admin\n\ + user:\n\ + token: mock-token-for-testing\n' > /root/.kube/config + +COPY mcp-servers /root/.mcp-servers + +RUN echo '{ \ + "mcpServers": { \ + "openshift-virtualization": { \ + "command": "python3", \ + "args": ["/root/.mcp-servers/mock-virt-mcp.py"] \ + } \ + } \ +}' > /root/.mcp.json + +WORKDIR /root diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-restore/environment/mcp-servers/mock-virt-mcp.py b/evaluation/without_skills/rh-virt__vm-snapshot-restore/environment/mcp-servers/mock-virt-mcp.py new file mode 
100644 index 00000000..2e083d72 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-restore/environment/mcp-servers/mock-virt-mcp.py @@ -0,0 +1,1458 @@ +#!/usr/bin/env python3 +""" +Mock OpenShift MCP Server for OpenShift Virtualization. + +Faithfully implements the tool interface of: + https://github.com/openshift/openshift-mcp-server +Enabled toolsets: config, core, kubevirt + +Simulated OpenShift cluster: + Cluster: ocp-virt-prod (OpenShift 4.15, K8s 1.28) + Namespaces: virt-prod-dc1, virt-prod-dc2, virt-staging, virt-dev, + openshift-cnv, openshift-compliance, openshift-monitoring, default + Nodes: 8 workers (hypervisor-class) + VMs: 32 KubeVirt VirtualMachines + Security: 5 VulnerabilityReports in openshift-compliance +""" + +import hashlib +import json +from typing import Optional + +import yaml +from fastmcp import FastMCP + +mcp = FastMCP("openshift-virtualization") + +CLUSTER = "ocp-virt-prod" +API_URL = "https://api.ocp-virt-prod.example.com:6443" +K8S_VER = "v1.28.12+f26e58e" +OCP_VER = "4.15.8" +NOW = "2026-03-02T12:00:00Z" +CREATED = "2025-11-15T10:00:00Z" + +# ═══════════════════════════════════════════════════════════════════════════ +# COMPACT DATA +# ═══════════════════════════════════════════════════════════════════════════ + +NAMESPACES = [ + ("virt-prod-dc1", {"env": "production", "dc": "dc1"}), + ("virt-prod-dc2", {"env": "production", "dc": "dc2"}), + ("virt-staging", {"env": "staging"}), + ("virt-dev", {"env": "development"}), + ("openshift-cnv", {"operator": "kubevirt-hyperconverged"}), + ("openshift-compliance", {"operator": "compliance"}), + ("openshift-monitoring", {}), + ("default", {}), +] + + +def _n(name, zone, status, unschedulable, cpu_cap, cpu_use, mem_cap, mem_use, pods, + taints=None, maint=None, itype="m5.4xlarge"): + return dict(name=name, zone=zone, status=status, unschedulable=unschedulable, + cpu_cap=cpu_cap, cpu_use=cpu_use, mem_cap=mem_cap, mem_use=mem_use, + pods=pods, taints=taints or [], maint=maint, 
itype=itype) + + +NODES = [ + _n("hv-prod-dc1-01", "dc1", "Ready", False, 16000, 11840, 65536, 44564, 12), + _n("hv-prod-dc1-02", "dc1", "Ready", False, 16000, 14080, 65536, 53739, 14), + _n("hv-prod-dc1-03", "dc1", "Ready,SchedulingDisabled", True, 16000, 1920, 65536, 9830, 6, + taints=[{"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}], + maint="Scheduled firmware update — ETA 6 hours"), + _n("hv-prod-dc2-01", "dc2", "Ready", False, 16000, 11360, 65536, 41287, 12), + _n("hv-prod-dc2-02", "dc2", "Ready", False, 16000, 12640, 65536, 49807, 15), + _n("hv-staging-01", "staging", "Ready", False, 8000, 4160, 32768, 15728, 10, itype="m5.2xlarge"), + _n("hv-staging-02", "staging", "Ready", False, 8000, 3040, 32768, 11468, 8, itype="m5.2xlarge"), + _n("hv-dev-01", "dev", "Ready", False, 8000, 4880, 32768, 18022, 14, itype="m5.2xlarge"), +] + + +def _vm(name, ns, node, os, env, labels, cpu, mem, status, ready, last_seen, + conds=None, pinned=False): + return dict(name=name, ns=ns, node=node, os=os, env=env, labels=labels, + cpu=cpu, mem=mem, status=status, ready=ready, + last_seen=last_seen, conds=conds or [], pinned=pinned) + + +VMS = [ + # ── virt-prod-dc1 / hv-prod-dc1-01 (4) ────────────────────────────── + _vm("vm-web-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true", "compliance/soc2": "true", + "criticality": "high", "customer-facing": "true"}, 4, 8, "Running", True, 1), + _vm("vm-web-prod-02", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "web", "compliance/pci-dss": "true"}, 4, 8, "Running", True, 1), + _vm("vm-lb-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-8.8", "production", + {"app": "lb", "criticality": "high", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-monitor-prod-01", "virt-prod-dc1", "hv-prod-dc1-01", "rhel-9.3", "production", + {"app": "monitoring"}, 2, 4, "Running", True, 1), + + # ── virt-prod-dc1 / hv-prod-dc1-02 (4 — CRITICAL 
utilization) ─────── + _vm("vm-web-prod-03", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "web", "customer-facing": "true"}, 4, 8, "Running", True, 2), + _vm("vm-api-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true", "criticality": "high"}, 4, 8, "Running", True, 1), + _vm("vm-cache-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "cache", "ha": "true"}, 2, 4, "Running", True, 1), + _vm("vm-etl-prod-01", "virt-prod-dc1", "hv-prod-dc1-02", "rhel-8.9", "production", + {"app": "etl", "compliance/hipaa": "true"}, + 4, 8, "Running", True, 1, + conds=[("Degraded", "True", "High I/O latency: avg write latency 45ms (threshold 20ms)")]), + + # ── virt-prod-dc1 / hv-prod-dc1-03 (2 — MAINTENANCE node) ─────────── + _vm("vm-backup-prod-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-8.8", "production", + {"app": "backup", "criticality": "low"}, 2, 4, "Stopped", False, 3, pinned=True), + _vm("vm-legacy-auth-01", "virt-prod-dc1", "hv-prod-dc1-03", "rhel-7.9", None, + {"app": "auth", "criticality": "high", "legacy": "true"}, + 2, 4, "Running", True, 3, + conds=[("Degraded", "True", "EOL operating system: RHEL 7.9 reached end of life")]), + + # ── virt-prod-dc2 / hv-prod-dc2-01 (4) ────────────────────────────── + _vm("vm-api-prod-02", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "api", "compliance/soc2": "true"}, 4, 8, "Running", True, 2), + _vm("vm-db-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/pci-dss": "true", + "compliance/soc2": "true"}, 8, 16, "Running", True, 1), + _vm("vm-queue-prod-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-9.2", "production", + {"app": "queue", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + _vm("vm-legacy-pay-01", "virt-prod-dc2", "hv-prod-dc2-01", "rhel-8.7", None, + {"app": "payment-gateway", "criticality": "high", 
"legacy": "true"}, + 4, 8, "Running", True, 2), + + # ── virt-prod-dc2 / hv-prod-dc2-02 (5 — WARNING utilization) ──────── + _vm("vm-db-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.3", "production", + {"app": "db", "criticality": "high", "compliance/soc2": "true"}, + 8, 16, "Running", True, 1), + _vm("vm-cache-prod-02", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "cache"}, 2, 4, "Running", False, 12, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 12 days")]), + _vm("vm-batch-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.9", "production", + {"app": "batch"}, 4, 8, "Stopped", False, 4), + _vm("vm-legacy-reports-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-8.6", None, + {"app": "financial-reporting", "legacy": "true"}, + 2, 4, "Running", True, 6), + _vm("vm-log-prod-01", "virt-prod-dc2", "hv-prod-dc2-02", "rhel-9.2", "production", + {"app": "logging", "compliance/soc2": "true"}, 2, 4, "Running", True, 1), + + # ── virt-staging / hv-staging-01 (4) ───────────────────────────────── + _vm("vm-web-stg-01", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 1), + _vm("vm-web-stg-02", "virt-staging", "hv-staging-01", "rhel-9.2", "staging", + {"app": "web"}, 2, 4, "Running", True, 2), + _vm("vm-api-stg-01", "virt-staging", "hv-staging-01", "rhel-8.9", "staging", + {"app": "api"}, 2, 4, "Running", True, 2), + _vm("vm-perf-stg-01", "virt-staging", "hv-staging-01", "rhel-9.3", "staging", + {"app": "perf-test"}, 4, 8, "Running", True, 1), + + # ── virt-staging / hv-staging-02 (3) ───────────────────────────────── + _vm("vm-db-stg-01", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Running", True, 1), + _vm("vm-db-stg-02", "virt-staging", "hv-staging-02", "rhel-9.2", "staging", + {"app": "db"}, 4, 8, "Paused", False, 3), + _vm("vm-qa-stg-01", "virt-staging", "hv-staging-02", "rhel-8.9", "staging", + {"app": "qa"}, 2, 4, "Running", 
True, 1), + + # ── virt-dev / hv-dev-01 (6) ───────────────────────────────────────── + _vm("vm-dev-01", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-02", "virt-dev", "hv-dev-01", "rhel-8.8", "development", + {"app": "dev"}, 2, 4, "Running", True, 2), + _vm("vm-dev-03", "virt-dev", "hv-dev-01", "rhel-8.9", "development", + {"app": "dev"}, 2, 4, "Stopped", False, 14, + conds=[("AgentConnected", "False", "Guest agent not responding")]), + _vm("vm-sandbox-01", "virt-dev", "hv-dev-01", "rhel-9.2", "development", + {"app": "sandbox"}, 2, 4, "Running", True, 1), + _vm("vm-test-01", "virt-dev", "hv-dev-01", "rhel-9.3", "development", + {"app": "test"}, 2, 4, "Running", True, 1), + _vm("vm-archive-01", "virt-dev", "hv-dev-01", "rhel-8.6", "development", + {"app": "archive", "legacy": "true"}, + 2, 4, "Running", False, 45, + conds=[("AgentConnected", "False", + "Guest agent has not responded for 45 days")]), +] + + +def _adv(adv_id, name, synopsis, severity, cvss, compliance, deadline, + description, affected, remediation_available=True): + return dict(id=adv_id, name=name, synopsis=synopsis, severity=severity, + cvss=cvss, compliance=compliance, deadline=deadline, + description=description, affected=affected, + remediation_available=remediation_available) + + +ADVISORIES = [ + _adv("RHSA-2026:1234", "rhsa-2026-1234", + "Critical: kernel security update", "Critical", 9.8, + ["pci-dss", "soc2"], 30, + "Remote code execution in kernel network stack allows unauthenticated " + "attackers to execute arbitrary code via crafted packets.", + [("vm-web-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-02", "virt-prod-dc1", "Vulnerable"), + ("vm-db-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-web-stg-01", "virt-staging", "Remediated"), + ("vm-web-stg-02", "virt-staging", "Remediated")]), + _adv("RHSA-2026:2345", "rhsa-2026-2345", + "Important: openssl 
security update", "Important", 7.8, + ["soc2"], 60, + "Buffer overflow in OpenSSL TLS handshake processing allows " + "authenticated attackers to escalate privileges.", + [("vm-api-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-api-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-db-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-queue-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-log-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-api-stg-01", "virt-staging", "Remediated"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:3456", "rhsa-2026-3456", + "Moderate: glibc security update", "Moderate", 5.4, + ["hipaa"], 90, + "Information disclosure in glibc DNS resolver allows adjacent " + "network attackers to read portions of process memory.", + [("vm-etl-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-cache-prod-02", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-01", "virt-dev", "Vulnerable"), + ("vm-dev-02", "virt-dev", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-dev-03", "virt-dev", "Remediated"), + ("vm-archive-01", "virt-dev", "Remediated")]), + _adv("RHSA-2026:4567", "rhsa-2026-4567", + "Important: httpd security update", "Important", 7.2, + ["pci-dss"], 90, + "Request smuggling in Apache httpd allows attackers to bypass " + "access controls on payment-processing endpoints.", + [("vm-legacy-pay-01", "virt-prod-dc2", "Vulnerable"), + ("vm-lb-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-auth-01", "virt-prod-dc1", "Vulnerable"), + ("vm-web-prod-03", "virt-prod-dc1", "Vulnerable"), + ("vm-legacy-reports-01", "virt-prod-dc2", "Remediated")]), + _adv("RHSA-2026:5678", "rhsa-2026-5678", + "Low: systemd information disclosure", "Low", 3.1, + [], None, + "Information disclosure in systemd-journald allows local users to " + "read journal entries from other user sessions under specific " + "SELinux configurations.", + 
[("vm-monitor-prod-01", "virt-prod-dc1", "Vulnerable"), + ("vm-batch-prod-01", "virt-prod-dc2", "Vulnerable"), + ("vm-db-stg-02", "virt-staging", "Vulnerable"), + ("vm-archive-01", "virt-dev", "Vulnerable")], + remediation_available=False), +] + +# Build per-VM advisory lookup +_VM_ADV = {} +for _a in ADVISORIES: + for _vn, _vns, _vs in _a["affected"]: + _VM_ADV.setdefault(_vn, []).append( + {"id": _a["id"], "severity": _a["severity"], "status": _vs, + "remediationAvailable": _a["remediation_available"]}) + +EVENTS = [ + ("virt-prod-dc1", "Warning", "NodeSchedulingDisabled", + "Node/hv-prod-dc1-03", + "Node cordoned for maintenance: Scheduled firmware update — ETA 6 hours"), + ("virt-prod-dc2", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-cache-prod-02", + "Guest agent has not responded for 12 days — last contact 2026-02-18"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-archive-01", + "Guest agent has not responded for 45 days — last contact 2026-01-16"), + ("virt-dev", "Warning", "GuestAgentNotResponding", + "VirtualMachine/vm-dev-03", + "Guest agent not responding — VM stopped for 14 days"), + ("virt-prod-dc1", "Warning", "HighIOLatency", + "VirtualMachineInstance/vm-etl-prod-01", + "Average write latency 45ms exceeds threshold 20ms"), + ("virt-prod-dc1", "Warning", "EOLOperatingSystem", + "VirtualMachine/vm-legacy-auth-01", + "RHEL 7.9 has reached end of life — no further security updates"), + ("virt-prod-dc2", "Normal", "GracefulShutdown", + "VirtualMachine/vm-batch-prod-01", + "VM stopped by scheduler after batch job completion"), + ("virt-staging", "Normal", "UserPaused", + "VirtualMachineInstance/vm-db-stg-02", + "VM paused by user request"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-1234", + "Vulnerability scan completed: 6 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-2345", + "Vulnerability scan 
completed: 7 affected VMs, 5 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-3456", + "Vulnerability scan completed: 8 affected VMs, 6 vulnerable"), + ("openshift-compliance", "Normal", "ScanCompleted", + "VulnerabilityReport/rhsa-2026-4567", + "Vulnerability scan completed: 5 affected VMs, 4 vulnerable"), + ("openshift-compliance", "Warning", "NoRemediationAvailable", + "VulnerabilityReport/rhsa-2026-5678", + "Advisory RHSA-2026:5678 has no vendor remediation — " + "compensating controls required for 4 vulnerable VMs"), +] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE BUILDERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _os_parts(os_str): + """Parse 'rhel-9.3' into (id, version, pretty).""" + parts = os_str.split("-", 1) + oid = parts[0] + ver = parts[1] if len(parts) > 1 else "" + major = ver.split(".")[0] if ver else "" + pretty = f"Red Hat Enterprise Linux {major} ({ver})" if oid == "rhel" else os_str + return oid, ver, pretty + + +def _uid(name): + return hashlib.md5(name.encode()).hexdigest()[:8] + "-0000-0000-0000-" + \ + hashlib.md5(name.encode()).hexdigest()[:12] + + +def _pod_hash(name): + return hashlib.md5(name.encode()).hexdigest()[:5] + + +def _firmware_uuid(name): + h = hashlib.sha256(name.encode()).hexdigest() + return f"{h[:8]}-{h[8:12]}-4{h[13:16]}-{h[16:20]}-{h[20:32]}" + + +def _firmware_serial(name): + h = hashlib.sha256((name + "-serial").encode()).hexdigest()[:12] + return f"sn-{h}" + + +def _build_vm(vm): + """Build a kubevirt.io/v1 VirtualMachine resource dict.""" + labels = {"kubevirt.io/domain": vm["name"], "vm.kubevirt.io/name": vm["name"]} + if vm["env"]: + labels["env"] = vm["env"] + labels.update(vm["labels"]) + + annotations = {"vm.kubevirt.io/os": vm["os"]} + adv_map = _VM_ADV.get(vm["name"]) + if adv_map: + annotations["security.openshift.io/vulnerabilities"] = json.dumps( + {a["id"]: 
a["status"] for a in adv_map}) + + is_running = vm["status"] in ("Running", "Paused") + conditions = [ + {"type": "Ready", "status": str(vm["ready"]), + "lastTransitionTime": CREATED}, + ] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm, + "lastTransitionTime": CREATED}) + # An explicit AgentConnected condition suppresses the default one added below. + agent_connected = not any(ct == "AgentConnected" for ct, _, _ in vm["conds"]) + if agent_connected and is_running: + conditions.append({"type": "AgentConnected", "status": "True", + "lastTransitionTime": CREATED}) + + res = { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachine", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"]), + "labels": labels, + "annotations": annotations, + "creationTimestamp": CREATED, + }, + "spec": { + "running": is_running, + "template": { + "metadata": {"labels": { + "kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"], + }}, + "spec": { + "domain": { + "cpu": {"cores": vm["cpu"], "sockets": 1, "threads": 1}, + "memory": {"guest": f"{vm['mem']}Gi"}, + "resources": { + "requests": {"cpu": str(vm["cpu"]), + "memory": f"{vm['mem']}Gi"}, + }, + "firmware": { + "uuid": _firmware_uuid(vm["name"]), + "serial": _firmware_serial(vm["name"]), + }, + }, + "volumes": [ + {"name": "rootdisk", + "persistentVolumeClaim": { + "claimName": f"{vm['name']}-rootdisk"}}, + ], + }, + }, + }, + "status": { + "printableStatus": vm["status"], + "ready": vm["ready"], + "created": True, + "conditions": conditions, + }, + } + if vm.get("pinned"): + res["spec"]["template"]["spec"]["nodeSelector"] = { + "kubernetes.io/hostname": vm["node"] + } + return res + + +def _build_vmi(vm): + """Build a kubevirt.io/v1 VirtualMachineInstance (only for running/paused VMs).""" + if vm["status"] not in ("Running", "Paused"): + return None + oid, ver, pretty = _os_parts(vm["os"]) + phase = "Running" if 
vm["status"] == "Running" else "Paused" + ip_hash = int(hashlib.md5(vm["name"].encode()).hexdigest()[:4], 16) + ip = f"10.244.{(ip_hash >> 8) & 0xFF}.{ip_hash & 0xFF}" + + conditions = [{"type": "Ready", "status": str(vm["ready"])}] + for ct, cs, cm in vm["conds"]: + conditions.append({"type": ct, "status": cs, "message": cm}) + + return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "metadata": { + "name": vm["name"], + "namespace": vm["ns"], + "uid": _uid(vm["name"] + "-vmi"), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", "kind": "VirtualMachine", + "name": vm["name"], "uid": _uid(vm["name"]), + }], + "creationTimestamp": CREATED, + }, + "status": { + "phase": phase, + "nodeName": vm["node"], + "guestOSInfo": {"id": oid, "version": ver, "prettyName": pretty}, + "interfaces": [{"ipAddress": ip, "name": "default"}], + "conditions": conditions, + "migrationMethod": "LiveMigration", + "activePods": {_uid(vm["name"] + "-pod"): vm["node"]}, + }, + } + + +def _build_node(n): + """Build a v1/Node resource dict.""" + labels = { + "kubernetes.io/hostname": n["name"], + "node-role.kubernetes.io/worker": "", + "topology.kubernetes.io/zone": n["zone"], + "node.kubernetes.io/instance-type": n["itype"], + } + if not n["unschedulable"]: + labels["kubevirt.io/schedulable"] = "true" + annotations = {} + if n["maint"]: + annotations["machine.openshift.io/maintenance"] = n["maint"] + + conditions = [{"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}] + if n["unschedulable"]: + conditions.append({"type": "MemoryPressure", "status": "False"}) + conditions.append({"type": "DiskPressure", "status": "False"}) + + cpu_str = str(n["cpu_cap"] // 1000) + mem_ki = n["mem_cap"] * 1024 + + res = { + "apiVersion": "v1", + "kind": "Node", + "metadata": { + "name": n["name"], + "uid": _uid(n["name"]), + "labels": labels, + "annotations": annotations, + 
"creationTimestamp": CREATED, + }, + "spec": { + "unschedulable": n["unschedulable"], + }, + "status": { + "conditions": conditions, + "capacity": { + "cpu": cpu_str, "memory": f"{mem_ki}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "allocatable": { + "cpu": f"{n['cpu_cap'] - 200}m", + "memory": f"{mem_ki - 1024}Ki", "pods": "250", + "devices.kubevirt.io/kvm": "1", + "devices.kubevirt.io/tun": "1", + "devices.kubevirt.io/vhost-net": "1", + }, + "nodeInfo": { + "kubeletVersion": K8S_VER, + "osImage": "Red Hat Enterprise Linux CoreOS 415.92.202402130034-0", + "containerRuntimeVersion": "cri-o://1.28.4", + "kernelVersion": "5.14.0-284.52.1.el9_2.x86_64", + "architecture": "amd64", + "operatingSystem": "linux", + }, + }, + } + if n["taints"]: + res["spec"]["taints"] = n["taints"] + return res + + +def _build_vuln_report(adv): + """Build a security.openshift.io/v1 VulnerabilityReport resource.""" + vuln_count = sum(1 for _, _, s in adv["affected"] if s == "Vulnerable") + rem_count = sum(1 for _, _, s in adv["affected"] if s == "Remediated") + return { + "apiVersion": "security.openshift.io/v1", + "kind": "VulnerabilityReport", + "metadata": { + "name": adv["name"], + "namespace": "openshift-compliance", + "uid": _uid(adv["name"]), + "labels": { + "advisory-id": adv["id"], + "severity": adv["severity"].lower(), + }, + "creationTimestamp": CREATED, + }, + "spec": { + "advisoryId": adv["id"], + "synopsis": adv["synopsis"], + "severity": adv["severity"], + "cvssScore": adv["cvss"], + "complianceImpact": adv["compliance"], + "remediationDeadlineDays": adv["deadline"], + "remediationAvailable": adv["remediation_available"], + "description": adv["description"], + "affectedWorkloads": [ + {"name": vn, "namespace": vns, "kind": "VirtualMachine", + "status": vs, "remediationAvailable": adv["remediation_available"]} + for vn, vns, vs in adv["affected"] + ], + }, + "status": { + "phase": 
"Completed", + "totalAffected": len(adv["affected"]), + "totalVulnerable": vuln_count, + "totalRemediated": rem_count, + "lastScanTime": NOW, + }, + } + + +def _build_ns(name, labels): + return { + "apiVersion": "v1", "kind": "Namespace", + "metadata": {"name": name, "uid": _uid(name), "labels": labels, + "creationTimestamp": CREATED}, + "status": {"phase": "Active"}, + } + + +_STORAGE_SIZES = { + "db": "100Gi", "web": "50Gi", "api": "50Gi", "cache": "30Gi", + "queue": "30Gi", "monitoring": "30Gi", "logging": "30Gi", +} + + +_RWO_VMS = {"vm-backup-prod-01", "vm-batch-prod-01", "vm-archive-01"} + +def _build_pvc(vm): + """Build a v1/PersistentVolumeClaim for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "v1", + "kind": "PersistentVolumeClaim", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-pvc"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + "creationTimestamp": CREATED, + }, + "spec": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + "status": { + "phase": "Bound", + "capacity": {"storage": size}, + "accessModes": [access], + }, + } + + +def _build_datavolume(vm): + """Build a cdi.kubevirt.io/v1beta1 DataVolume for a VM's rootdisk.""" + app = vm["labels"].get("app", "") + size = _STORAGE_SIZES.get(app, "30Gi") + access = "ReadWriteOnce" if vm["name"] in _RWO_VMS else "ReadWriteMany" + return { + "apiVersion": "cdi.kubevirt.io/v1beta1", + "kind": "DataVolume", + "metadata": { + "name": f"{vm['name']}-rootdisk", + "namespace": vm["ns"], + "uid": _uid(f"{vm['name']}-dv"), + "labels": { + "vm.kubevirt.io/name": vm["name"], + "app.kubernetes.io/managed-by": "cdi-controller", + }, + 
"creationTimestamp": CREATED, + }, + "spec": { + "source": {"pvc": {"namespace": vm["ns"], + "name": f"{vm['name']}-rootdisk-source"}}, + "pvc": { + "accessModes": [access], + "resources": {"requests": {"storage": size}}, + "storageClassName": "ocs-storagecluster-ceph-rbd", + "volumeMode": "Block", + }, + }, + "status": { + "phase": "Succeeded", + "progress": "100.0%", + "conditions": [ + {"type": "Ready", "status": "True", + "lastTransitionTime": CREATED}, + {"type": "Bound", "status": "True", + "lastTransitionTime": CREATED}, + ], + }, + } + + +SNAPSHOTS = [ + { + "name": "vm-db-prod-01-backup-20260201", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-01T08:00:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260201"}, + ], + }, + { + "name": "vm-db-prod-01-backup-20260215", + "namespace": "virt-prod-dc2", + "vm_name": "vm-db-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-15T10:30:00Z", + "indications": ["Online", "GuestAgent"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-db01-root-20260215"}, + ], + }, + { + "name": "vm-web-prod-01-snap-20260220", + "namespace": "virt-prod-dc1", + "vm_name": "vm-web-prod-01", + "phase": "Succeeded", + "ready_to_use": True, + "creation": "2026-02-20T14:00:00Z", + "indications": ["Online"], + "volume_statuses": [ + {"name": "rootdisk", "volumeSnapshotName": "vsnap-web01-root-20260220"}, + ], + }, + { + "name": "vm-etl-prod-01-snap-failed", + "namespace": "virt-prod-dc1", + "vm_name": "vm-etl-prod-01", + "phase": "Failed", + "ready_to_use": False, + "creation": "2026-02-25T09:00:00Z", + "indications": [], + "volume_statuses": [], + "error": "VolumeSnapshot creation timed out for rootdisk", + }, +] + +RESTORES = [ + { + "name": "restore-vm-web-prod-01-20260220", + "namespace": "virt-prod-dc1", + 
"target_vm": "vm-web-prod-01", + "snapshot_name": "vm-web-prod-01-snap-20260220", + "complete": True, + "creation": "2026-02-22T16:00:00Z", + }, +] + +MIGRATIONS = [ + { + "name": "migration-vm-web-prod-03", + "namespace": "virt-prod-dc1", + "vmi_name": "vm-web-prod-03", + "phase": "Succeeded", + "source_node": "hv-prod-dc1-02", + "target_node": "hv-prod-dc1-01", + "creation": "2026-02-28T11:00:00Z", + }, +] + + +def _build_snapshot(snap): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineSnapshot resource.""" + res = { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineSnapshot", + "metadata": { + "name": snap["name"], + "namespace": snap["namespace"], + "uid": _uid(snap["name"]), + "labels": {"vm.kubevirt.io/name": snap["vm_name"]}, + "creationTimestamp": snap["creation"], + }, + "spec": { + "source": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": snap["vm_name"], + }, + }, + "status": { + "phase": snap["phase"], + "readyToUse": snap["ready_to_use"], + "creationTime": snap["creation"], + "indications": snap["indications"], + "volumeSnapshotStatus": snap["volume_statuses"], + }, + } + if snap.get("error"): + res["status"]["error"] = {"message": snap["error"]} + return res + + +def _build_restore(restore): + """Build a snapshot.kubevirt.io/v1beta1 VirtualMachineRestore resource.""" + return { + "apiVersion": "snapshot.kubevirt.io/v1beta1", + "kind": "VirtualMachineRestore", + "metadata": { + "name": restore["name"], + "namespace": restore["namespace"], + "uid": _uid(restore["name"]), + "creationTimestamp": restore["creation"], + }, + "spec": { + "target": { + "apiGroup": "kubevirt.io", + "kind": "VirtualMachine", + "name": restore["target_vm"], + }, + "virtualMachineSnapshotName": restore["snapshot_name"], + }, + "status": { + "complete": restore["complete"], + "restoreTime": restore["creation"], + }, + } + + +def _build_migration(mig): + """Build a kubevirt.io/v1 VirtualMachineInstanceMigration resource.""" + 
return { + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstanceMigration", + "metadata": { + "name": mig["name"], + "namespace": mig["namespace"], + "uid": _uid(mig["name"]), + "creationTimestamp": mig["creation"], + }, + "spec": { + "vmiName": mig["vmi_name"], + }, + "status": { + "phase": mig["phase"], + "migrationState": { + "sourceNode": mig["source_node"], + "targetNode": mig["target_node"], + "completed": mig["phase"] == "Succeeded", + "startTimestamp": mig["creation"], + }, + }, + } + + +def _build_pod(vm): + """Build a virt-launcher Pod for a running/paused VM.""" + if vm["status"] not in ("Running", "Paused"): + return None + pod_name = f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}" + return { + "apiVersion": "v1", "kind": "Pod", + "metadata": { + "name": pod_name, "namespace": vm["ns"], + "uid": _uid(pod_name), + "labels": {"kubevirt.io/domain": vm["name"], + "vm.kubevirt.io/name": vm["name"]}, + "ownerReferences": [{ + "apiVersion": "kubevirt.io/v1", + "kind": "VirtualMachineInstance", + "name": vm["name"], + }], + "creationTimestamp": CREATED, + }, + "spec": {"nodeName": vm["node"]}, + "status": { + "phase": "Running", + "containerStatuses": [{ + "name": "compute", "ready": True, + "state": {"running": {"startedAt": CREATED}}, + }], + }, + } + + +# ═══════════════════════════════════════════════════════════════════════════ +# FORMATTING HELPERS +# ═══════════════════════════════════════════════════════════════════════════ + +def _table(headers, rows): + """Format as a kubectl-style table with dynamic column widths.""" + widths = [len(h) for h in headers] + str_rows = [[str(c) for c in r] for r in rows] + for r in str_rows: + for i, c in enumerate(r): + if i < len(widths): + widths[i] = max(widths[i], len(c)) + lines = [" ".join(h.ljust(widths[i]) for i, h in enumerate(headers))] + for r in str_rows: + lines.append(" ".join(c.ljust(widths[i]) for i, c in enumerate(r))) + return "\n".join(lines) + + +def _to_yaml(resource): + return 
yaml.dump(resource, default_flow_style=False, sort_keys=False) + + +def _match_labels(labels, selector_str): + if not selector_str: + return True + for sel in selector_str.split(","): + sel = sel.strip() + if "!=" in sel: + k, v = sel.split("!=", 1) + if labels.get(k.strip()) == v.strip(): + return False + elif "==" in sel: + # "key==value" is a valid equality selector; check it before the bare "=" form + k, v = sel.split("==", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif "=" in sel: + k, v = sel.split("=", 1) + if labels.get(k.strip()) != v.strip(): + return False + elif sel.startswith("!"): + if sel[1:] in labels: + return False + elif sel not in labels: + return False + return True + + +def _filter_by_ns(resources, namespace): + if namespace is None: + return resources + return [r for r in resources if r.get("metadata", {}).get("namespace") == namespace] + + +# ═══════════════════════════════════════════════════════════════════════════ +# RESOURCE DISPATCH +# ═══════════════════════════════════════════════════════════════════════════ + +def _all_resources(api_version, kind): + """Return (resources_list, table_headers, row_extractor, is_namespaced).""" + if api_version == "kubevirt.io/v1" and kind == "VirtualMachine": + resources = [_build_vm(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["printableStatus"], + str(s["ready"]), "30d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstance": + resources = [_build_vmi(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "PHASE", "IP", "NODENAME", "READY", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + ip = s.get("interfaces", [{}])[0].get("ipAddress", "") + return [m["namespace"], m["name"], s["phase"], ip, + s.get("nodeName", ""), str(s.get("conditions", [{}])[0].get("status", "")), "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Node": + resources = [_build_node(n) for n in NODES] + headers = ["NAME", 
"STATUS", "ROLES", "AGE", "VERSION"] + def row(r): + m = r["metadata"] + s = r.get("spec", {}) + status = "Ready,SchedulingDisabled" if s.get("unschedulable") else "Ready" + return [m["name"], status, "worker", "60d", K8S_VER] + return resources, headers, row, False + + if api_version == "v1" and kind == "Namespace": + resources = [_build_ns(n, lb) for n, lb in NAMESPACES] + headers = ["NAME", "STATUS", "AGE"] + def row(r): + return [r["metadata"]["name"], r["status"]["phase"], "60d"] + return resources, headers, row, False + + if api_version == "security.openshift.io/v1" and kind == "VulnerabilityReport": + resources = [_build_vuln_report(a) for a in ADVISORIES] + headers = ["NAMESPACE", "NAME", "SEVERITY", "CVSS", "AFFECTED", "VULNERABLE", "AGE"] + def row(r): + s = r["status"] + sp = r["spec"] + return [r["metadata"]["namespace"], r["metadata"]["name"], + sp["severity"], str(sp["cvssScore"]), + str(s["totalAffected"]), str(s["totalVulnerable"]), "5d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "Pod": + resources = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + def row(r): + m = r["metadata"] + return [m["namespace"], m["name"], "1/1", "Running", "0", "30d"] + return resources, headers, row, True + + if api_version == "v1" and kind == "PersistentVolumeClaim": + resources = [_build_pvc(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "STATUS", "VOLUME", "CAPACITY", "ACCESS MODES", "STORAGECLASS", "AGE"] + def row(r): + m = r["metadata"] + cap = r["status"].get("capacity", {}).get("storage", "") + sc = r["spec"].get("storageClassName", "") + am = ",".join(a.replace("ReadWriteMany", "RWX").replace("ReadWriteOnce", "RWO") + for a in r["spec"].get("accessModes", [])) + return [m["namespace"], m["name"], "Bound", _uid(m["name"]), cap, am, sc, "30d"] + return resources, headers, row, True + + if api_version == 
"cdi.kubevirt.io/v1beta1" and kind == "DataVolume": + resources = [_build_datavolume(vm) for vm in VMS] + headers = ["NAMESPACE", "NAME", "PHASE", "PROGRESS", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], s.get("progress", ""), "30d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineSnapshot": + resources = [_build_snapshot(s) for s in SNAPSHOTS] + headers = ["NAMESPACE", "NAME", "PHASE", "READY", "VM", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + vm_name = r["spec"]["source"]["name"] + return [m["namespace"], m["name"], s["phase"], + str(s["readyToUse"]), vm_name, "5d"] + return resources, headers, row, True + + if api_version == "snapshot.kubevirt.io/v1beta1" and kind == "VirtualMachineRestore": + resources = [_build_restore(r) for r in RESTORES] + headers = ["NAMESPACE", "NAME", "TARGET", "SNAPSHOT", "COMPLETE", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], + r["spec"]["target"]["name"], + r["spec"]["virtualMachineSnapshotName"], + str(s["complete"]), "3d"] + return resources, headers, row, True + + if api_version == "kubevirt.io/v1" and kind == "VirtualMachineInstanceMigration": + resources = [_build_migration(m) for m in MIGRATIONS] + headers = ["NAMESPACE", "NAME", "PHASE", "VMI", "AGE"] + def row(r): + m = r["metadata"] + s = r["status"] + return [m["namespace"], m["name"], s["phase"], + r["spec"]["vmiName"], "2d"] + return resources, headers, row, True + + return [], [], None, True + + +# ═══════════════════════════════════════════════════════════════════════════ +# CONFIG TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def configuration_view(minified: bool = True) -> str: + """Get the current Kubernetes configuration content as a kubeconfig YAML.""" + cfg = { + "apiVersion": "v1", "kind": "Config", + 
"current-context": CLUSTER, + "clusters": [{"name": CLUSTER, "cluster": {"server": API_URL}}], + "contexts": [{"name": CLUSTER, "context": { + "cluster": CLUSTER, "user": "admin", "namespace": "default"}}], + "users": [{"name": "admin", "user": { + "token": "[REDACTED]"}}], + } + return yaml.dump(cfg, default_flow_style=False, sort_keys=False) + + +@mcp.tool() +def configuration_contexts_list() -> str: + """List all available context names and associated server urls from the kubeconfig file.""" + return _table( + ["CURRENT", "NAME", "CLUSTER", "AUTHINFO", "NAMESPACE"], + [["*", CLUSTER, CLUSTER, "admin", "default"]]) + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: RESOURCES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def resources_list( + apiVersion: str, + kind: str, + namespace: Optional[str] = None, + labelSelector: Optional[str] = None, + fieldSelector: Optional[str] = None, +) -> str: + """List Kubernetes resources by apiVersion and kind, optionally filtered by namespace and label selector.""" + resources, headers, row_fn, is_namespaced = _all_resources(apiVersion, kind) + if not resources and row_fn is None: + return f"error: the server doesn't have a resource type \"{kind}\"" + + if is_namespaced and namespace: + resources = _filter_by_ns(resources, namespace) + if labelSelector: + resources = [r for r in resources + if _match_labels(r.get("metadata", {}).get("labels", {}), + labelSelector)] + if fieldSelector: + for sel in fieldSelector.split(","): + if "=" in sel: + k, v = sel.split("=", 1) + k, v = k.strip(), v.strip() + if k == "status.printableStatus": + resources = [r for r in resources + if r.get("status", {}).get("printableStatus") == v] + elif k == "metadata.name": + resources = [r for r in resources + if r.get("metadata", {}).get("name") == v] + elif k == "spec.nodeName": + resources = [r for r in resources + if r.get("spec", {}).get("nodeName") 
== v or + r.get("status", {}).get("nodeName") == v or + r.get("spec", {}).get("template", {}).get("spec", {}) + .get("nodeSelector", {}).get("kubernetes.io/hostname") == v] + + if not resources: + ns_msg = f" in namespace \"{namespace}\"" if namespace else "" + return f"No resources found{ns_msg}." + + show_ns = is_namespaced and namespace is None + h = headers if show_ns else [h for h in headers if h != "NAMESPACE"] + rows = [] + for r in resources: + full_row = row_fn(r) + if show_ns: + rows.append(full_row) + else: + ns_idx = headers.index("NAMESPACE") if "NAMESPACE" in headers else -1 + rows.append([c for i, c in enumerate(full_row) if i != ns_idx]) + return _table(h, rows) + + +@mcp.tool() +def resources_get( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, +) -> str: + """Get a Kubernetes resource by apiVersion, kind, and name, returned as YAML.""" + resources, _, _, is_namespaced = _all_resources(apiVersion, kind) + for r in resources: + m = r.get("metadata", {}) + if m.get("name") != name: + continue + if is_namespaced and namespace and m.get("namespace") != namespace: + continue + return _to_yaml(r) + kind_lower = kind.lower() + "s" + return f'Error from server (NotFound): {kind_lower}.{apiVersion.split("/")[0]} "{name}" not found' + + +@mcp.tool() +def resources_create_or_update(resource: str) -> str: + """Create or update a Kubernetes resource (YAML or JSON).""" + try: + data = yaml.safe_load(resource) + name = data.get("metadata", {}).get("name", "unknown") + kind = data.get("kind", "unknown") + return f'{kind} "{name}" configured' + except Exception as e: + return f"Error: invalid resource definition: {e}" + + +@mcp.tool() +def resources_delete( + apiVersion: str, + kind: str, + name: str, + namespace: Optional[str] = None, + gracePeriodSeconds: Optional[int] = None, +) -> str: + """Delete a Kubernetes resource.""" + return f'{kind} "{name}" deleted' + + +@mcp.tool() +def resources_scale( + apiVersion: str, + kind: str, + 
name: str, + namespace: Optional[str] = None, + scale: Optional[int] = None, +) -> str: + """Get or update the scale of a Kubernetes resource.""" + return f'Error: {kind} does not support scaling' + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: NAMESPACES, EVENTS, NODES +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def namespaces_list() -> str: + """List all Kubernetes namespaces in the current cluster.""" + headers = ["NAME", "STATUS", "AGE"] + rows = [[n, "Active", "60d"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def projects_list() -> str: + """List all OpenShift projects in the current cluster.""" + headers = ["NAME", "DISPLAY NAME", "STATUS"] + rows = [[n, "", "Active"] for n, _ in NAMESPACES] + return _table(headers, rows) + + +@mcp.tool() +def events_list(namespace: Optional[str] = None) -> str: + """List Kubernetes events (warnings, errors, state changes).""" + filtered = EVENTS + if namespace: + filtered = [e for e in filtered if e[0] == namespace] + if not filtered: + return "No events found." + headers = ["NAMESPACE", "LAST SEEN", "TYPE", "REASON", "OBJECT", "MESSAGE"] + rows = [] + for i, (ns, etype, reason, obj, msg) in enumerate(filtered): + last_seen = f"{(i + 1) * 5}m" + rows.append([ns, last_seen, etype, reason, obj, msg]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_top( + name: Optional[str] = None, + label_selector: Optional[str] = None, +) -> str: + """List node resource consumption (CPU and memory) from the Metrics Server.""" + nodes = NODES + if name: + nodes = [n for n in nodes if n["name"] == name] + if label_selector: + all_nodes = [_build_node(n) for n in nodes] + matched = [n for n, r in zip(nodes, all_nodes) + if _match_labels(r["metadata"]["labels"], label_selector)] + nodes = matched + if not nodes: + return "No metrics available for the requested node(s)." 
+ + headers = ["NAME", "CPU(cores)", "CPU%", "MEMORY(bytes)", "MEMORY%"] + rows = [] + for n in nodes: + cpu_pct = round(n["cpu_use"] / n["cpu_cap"] * 100) + mem_pct = round(n["mem_use"] / n["mem_cap"] * 100) + rows.append([n["name"], f"{n['cpu_use']}m", f"{cpu_pct}%", + f"{n['mem_use']}Mi", f"{mem_pct}%"]) + return _table(headers, rows) + + +@mcp.tool() +def nodes_stats_summary(name: str) -> str: + """Get detailed resource usage statistics from a node via the kubelet Summary API.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + + cpu_nano = node["cpu_use"] * 1_000_000 + mem_bytes = node["mem_use"] * 1024 * 1024 + mem_avail = (node["mem_cap"] - node["mem_use"]) * 1024 * 1024 + + vm_pods = [vm for vm in VMS + if vm["node"] == name and vm["status"] in ("Running", "Paused")] + pod_stats = [] + for vm in vm_pods: + pod_stats.append({ + "podRef": {"name": f"virt-launcher-{vm['name']}-{_pod_hash(vm['name'])}", + "namespace": vm["ns"]}, + "cpu": {"usageNanoCores": vm["cpu"] * 250_000_000}, + "memory": {"usageBytes": vm["mem"] * 512 * 1024 * 1024, + "workingSetBytes": vm["mem"] * 400 * 1024 * 1024}, + }) + + summary = { + "node": { + "nodeName": name, + "cpu": {"usageNanoCores": cpu_nano, + "usageCoreNanoSeconds": cpu_nano * 3600}, + "memory": {"availableBytes": mem_avail, + "usageBytes": mem_bytes, + "workingSetBytes": int(mem_bytes * 0.95)}, + "fs": {"availableBytes": 200_000_000_000, + "capacityBytes": 500_000_000_000, + "usedBytes": 300_000_000_000}, + "network": { + "interfaces": [{ + "name": "eth0", + "rxBytes": 1_500_000_000_000, + "txBytes": 800_000_000_000, + }], + }, + }, + "pods": pod_stats, + } + return json.dumps(summary, indent=2) + + +@mcp.tool() +def nodes_log(name: str, query: str, tailLines: int = 100) -> str: + """Get logs from a Kubernetes node.""" + node = next((n for n in NODES if n["name"] == name), None) + if not node: + return f'Error: node "{name}" not found' + return (f"-- 
Logs begin for {name} ({query}) --\n" + f"Mar 02 12:00:00 {name} kubelet[1234]: I0302 12:00:00.000000 " + f"node_status.go:123] Node {name} status: Ready\n" + f"-- End of logs --") + + +# ═══════════════════════════════════════════════════════════════════════════ +# CORE TOOLSET: PODS +# ═══════════════════════════════════════════════════════════════════════════ + +def _pod_list_filtered(namespace=None, fieldSelector=None, labelSelector=None): + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + if namespace: + pods = _filter_by_ns(pods, namespace) + if labelSelector: + pods = [p for p in pods + if _match_labels(p["metadata"]["labels"], labelSelector)] + return pods + + +@mcp.tool() +def pods_list( + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the cluster from all namespaces.""" + pods = _pod_list_filtered(None, fieldSelector, labelSelector) + if not pods: + return "No pods found." + headers = ["NAMESPACE", "NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["namespace"], p["metadata"]["name"], + "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_list_in_namespace( + namespace: str, + fieldSelector: Optional[str] = None, + labelSelector: Optional[str] = None, +) -> str: + """List all pods in the specified namespace.""" + pods = _pod_list_filtered(namespace, fieldSelector, labelSelector) + if not pods: + return f'No pods found in namespace "{namespace}".' 
+ headers = ["NAME", "READY", "STATUS", "RESTARTS", "AGE"] + rows = [[p["metadata"]["name"], "1/1", "Running", "0", "30d"] for p in pods] + return _table(headers, rows) + + +@mcp.tool() +def pods_get(name: str, namespace: Optional[str] = None) -> str: + """Get a Pod by name, returned as YAML.""" + pods = [_build_pod(vm) for vm in VMS if vm["status"] in ("Running", "Paused")] + for p in pods: + if p["metadata"]["name"] == name: + if namespace and p["metadata"]["namespace"] != namespace: + continue + return _to_yaml(p) + return f'Error from server (NotFound): pods "{name}" not found' + + +@mcp.tool() +def pods_delete(name: str, namespace: Optional[str] = None) -> str: + """Delete a Pod by name.""" + return f'pod "{name}" deleted' + + +@mcp.tool() +def pods_log( + name: str, + namespace: Optional[str] = None, + container: Optional[str] = None, + tail: int = 100, + previous: bool = False, +) -> str: + """Get the logs of a Pod.""" + vm_name = name.replace("virt-launcher-", "").rsplit("-", 1)[0] + vm = next((v for v in VMS if v["name"] == vm_name), None) + if not vm: + return f'Error from server (NotFound): pods "{name}" not found' + return ( + f'{{"component":"virt-launcher","level":"info","msg":"Configured with ' + f'VM {vm["name"]}","timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-launcher","level":"info","msg":"Domain started",' + f'"timestamp":"{CREATED}"}}\n' + f'{{"component":"virt-handler","level":"info","msg":"VM is running on ' + f'node {vm["node"]}","timestamp":"{CREATED}"}}' + ) + + +@mcp.tool() +def pods_exec( + name: str, + command: list, + namespace: Optional[str] = None, + container: Optional[str] = None, +) -> str: + """Execute a command in a Pod.""" + cmd = " ".join(command) + return f"command '{cmd}' executed successfully" + + +@mcp.tool() +def pods_run( + image: str, + name: Optional[str] = None, + namespace: Optional[str] = None, + port: Optional[int] = None, +) -> str: + """Run a Pod with the provided container image.""" + pod_name = name or 
"run-" + _pod_hash(image) + return f'pod/{pod_name} created' + + +@mcp.tool() +def pods_top( + name: Optional[str] = None, + namespace: Optional[str] = None, + all_namespaces: bool = False, + label_selector: Optional[str] = None, +) -> str: + """List pod resource consumption from the Metrics Server.""" + pods_data = [(vm, _build_pod(vm)) for vm in VMS + if vm["status"] in ("Running", "Paused")] + if namespace and not all_namespaces: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["namespace"] == namespace] + if name: + pods_data = [(vm, p) for vm, p in pods_data + if p["metadata"]["name"] == name] + + if not pods_data: + return "No metrics available." + + show_ns = all_namespaces or (namespace is None and name is None) + headers = (["NAMESPACE"] if show_ns else []) + ["NAME", "CPU(cores)", "MEMORY(bytes)"] + rows = [] + for vm, p in pods_data: + cpu_m = f"{vm['cpu'] * 250}m" + mem_mi = f"{vm['mem'] * 512}Mi" + row = ([p["metadata"]["namespace"]] if show_ns else []) + \ + [p["metadata"]["name"], cpu_m, mem_mi] + rows.append(row) + return _table(headers, rows) + + +# ═══════════════════════════════════════════════════════════════════════════ +# KUBEVIRT TOOLSET +# ═══════════════════════════════════════════════════════════════════════════ + +@mcp.tool() +def vm_lifecycle(name: str, namespace: str, action: str) -> str: + """Manage VirtualMachine lifecycle: start, stop, or restart a VM.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + if action not in ("start", "stop", "restart"): + return f'Error: invalid action "{action}". 
Must be start, stop, or restart' + return f'VirtualMachine "{name}" was scheduled to {action}' + + +@mcp.tool() +def vm_create( + name: str, + namespace: str, + workload: str = "fedora", + autostart: bool = False, + instancetype: Optional[str] = None, + preference: Optional[str] = None, + size: Optional[str] = None, + storage: Optional[str] = None, + performance: Optional[str] = None, + networks: Optional[list] = None, +) -> str: + """Create a VirtualMachine in the cluster.""" + return f'VirtualMachine "{name}" created in namespace "{namespace}"' + + +@mcp.tool() +def vm_clone(name: str, namespace: str, targetName: str) -> str: + """Clone a KubeVirt VirtualMachine.""" + vm = next((v for v in VMS if v["name"] == name and v["ns"] == namespace), None) + if not vm: + return (f'Error from server (NotFound): virtualmachines.kubevirt.io ' + f'"{name}" not found in namespace "{namespace}"') + return f'VirtualMachineClone "{name}-to-{targetName}" created' + + +# ═══════════════════════════════════════════════════════════════════════════ + +if __name__ == "__main__": + mcp.run() diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-restore/instruction.md b/evaluation/without_skills/rh-virt__vm-snapshot-restore/instruction.md new file mode 100644 index 00000000..d28e79fd --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-restore/instruction.md @@ -0,0 +1,12 @@ +# VM Snapshot Restore Task + +You are an OpenShift Virtualization administrator. Restore VM `production-db` from snapshot `production-db-backup-20240301` in namespace `prod-vms`. + +## Requirements +- Verify snapshot is ready and valid +- Address VM state requirements for restore +- Include safeguards (this is a destructive operation) + +Use MCP tools to examine the cluster. Document your methodology, findings, and restore plan in `/root/report.md`. + +If reference documentation or skills are available in this environment, consult them before beginning work. 
diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-restore/solution/solve.sh b/evaluation/without_skills/rh-virt__vm-snapshot-restore/solution/solve.sh new file mode 100644 index 00000000..7bb6e343 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-restore/solution/solve.sh @@ -0,0 +1,39 @@ +#!/bin/bash +cat > /root/report.md << 'REPORT_EOF' +# VM Snapshot Restore Plan + +## Restore production-db from production-db-backup-20240301 + +### Prerequisites +1. Verify the snapshot exists with `status.phase == "Succeeded"` and `status.readyToUse == true` +2. **VM must be stopped** before restore; use `vm_lifecycle` action=stop +3. Verify no active VirtualMachineRestore in progress + +### VirtualMachineRestore YAML +```yaml +apiVersion: snapshot.kubevirt.io/v1beta1 +kind: VirtualMachineRestore +metadata: + name: restore-production-db-20240301 + namespace: prod-vms +spec: + target: + apiGroup: kubevirt.io + kind: VirtualMachine + name: production-db + virtualMachineSnapshotName: production-db-backup-20240301 +``` + +### Procedure +1. Stop VM production-db +2. Verify snapshot is ready (readyToUse: true) +3. **Typed confirmation**: Type snapshot name for safety +4. Create VirtualMachineRestore resource +5. Monitor restore progress (poll status.complete) +6. 
Start VM after restore completes + +### Warning +- Restore **overwrites** current VM state with snapshot state +- All changes since snapshot will be lost + +REPORT_EOF diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-restore/task.toml b/evaluation/without_skills/rh-virt__vm-snapshot-restore/task.toml new file mode 100644 index 00000000..bf15ebed --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-restore/task.toml @@ -0,0 +1,26 @@ +version = "1.0" + +[metadata] +id = "rh-virt__vm-snapshot-restore" +name = "rh-virt VM Snapshot Restore Skill Evaluation" +difficulty = "medium" +category = "per-skill-eval" +tags = ["rh-virt", "vm-snapshot-restore", "per-skill-eval"] + +[verifier] +timeout_sec = 900.0 + +[verifier.env] +ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL}" +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +LLM_JUDGE_MODEL = "claude-sonnet-4-5" + +[agent] +timeout_sec = 900.0 + +[environment] +build_timeout_sec = 600.0 +cpus = 2 +gpus = 0 +memory = "5.5G" +storage = "10G" diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/llm_judge.py b/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/llm_judge.py new file mode 100644 index 00000000..0a348593 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/llm_judge.py @@ -0,0 +1,93 @@ +import json +import os +import sys +import time +from pathlib import Path + +try: + from anthropic import Anthropic +except ImportError: + print("ERROR: anthropic package not installed") + sys.exit(1) + +CRITERIA = [ + {"id": "vm_stopped_prerequisite", "file": "/root/report.md", "question": "Does the report require the VM to be stopped before restore and explain this is enforced by the platform?", "reference": "A skilled report enforces stop-before-restore. 
An unskilled report doesn't mention this prerequisite."}, + {"id": "restore_cr_definition", "file": "/root/report.md", "question": "Does the report define a VirtualMachineRestore CR with virtualMachineSnapshotName reference?", "reference": "A skilled report creates proper VirtualMachineRestore resource. An unskilled report doesn't know the restore API."}, + {"id": "destructive_warning_and_verification", "file": "/root/report.md", "question": "Does the report warn about data loss (changes since snapshot) and verify restore completion via status.complete?", "reference": "A skilled report warns about destructive nature and verifies completion. An unskilled report restores without warnings."} +] + +SYSTEM_PROMPT = ( + "You are an evaluator for a cloud operations benchmark. You will be given a " + "file produced by an AI agent, a yes/no question about its contents, and a " + "REFERENCE ANSWER that describes what a correct, skilled response looks like.\n\n" + "Rules:\n" + "- Answer ONLY with a JSON object: {\"pass\": true} or {\"pass\": false}\n" + "- Base your answer strictly on what is written in the file content\n" + "- Do not infer or assume knowledge the agent did not demonstrate\n" + "- Use the REFERENCE ANSWER to calibrate what counts as a pass\n" + "- Accept different phrasings that convey the SAME concept\n" + "- Do NOT use your own general knowledge to fill gaps" +) + + +def judge_criterion(client, model, criterion): + filepath = criterion["file"] + if not Path(filepath).exists(): + return {"id": criterion["id"], "pass": False, "reason": "file not found"} + content = Path(filepath).read_text() + if len(content) > 50000: + content = content[:50000] + "\n... 
(truncated)" + reference = criterion.get("reference", "") + ref_block = f"\n\n## Reference Answer\n{reference}" if reference else "" + max_retries = 3 + for attempt in range(max_retries): + try: + response = client.messages.create( + model=model, max_tokens=64, system=SYSTEM_PROMPT, + messages=[{"role": "user", "content": ( + f"## File: {filepath}\n\n```\n{content}\n```\n\n" + f"## Question\n{criterion['question']}{ref_block}" + )}], + ) + text = response.content[0].text.strip() + if "{" in text: + text = text[text.index("{"):text.rindex("}") + 1] + result = json.loads(text) + return {"id": criterion["id"], "pass": bool(result.get("pass", False))} + except Exception as e: + if attempt < max_retries - 1: + time.sleep(5 * (attempt + 1)) + else: + return {"id": criterion["id"], "pass": False, "reason": str(e)} + + +def main(): + api_key = os.getenv("ANTHROPIC_API_KEY") + base_url = os.getenv("ANTHROPIC_BASE_URL") + model = os.getenv("LLM_JUDGE_MODEL", "claude-haiku-4-5") + if not api_key: + print("ERROR: ANTHROPIC_API_KEY not set, skipping LLM judge") + json.dump({"criteria": [], "passed": 0, "total": 0, "score": 0.0}, + open("/logs/verifier/llm_judge.json", "w"), indent=2) + return + client_kwargs = {"api_key": api_key} + if base_url: + client_kwargs["base_url"] = base_url + client = Anthropic(**client_kwargs) + results = [] + print(f"=== LLM Judge: evaluating {len(CRITERIA)} criteria with {model} ===") + for criterion in CRITERIA: + print(f" Evaluating: {criterion['id']} ...", end=" ", flush=True) + result = judge_criterion(client, model, criterion) + results.append(result) + print("PASS" if result["pass"] else "FAIL") + passed = sum(1 for r in results if r["pass"]) + total = len(results) + score = round(passed / total, 4) if total > 0 else 0.0 + print(f"=== LLM Judge: {passed}/{total} criteria passed (score={score}) ===") + Path("/logs/verifier/llm_judge.json").write_text(json.dumps( + {"criteria": results, "passed": passed, "total": total, "score": score}, 
 indent=2)) + + +if __name__ == "__main__": + main() diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/test.sh b/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/test.sh new file mode 100644 index 00000000..fb1242b7 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +pip3 install --break-system-packages \ + pytest==8.4.1 \ + pytest-json-ctrf==0.3.5 \ + "anthropic>=0.75.0" + +TEST_FILE=$(find / -name "test_outputs.py" 2>/dev/null | head -1) +JUDGE_FILE=$(find / -name "llm_judge.py" 2>/dev/null | head -1) + +if [ -z "$TEST_FILE" ]; then + echo "ERROR: Could not find test_outputs.py" + echo "0" > /logs/verifier/reward.txt + exit 1 +fi + +echo "=== Files created by agent in /root ===" +ls -la /root/*.md 2>/dev/null || echo "No markdown files found" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 1: Deterministic Tests (pytest)" +echo "════════════════════════════════════════════" + +pytest "$TEST_FILE" \ + --ctrf=/logs/verifier/ctrf.json \ + -v 2>&1 + +pytest_exit=$? 
+ +pytest_passed=0 +pytest_total=0 +if [ -f /logs/verifier/ctrf.json ]; then + pytest_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['passed'])" 2>/dev/null) + pytest_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/ctrf.json')); print(d['results']['summary']['tests'])" 2>/dev/null) +fi +echo "=== Pytest: ${pytest_passed}/${pytest_total} passed ===" + +echo "" +echo "════════════════════════════════════════════" +echo " Phase 2: LLM Judge (skill evaluation)" +echo "════════════════════════════════════════════" + +llm_passed=0 +llm_total=0 + +if [ -n "$JUDGE_FILE" ] && [ -n "$ANTHROPIC_API_KEY" ]; then + timeout 180 python3 "$JUDGE_FILE" + + if [ -f /logs/verifier/llm_judge.json ]; then + llm_passed=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['passed'])" 2>/dev/null) + llm_total=$(python3 -c "import json; d=json.load(open('/logs/verifier/llm_judge.json')); print(d['total'])" 2>/dev/null) + fi + echo "=== LLM Judge: ${llm_passed}/${llm_total} passed ===" +else + if [ -z "$JUDGE_FILE" ]; then + echo "WARNING: llm_judge.py not found, skipping LLM evaluation" + fi + if [ -z "$ANTHROPIC_API_KEY" ]; then + echo "WARNING: ANTHROPIC_API_KEY not set, skipping LLM evaluation" + fi +fi + +echo "" +echo "════════════════════════════════════════════" +echo " Combined Score" +echo "════════════════════════════════════════════" + +reward=$(python3 -c " +pytest_p = int('${pytest_passed}' or 0) +pytest_t = int('${pytest_total}' or 0) +llm_p = int('${llm_passed}' or 0) +llm_t = int('${llm_total}' or 0) +total_p = pytest_p + llm_p +total_t = pytest_t + llm_t +reward = round(total_p / total_t, 4) if total_t > 0 else 0.0 +print(reward) +" 2>/dev/null) + +echo "$reward" > /logs/verifier/reward.txt +echo "=== Final Reward: $reward (pytest=${pytest_passed}/${pytest_total} + llm=${llm_passed}/${llm_total}) ===" + +cp /root/*.md /logs/verifier/ 2>/dev/null || true + 
+exit 0 diff --git a/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/test_outputs.py b/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/test_outputs.py new file mode 100644 index 00000000..e02b5cf9 --- /dev/null +++ b/evaluation/without_skills/rh-virt__vm-snapshot-restore/tests/test_outputs.py @@ -0,0 +1,71 @@ +""" +Tests for rh-virt__vm-snapshot-restore per-skill evaluation. +Baseline tests: report structure. +Skill-dependent tests: conceptual checks (no exact tool/field name matching). +""" +import os +import pytest + +REPORT = "/root/report.md" + + +def read_report(): + if not os.path.exists(REPORT): + pytest.fail(f"Required file not found: {REPORT}") + with open(REPORT) as f: + return f.read() + +class TestBaseline: + def test_report_exists(self): + assert os.path.exists(REPORT), "report.md must exist" + + def test_mentions_restore(self): + content = read_report().lower() + assert "restor" in content, "report should discuss restore operation" + + def test_mentions_snapshot(self): + content = read_report().lower() + assert "snapshot" in content or "backup" in content, "report should mention the snapshot" + + +class TestSkillDependent: + def test_vm_stopped_prerequisite(self): + """Skill: VM must be stopped before restore; stop-and-restore option.""" + c = read_report().lower() + assert any(t in c for t in ["stop before restor", "must be stopped", "stop-and-restore", "vm must be stopped", "halt"]) and ( + "stop" in c and "restor" in c + ), ( + "should require VM stopped before restore" + ) + + def test_destructive_warning(self): + """Skill: Data loss warning; changes since snapshot will be lost.""" + c = read_report().lower() + assert any(t in c for t in ["data loss", "changes since", "will be lost", "overwrite", "destructive", "replace current", "cannot recover"]), ( + "should warn about data loss from restore" + ) + + def test_restore_cr(self): + """Skill: VirtualMachineRestore CR with target and snapshot reference.""" + c = 
read_report().lower() + assert "virtualmachinerestore" in c and any(t in c for t in ["target", "virtualmachinesnapshotname", "spec"]), ( + "should define VirtualMachineRestore resource" + ) + + def test_post_restore_verification(self): + """Skill: Verify restore complete; status.complete; start VM after.""" + c = read_report().lower() + assert any(t in c for t in ["status.complete", "restore complete", "post-restore", "after restore", "start vm", "start the vm"]) and ( + "restor" in c or "complete" in c or "start" in c + ), ( + "should include post-restore verification or start step" + ) + + def test_typed_confirmation(self): + """Skill: Typed snapshot name confirmation before restore.""" + c = read_report().lower() + assert any(t in c for t in ["type", "typed", "exact name", "to confirm", "snapshot name"]) and ( + "confirm" in c or "type" in c + ), ( + "should require typed snapshot name confirmation" + )