
chore(fix): Fix update model evaluation script #131

Open
mtodor wants to merge 2 commits into main from mtodor/fix-update-evals-script

Conversation

@mtodor (Collaborator) commented May 7, 2026

Description

The weekly model evaluation workflow (example failure) started failing after the mcpchecker bump from v0.0.17 to v0.0.18 (commit 1ae9a26). The new version changed the JSON output format from a bare array ([{...}, ...]) to a wrapper object ({"results": [...], "summary": {...}}), breaking all jq queries in scripts/update-model-evaluation.sh.

This PR:

  • Updates all 5 jq queries in update-model-evaluation.sh to access results via .results[] instead of .[] (see the sketch after this list)
  • Updates jq example commands in e2e-tests/README.md to match the new format
  • Adds a validation step to the E2E PR workflow (.github/workflows/e2e.yml) that runs update-model-evaluation.sh against real mcpchecker output, catching format regressions before they reach the weekly workflow
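
For illustration, a minimal sketch of the kind of query change involved (the .task field and the output file name are hypothetical, not copied from the actual script):

    # mcpchecker v0.0.17 emitted a bare array, so queries iterated the top level:
    #   [{"task": "...", "passed": true}, ...]
    jq -r '.[] | .task' mcpchecker-out.json

    # v0.0.18 wraps the same data in an object, so queries must go through .results:
    #   {"results": [{"task": "...", "passed": true}, ...], "summary": {...}}
    jq -r '.results[] | .task' mcpchecker-out.json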

Validation

  • E2E tests pass in CI with the new validation step
  • Verified all jq queries against output from the new version of mcpchecker

AI-assisted development prompts

This PR was developed with AI assistance. Below are the prompts used:

Prompt: "I have noticed that weekly module evaluation update is failing. Here's one job example: https://github.com/stackrox/stackrox-mcp/actions/runs/25362392033 - It could be related to update of mcpchecker. Investigate the problem. You can download artifacts from GitHub from that workflow and run locally E2E tests so that you can see output of new version of with new version of MCP checker and then compare results and create a plan what is supposed to be done. Please put everything in beads and output plan for review."
Response: Downloaded CI artifacts, explored the codebase, identified that mcpchecker v0.0.18 changed JSON output from a bare array to {"results": [...], "summary": {...}}. Ran E2E tests locally (11/11 pass) and reproduced the jq error. Created a plan to fix all 5 jq queries in update-model-evaluation.sh and 2 jq examples in e2e-tests/README.md.
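
A rough reproduction of the failure mode (field names hypothetical): on the v0.0.18 wrapper object, '.[]' iterates the object's values (the results array and the summary object) instead of the individual result entries, so field lookups error out:

    echo '{"results":[{"task":"t1"}],"summary":{}}' | jq '.[] | .task'
    # jq: error (at <stdin>:0): Cannot index array with "task"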

Prompt: "File 'mcpchecker-stackrox-mcp-e2e-out.json' should not be committed. It's ignored in .gitignore. (step 3)"
Response: Removed step 3 from the plan — verified the file is gitignored via .gitignore:32: /e2e-tests/**/mcpchecker. Plan reduced to 2 file changes.
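
For reference, one way to confirm that a path is covered by .gitignore (the path below is illustrative):

    # Prints the matching rule as "<source>:<line>:<pattern> <path>" when the path is ignored.
    git check-ignore -v e2e-tests/mcpchecker-stackrox-mcp-e2e-out.json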

Prompt: "We have a few smoke tests. Would be possible to add these jq queries in a smoke test so that we can ensure that we are not breaking some scripts."
Response: Explored existing smoke test infrastructure (shell-based smoke tests, Go smoke tests, WireMock tests). Proposed three options for where to place the test.

Prompt: (Selected: "New script scripts/smoke-test-eval-script.sh")
Response: Designed a plan for a standalone shell script with a minimal JSON fixture, Makefile target, and CI wiring in test.yml.

Prompt: "Actually I don't like this plan. Let's do something else. The problem that I see is that we need to add fixtures to test jq, but actually we are more interested that new version of MCP checker does not introduce regression and make our defined queries obsolete. How about adding step to already existing E2E tests where we have MCP checker output to check after output is created that its format is not breaking other scripts. In file: '.github/workflows/e2e.yml' - What is your thought about this proposal? Make a critical analysis of this proposal."
Response: Provided a critical analysis with strengths (tests real output, no fixtures to maintain, no extra cost) and considerations (step ordering, doc file revert needed). Recommended the approach as superior to fixtures.

Prompt: (Selected: "Just the script validation")
Response: Revised the plan to add a single validation step to e2e.yml that runs update-model-evaluation.sh against real mcpchecker output and reverts the doc file afterward.

Prompt: "This looks good. The only thing that I would like to change is to add comments to YAML file that provide information what we are achieving there and why we are doing checkout so that it's clear for our reviewer or another developer what's going on."
Response: Added descriptive YAML comments explaining the purpose of the validation step and why git checkout reverts the doc file.
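
A rough sketch of what the resulting step could look like (the step name, doc path, and script invocation are assumptions for illustration, not the exact contents of the PR):

    # Run the evaluation-update script against the real mcpchecker output so that
    # a future output-format change is caught here instead of in the weekly job.
    - name: Validate update-model-evaluation.sh against mcpchecker output
      run: |
        ./scripts/update-model-evaluation.sh
        # The script rewrites the evaluation doc as a side effect; revert it so
        # this PR-triggered job leaves no uncommitted changes behind.
        git checkout -- docs/model-evaluation.md  # hypothetical doc path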

@mtodor mtodor requested a review from janisz as a code owner May 7, 2026 17:04
@codecov-commenter commented May 7, 2026

❌ 2 Tests Failed:

Tests completed  Failed  Passed  Skipped
361              2       359     12
The 2 ❄️ flaky test(s):
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 36 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 36 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3


@github-actions bot commented May 7, 2026

E2E Test Results

Commit: 4aa5839
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)

Tasks:      11/11 passed (100.00%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~52804 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  13838 tokens
  Output: 21302 tokens
Judge used tokens:
  Input:  66139 tokens
  Output: 57725 tokens

