
chore(fix): Fix update model evaluation script #131

Open
mtodor wants to merge 2 commits into main from mtodor/fix-update-evals-script

Conversation

@mtodor (Collaborator) commented May 7, 2026

Description

The weekly model evaluation workflow (example failure) started failing after the mcpchecker bump from v0.0.17 to v0.0.18 (commit 1ae9a26). The new version changed the JSON output format from a bare array ([{...}, ...]) to a wrapper object ({"results": [...], "summary": {...}}), breaking all jq queries in scripts/update-model-evaluation.sh.

This PR:

  • Updates all 5 jq queries in update-model-evaluation.sh to access results via .results[] instead of .[] (see the sketch after this list)
  • Updates jq example commands in e2e-tests/README.md to match the new format
  • Adds a validation step to the E2E PR workflow (.github/workflows/e2e.yml) that runs update-model-evaluation.sh against real mcpchecker output, catching format regressions before they reach the weekly workflow
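
For illustration, a minimal sketch of the kind of query change involved (the .task field and the output file name are hypothetical, not copied from the actual script):

    # mcpchecker v0.0.17 emitted a bare array, so queries iterated the top level:
    #   [{"task": "...", "passed": true}, ...]
    jq -r '.[] | .task' mcpchecker-out.json

    # v0.0.18 wraps the same data in an object, so queries must go through .results:
    #   {"results": [{"task": "...", "passed": true}, ...], "summary": {...}}
    jq -r '.results[] | .task' mcpchecker-out.json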

Validation

  • E2E tests pass in CI with the new validation step
  • Verified all jq queries against output from the new version of mcpchecker

AI-assisted development prompts

This PR was developed with AI assistance. Below are the prompts used:

Prompt: "I have noticed that weekly module evaluation update is failing. Here's one job example: https://github.com/stackrox/stackrox-mcp/actions/runs/25362392033 - It could be related to update of mcpchecker. Investigate the problem. You can download artifacts from GitHub from that workflow and run locally E2E tests so that you can see output of new version of with new version of MCP checker and then compare results and create a plan what is supposed to be done. Please put everything in beads and output plan for review."
Response: Downloaded CI artifacts, explored the codebase, identified that mcpchecker v0.0.18 changed JSON output from a bare array to {"results": [...], "summary": {...}}. Ran E2E tests locally (11/11 pass) and reproduced the jq error. Created a plan to fix all 5 jq queries in update-model-evaluation.sh and 2 jq examples in e2e-tests/README.md.
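
A rough reproduction of the failure mode (field names hypothetical): on the v0.0.18 wrapper object, '.[]' iterates the object's values (the results array and the summary object) instead of the individual result entries, so field lookups error out:

    echo '{"results":[{"task":"t1"}],"summary":{}}' | jq '.[] | .task'
    # jq: error (at <stdin>:0): Cannot index array with "task"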

Prompt: "File 'mcpchecker-stackrox-mcp-e2e-out.json' should not be committed. It's ignored in .gitignore. (step 3)"
Response: Removed step 3 from the plan — verified the file is gitignored via .gitignore:32: /e2e-tests/**/mcpchecker. Plan reduced to 2 file changes.
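
For reference, one way to confirm that a path is covered by .gitignore (the path below is illustrative):

    # Prints the matching rule as "<source>:<line>:<pattern> <path>" when the path is ignored.
    git check-ignore -v e2e-tests/mcpchecker-stackrox-mcp-e2e-out.json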

Prompt: "We have a few smoke tests. Would be possible to add these jq queries in a smoke test so that we can ensure that we are not breaking some scripts."
Response: Explored existing smoke test infrastructure (shell-based smoke tests, Go smoke tests, WireMock tests). Proposed three options for where to place the test.

Prompt: (Selected: "New script scripts/smoke-test-eval-script.sh")
Response: Designed a plan for a standalone shell script with a minimal JSON fixture, Makefile target, and CI wiring in test.yml.

Prompt: "Actually I don't like this plan. Let's do something else. The problem that I see is that we need to add fixtures to test jq, but actually we are more interested that new version of MCP checker does not introduce regression and make our defined queries obsolete. How about adding step to already existing E2E tests where we have MCP checker output to check after output is created that its format is not breaking other scripts. In file: '.github/workflows/e2e.yml' - What is your thought about this proposal? Make a critical analysis of this proposal."
Response: Provided a critical analysis with strengths (tests real output, no fixtures to maintain, no extra cost) and considerations (step ordering, doc file revert needed). Recommended the approach as superior to fixtures.

Prompt: (Selected: "Just the script validation")
Response: Revised the plan to add a single validation step to e2e.yml that runs update-model-evaluation.sh against real mcpchecker output and reverts the doc file afterward.

Prompt: "This looks good. The only thing that I would like to change is to add comments to YAML file that provide information what we are achieving there and why we are doing checkout so that it's clear for our reviewer or another developer what's going on."
Response: Added descriptive YAML comments explaining the purpose of the validation step and why git checkout reverts the doc file.
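
A rough sketch of what the resulting step could look like (the step name, doc path, and script invocation are assumptions for illustration, not the exact contents of the PR):

    # Run the evaluation-update script against the real mcpchecker output so that
    # a future output-format change is caught here instead of in the weekly job.
    - name: Validate update-model-evaluation.sh against mcpchecker output
      run: |
        ./scripts/update-model-evaluation.sh
        # The script rewrites the evaluation doc as a side effect; revert it so
        # this PR-triggered job leaves no uncommitted changes behind.
        git checkout -- docs/model-evaluation.md  # hypothetical doc path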

@mtodor mtodor requested a review from janisz as a code owner May 7, 2026 17:04
@codecov-commenter commented May 7, 2026

❌ 2 Tests Failed:

Tests completed  Failed  Passed  Skipped
361              2       359     12
The 2 ❄️ flaky test(s):
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 36 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 36 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3


@github-actions bot commented May 7, 2026

E2E Test Results

Commit: 4aa5839
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)

Tasks:      11/11 passed (100.00%)
Assertions: 32/32 passed (100.00%)
Tokens:     ~52804 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  13838 tokens
  Output: 21302 tokens
Judge used tokens:
  Input:  66139 tokens
  Output: 57725 tokens

