Skip to content

[SDK] CD runs#1541

Merged
chelojimenez merged 29 commits intomainfrom
sdk-ci-cd
Mar 9, 2026
Merged

[SDK] CD runs#1541
chelojimenez merged 29 commits intomainfrom
sdk-ci-cd

Conversation

@chelojimenez
Copy link
Copy Markdown
Contributor

@chelojimenez chelojimenez commented Mar 6, 2026

Note

Medium Risk
Moderate UI and data-display expansion plus changes to eval result derivation (honoring resultSource: "reported") and widget replay/security defaults; regressions could affect pass/fail reporting or widget rendering in the Inspector.

Overview
Updates SDK docs to cover MCPJam eval auto-reporting + manual reporting APIs, adds a new Eval Reporting reference page, and refreshes EvalTest/EvalSuite docs to reflect updated configs, return types, run options, and MCPJam upload behavior.

Adds a new Inspector CI Evals tab (hash-routed) for viewing SDK-ingested eval suites/runs in a workspace, including CI metadata display, read-only run/test drilldowns, and better iteration/trace UX (formatted/raw tool calls, model resolution, and widget snapshot replay via new WidgetReplay + trace adapter). Several chat/widget components now support a non-interactive mode, persisted render overrides (CSP/permissions/permissive/border), and cached-widget replay hardening (force permissive for cached HTML). Also adds “Copy markdown for server evals” from the server connection menu and threads workspace info into Settings.

Written by Cursor Bugbot for commit 2b3e770. This will update automatically on new commits. Configure here.

@chelojimenez
Copy link
Copy Markdown
Contributor Author

chelojimenez commented Mar 6, 2026

Snyk checks have passed. No issues have been found so far.

Status Scanner Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@chelojimenez chelojimenez marked this pull request as ready for review March 9, 2026 01:44
@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 9, 2026
Comment thread mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx Outdated
Comment thread mcpjam-inspector/client/src/components/evals/iteration-details.tsx
@dosubot dosubot Bot added the documentation Improvements or additions to documentation label Mar 9, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 9, 2026

Walkthrough

This change adds a full eval-reporting system to the SDK (reportEvalResults, reportEvalResultsSafely, createEvalRunReporter, uploadEvalArtifact), new eval-reporting types and artifact parsers (JUnit, Jest, Vitest), widget snapshot capture and OpenAI-compat injection, and EvalAgent/PromptOptions types. SDK exports and mapping utilities were extended. Docs and skill reference were added. The inspector UI receives a CI Evals feature set (new routes, tabs, components, adapters, and trace/widget replay improvements), plus numerous UI, hook, and test updates to surface CI metadata and workspace-scoped API keys.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (13)
sdk/src/mcp-client-manager/MCPClientManager.ts-431-440 (1)

431-440: ⚠️ Potential issue | 🟡 Minor

Consider deep cloning or readonly type for getToolMetadata() to prevent accidental cache mutation.

The spread operator only clones the first level. Nested objects and arrays inside _meta remain shared with toolsMetadataCache, so a caller mutating the returned value could inadvertently taint cached data. While current usage is read-only, a deep clone or readonly return type would be more defensive.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/mcp-client-manager/MCPClientManager.ts` around lines 431 - 440,
getToolMetadata currently returns a shallow copy via spread which leaves nested
objects/arrays aliased to toolsMetadataCache; change getToolMetadata to return a
deep-cloned or readonly structure to prevent accidental mutation of the cache:
update the implementation in getToolMetadata to deep clone the metadata (e.g.,
use structuredClone(metadata) with a safe fallback like
JSON.parse(JSON.stringify(metadata)) or lodash/cloneDeep) and optionally change
the signature to return Readonly<Record<string, unknown>> so callers cannot
mutate the result; ensure metadataMap?.get(toolName) is passed through the
deep-clone/readonly conversion before returning.
docs/sdk/reference/eval-reporting.mdx-49-50 (1)

49-50: ⚠️ Potential issue | 🟡 Minor

The sample output misidentifies runId

abc123 in this example is the commit SHA from Line 45, not the created output.runId. Unless externalRunId is set explicitly, this comment is inaccurate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/sdk/reference/eval-reporting.mdx` around lines 49 - 50, Update the
inline comment after the console.log sample so it doesn't claim `abc123` is the
created `output.runId`; clarify that `abc123` in the example is the commit SHA
shown earlier (not the generated runId) unless an `externalRunId` was explicitly
provided. Adjust the comment next to the console.log(`Run ${output.runId}:
${output.result}`) line to state that the displayed identifier will be the
commit SHA from the example above unless `externalRunId` is set, and leave
references to `output.runId` and `externalRunId` as-is.
docs/sdk/reference/eval-reporting.mdx-11-15 (1)

11-15: ⚠️ Potential issue | 🟡 Minor

MCPJAM_API_KEY is not always required

Line 13 says the env var is mandatory, but Lines 259-260 later document apiKey as a direct input override. That contradiction will confuse SDK users about whether they can authenticate purely through code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/sdk/reference/eval-reporting.mdx` around lines 11 - 15, The table entry
incorrectly marks MCPJAM_API_KEY as required; update the table so the
`MCPJAM_API_KEY` variable is listed as optional (Required = No or Optional) and
adjust its Description to state that authentication can alternatively be
provided via the SDK input `apiKey` (documented later), ensuring consistency
with the `apiKey` override mentioned in the `apiKey` documentation; also scan
for and reconcile any other places that claim the env var is mandatory to avoid
conflicting guidance.
mcpjam-inspector/client/src/components/evals/test-cases-overview.tsx-149-149 (1)

149-149: ⚠️ Potential issue | 🟡 Minor

Single-model suites now render a false empty state

When modelStats.length === 1, this branch suppresses the chart and the component falls through to “No model data available.” That hides valid data for the common one-model case.

Suggested fix
-            {modelStats.length > 1 ? (
+            {modelStats.length > 0 ? (
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/test-cases-overview.tsx` at line
149, The JSX branch currently only renders the chart when modelStats.length > 1,
which hides valid data for single-model suites; update the condition in the
test-cases-overview component so the chart renders for any non-empty modelStats
(e.g., modelStats.length > 0 or modelStats.length >= 1) instead of > 1, ensuring
the chart-rendering JSX (the branch using modelStats) displays when there is
exactly one model.
mcpjam-inspector/client/src/components/evals/ci-metadata-display.tsx-47-57 (1)

47-57: ⚠️ Potential issue | 🟡 Minor

Keep branch and commit badges plain when you can't derive repo URLs.

Falling back to runUrl makes those badges open the pipeline page for non-GitHub providers. That is misleading; only the Pipeline link should use runUrl.

🔧 Suggested fix
-  const branchUrl = branch
-    ? repoBaseUrl
-      ? `${repoBaseUrl}/tree/${encodeURIComponent(branch)}`
-      : runUrl
-    : undefined;
-  const commitUrl = shortSha
-    ? repoBaseUrl
-      ? `${repoBaseUrl}/commit/${encodeURIComponent(fullSha ?? shortSha)}`
-      : runUrl
-    : undefined;
+  const branchUrl = branch
+    ? repoBaseUrl
+      ? `${repoBaseUrl}/tree/${encodeURIComponent(branch)}`
+      : undefined
+    : undefined;
+  const commitUrl = fullSha
+    ? repoBaseUrl
+      ? `${repoBaseUrl}/commit/${encodeURIComponent(fullSha)}`
+      : undefined
+    : undefined;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/ci-metadata-display.tsx` around
lines 47 - 57, The branchUrl and commitUrl currently fall back to runUrl when
getGitHubRepoBaseUrl(runUrl) returns falsy, causing branch/commit badges to
erroneously link to the pipeline page; change branchUrl and commitUrl so they
are undefined unless repoBaseUrl is truthy (i.e., remove the runUrl fallback) —
update the logic in the block that computes repoBaseUrl, branchUrl, and
commitUrl (referencing getGitHubRepoBaseUrl, repoBaseUrl, branch, shortSha,
fullSha, runUrl) so only the Pipeline link uses runUrl while branch and commit
badges remain plain when repoBaseUrl cannot be derived.
mcpjam-inspector/client/src/components/evals/ci-suite-detail.tsx-200-205 (1)

200-205: ⚠️ Potential issue | 🟡 Minor

Handle stale testId and runId routes explicitly.

If the selected case or run no longer exists, this branch returns null and leaves the detail pane blank. Redirecting back to suite-overview or showing an empty state would make deleted or stale deep links recoverable.

Also applies to: 284-298

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/ci-suite-detail.tsx` around
lines 200 - 205, When the "test-detail" branch finds no matching selectedCase
(selectedCase is null) you must handle stale/deleted deep links instead of
returning null: detect the missing case and either navigate back to the suite
overview route (e.g. call navigate('/suite-overview') or invoke the existing
setViewMode('suite-overview')) or render a small empty-state component with a
message and a button to go back; apply the same fix for the analogous
"run-detail" branch that checks selectedRun/runId (the code around lines
284-298) so missing runs redirect or show the empty state as well. Ensure you
reference the existing viewMode, selectedTestId/selectedRunId checks and replace
the early `return null` with the redirect/state update or empty-state render.
mcpjam-inspector/client/src/components/CiEvalsTab.tsx-259-274 (1)

259-274: ⚠️ Potential issue | 🟡 Minor

Show loading before the empty state.

While queries.isOverviewLoading is true, sdkSuites is still empty, so the right panel briefly renders “No CI runs yet.” Checking the loading flag first avoids that false empty state.

💡 Minimal fix
-          {sdkSuites.length === 0 ? (
+          {queries.isOverviewLoading ? (
+            <div className="flex h-full items-center justify-center">
+              <div className="text-center">
+                <div className="mx-auto h-8 w-8 animate-spin rounded-full border-b-2 border-primary" />
+                <p className="mt-4 text-muted-foreground">Loading suites...</p>
+              </div>
+            </div>
+          ) : sdkSuites.length === 0 ? (
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/CiEvalsTab.tsx` around lines 259 -
274, The empty-state branch renders when sdkSuites is empty even while data is
loading; in CiEvalsTab change the conditional so you check
queries.isOverviewLoading first and render a loading indicator instead of the
"No CI runs yet" block. Locate the JSX that tests sdkSuites.length === 0 and
move or augment that branch to short-circuit on queries.isOverviewLoading (use
the existing queries.isOverviewLoading flag), otherwise fall through to the
existing sdkSuites/route.type/selectedSuite logic.
mcpjam-inspector/client/src/lib/ci-evals-router.ts-122-124 (1)

122-124: ⚠️ Potential issue | 🟡 Minor

Errant leading slash in replaceState path.

The constructed URL /${hash} resolves to something like //#/ci-evals — an extra / before the hash. This may cause unexpected behavior or navigation issues.

🔧 Proposed fix
  if (options?.replace) {
-   history.replaceState({}, "", `/${hash}`);
+   history.replaceState({}, "", hash);
    window.dispatchEvent(new HashChangeEvent("hashchange"));
  } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/lib/ci-evals-router.ts` around lines 122 - 124,
The replaceState call is building a path with an extra leading slash, causing
URLs like `//#/ci-evals`; change the history.replaceState invocation so it uses
the provided hash string without adding a leading "/" (i.e., pass the raw hash
value rather than `"/" + hash`) and still dispatch the HashChangeEvent as before
(refer to history.replaceState and the subsequent new
HashChangeEvent("hashchange") in the options?.replace block).
sdk/src/McpAppsOpenAICompatibleRuntime.bundled.ts-1-4 (1)

1-4: ⚠️ Potential issue | 🟡 Minor

This file requires manual synchronization with its source.

The SDK copy must be updated whenever the source runtime changes. A build step or automated sync would eliminate drift risk and prevent divergence between the mcpjam-inspector source and SDK copy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/McpAppsOpenAICompatibleRuntime.bundled.ts` around lines 1 - 4, The
SDK copy McpAppsOpenAICompatibleRuntime.bundled.ts is manually synchronized from
mcpjam-inspector/server/routes/apps/mcp-apps/McpAppsOpenAICompatibleRuntime.ts
which causes drift; add an automated sync and verification: implement a
build/prepublish script (e.g., npm script or Makefile target) that copies or
generates McpAppsOpenAICompatibleRuntime.bundled.ts from the source file and a
CI check that compares checksums (or runs the same copy) to fail the build if
out-of-sync, and wire this into the repository CI so changes to the source
automatically update or flag the SDK copy.
sdk/src/eval-result-mapping.ts-145-154 (1)

145-154: ⚠️ Potential issue | 🟡 Minor

Variable testResults shadows the outer parameter.

Line 147 declares const testResults which shadows the function parameter testResults: Map<string, EvalRunResult> on line 140. This compiles but breeds confusion.

Proposed fix
   for (const [testName, testRun] of testResults) {
     const expectedToolCalls = options.expectedToolCallsByTest?.[testName];
-    const testResults = runToEvalResults(testRun, {
+    const mappedResults = runToEvalResults(testRun, {
       casePrefix: `${options.casePrefix}-${testName}`,
       provider: options.provider,
       model: options.model,
       expectedToolCalls,
       promptSelector: options.promptSelector,
     });
-    results.push(...testResults);
+    results.push(...mappedResults);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/eval-result-mapping.ts` around lines 145 - 154, The inner const
testResults declared inside the for-loop shadows the function parameter
testResults; rename the inner variable (e.g., to caseResults or evalResults)
returned from runToEvalResults to avoid shadowing and confusion, update its
usages in the loop where results.push(...testResults) is called, and ensure
references to runToEvalResults, the loop binding ([testName, testRun]), options,
and the results array are updated accordingly.
mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts-567-569 (1)

567-569: ⚠️ Potential issue | 🟡 Minor

Mutating part.toolCallId may cause side effects.

Line 568 assigns to part.toolCallId on the original content part. If the caller retains references to the trace data, this mutation could leak across invocations.

Proposed fix: avoid mutating input
       const matchedResult = claimMatchingResult(
         resultBuckets,
         toolCallId,
         !part.toolCallId,
       )?.part;

-      if (!part.toolCallId) {
-        part.toolCallId = toolCallId;
-      }
+      const effectiveToolCallId = part.toolCallId ?? toolCallId;

       parts.push(
         ...buildToolParts({
-          toolCall: part,
+          toolCall: { ...part, toolCallId: effectiveToolCallId },
           matchedResult,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts` around
lines 567 - 569, The code mutates the input object by assigning to
part.toolCallId which can leak state; instead avoid mutating the original part
by creating a shallow clone before setting the property (e.g., make a newPart =
{ ...part } or use Object.assign({}, part)), set newPart.toolCallId = toolCallId
only when missing, and use the cloned object in the returned data/array so
callers retain immutable input; update the logic around part and toolCallId to
operate on the clone rather than mutating part directly.
sdk/src/report-eval-results.ts-395-415 (1)

395-415: ⚠️ Potential issue | 🟡 Minor

Tautological condition: config.baseUrl.length >= 0 is always true.

String lengths are never negative. This appears to be vestigial or an incomplete check.

Remove the dead condition
   const bytes = getByteLength(JSON.stringify(body));
-  return bytes <= CHUNK_TARGET_BYTES && config.baseUrl.length >= 0;
+  return bytes <= CHUNK_TARGET_BYTES;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/report-eval-results.ts` around lines 395 - 415, In
shouldUseOneShotUpload the condition(config.baseUrl.length >= 0) is
tautological; remove it or change it to a meaningful check (e.g.,
config.baseUrl.length > 0) so the function actually verifies baseUrl is
non-empty; update the return expression in shouldUseOneShotUpload to only depend
on bytes <= CHUNK_TARGET_BYTES (and optionally config.baseUrl.length > 0) and
remove the dead >= 0 check referencing config.baseUrl.
sdk/src/report-eval-results.ts-161-174 (1)

161-174: ⚠️ Potential issue | 🟡 Minor

Silent fallback to empty object may violate type contract.

When response.json() throws, responseBody becomes undefined, and line 173 returns {} as T. If T has required fields (e.g., StartRunResponse.suiteId), callers will encounter undefined values unexpectedly.

Consider throwing when JSON parsing fails on a successful response, or at minimum logging a warning.

Proposed defensive handling
-      let responseBody: BackendEnvelope<T> | undefined;
-      try {
-        responseBody = (await response.json()) as BackendEnvelope<T>;
-      } catch {
-        responseBody = undefined;
-      }
+      let responseBody: BackendEnvelope<T>;
+      try {
+        responseBody = (await response.json()) as BackendEnvelope<T>;
+      } catch (parseError) {
+        if (response.ok) {
+          console.warn("[mcpjam/sdk] response.ok but JSON parse failed");
+        }
+        responseBody = {} as BackendEnvelope<T>;
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/report-eval-results.ts` around lines 161 - 174, The current logic in
report-eval-results.ts silently falls back to returning (responseBody ?? {}) as
T when response.json() fails, which can break callers expecting required fields;
update the handling around response.json() so that if parsing throws and
response.ok is true you throw an explicit Error (including status and parsing
failure context) instead of returning an empty object; adjust the code that
checks responseBody/response.ok and the final return (responseBody ?? {}) as T
to rely on a guaranteed parsed body or raise, referencing the response.json()
call, the responseBody variable, and the response.ok branch to locate where to
change the behavior.
🧹 Nitpick comments (14)
mcpjam-inspector/client/src/components/evals/__tests__/iteration-details-ui.test.tsx (1)

27-41: Unused mock: mockGetBlob is defined but never exercised.

The useAction mock returns mockGetBlob, which gets reset each test, yet neither test invokes nor asserts against it. If blob-fetching behavior isn't under test here, consider removing the mock entirely to reduce noise—or add a test that exercises it.

♻️ Optional: Remove unused mock if not needed
-const { mockGetBlob, mockJsonEditor } = vi.hoisted(() => ({
-  mockGetBlob: vi.fn(),
+const { mockJsonEditor } = vi.hoisted(() => ({
   mockJsonEditor: vi.fn((props: any) => (
     <div data-testid="json-editor">{JSON.stringify(props.value)}</div>
   )),
 }));

And update the convex/react mock:

 vi.mock("convex/react", () => ({
-  useAction: () => mockGetBlob,
+  useAction: () => vi.fn(),
 }));

Then remove the reset in beforeEach:

 beforeEach(() => {
-  mockGetBlob.mockReset();
   mockJsonEditor.mockClear();
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@mcpjam-inspector/client/src/components/evals/__tests__/iteration-details-ui.test.tsx`
around lines 27 - 41, The test file defines and resets mockGetBlob but never
uses it; either remove the unused mock and its reset or exercise it in a test:
delete the vi.mock("convex/react", ...) block (and remove mockGetBlob and its
reset in beforeEach) if blob-fetching isn’t relevant, or keep the mock and add a
test that calls the component/code path that invokes useAction (returning
mockGetBlob) and assert mockGetBlob was called; reference mockGetBlob, the
useAction mock in the "convex/react" vi.mock call, and the beforeEach reset to
locate and update the code.
mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer.test.tsx (1)

9-50: Consider using shared mock presets.

The inline mocks for preferences, provider logos, and UI components work correctly. However, if mcpApiPresets or storePresets in client/src/test/mocks/ offer equivalent utilities, leveraging them would improve consistency across the test suite.

As per coding guidelines: "Use mock presets from client/src/test/mocks/ including mcpApiPresets and storePresets in client tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer.test.tsx`
around lines 9 - 50, Tests currently define inline mocks for
usePreferencesStore, getProviderLogo, JsonEditor, and MessageView (which calls
mockMessageView); replace these ad-hoc mocks by importing and applying the
shared mock presets (e.g., storePresets and mcpApiPresets) used across the test
suite so the same mocked preferences/provider/logo/UI behavior is reused; update
the test setup to use the presets before rendering (ensuring
usePreferencesStore, getProviderLogo, JsonEditor, and MessageView behaviors are
provided by the presets) and remove the corresponding inline vi.mock blocks.
mcpjam-inspector/client/src/components/evals/ci-suite-list-sidebar.tsx (1)

98-136: Well-crafted suite list rendering.

The component elegantly handles loading, empty, and populated states. The sparkline visualization with bounded height calculation is a nice touch. Status indicators and relative timestamps create an informative at-a-glance view.

Consider adding aria-current="true" to the selected suite button for enhanced screen reader navigation:

♿ Optional accessibility enhancement
                 <button
                   key={entry.suite._id}
                   onClick={() => onSelectSuite(entry.suite._id)}
+                  aria-current={selectedSuiteId === entry.suite._id ? "true" : undefined}
                   className={cn(
                     "w-full px-4 py-2.5 text-left transition-colors hover:bg-accent/50",
                     selectedSuiteId === entry.suite._id && "bg-accent",
                   )}
                 >
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/ci-suite-list-sidebar.tsx`
around lines 98 - 136, Add an accessibility marker to the selected suite button:
when rendering the button in the CI suite list (the <button> that uses
onSelectSuite and key={entry.suite._id}), include aria-current="true" when
selectedSuiteId === entry.suite._id (and omit or set to false otherwise) so
screen readers can identify the current/active suite; update the conditional
props on that button element accordingly.
sdk/tests/TestAgent.test.ts (1)

397-446: Restore the console.warn spy even on failure.

If this case fails before the last line, console.warn stays mocked and can skew later assertions. A try/finally or afterEach(jest.restoreAllMocks) keeps the suite isolated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/tests/TestAgent.test.ts` around lines 397 - 446, The test uses
jest.spyOn(console, "warn") to mock warnings but doesn't guarantee restoration
on failure; wrap the spy usage in a try/finally (or install a suite-level
cleanup like afterEach(jest.restoreAllMocks)) so console.warn is always
restored: create the warnSpy via jest.spyOn(console, "warn"), run the test logic
inside try { ... } and call warnSpy.mockRestore() in finally {}, or add
afterEach to call jest.restoreAllMocks(); reference the local warnSpy variable
and the test case "warns and skips widget snapshots when resource reads fail" /
TestAgent instantiation to locate where to add the cleanup.
sdk/tests/eval-run-reporter.test.ts (1)

162-208: Deduplicate the mock run builders.

createMockIterationResult and createMockRunResult are copied verbatim into both blocks. Pulling them to file scope will make future result-shape changes land in one place instead of two.

Also applies to: 240-286

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/tests/eval-run-reporter.test.ts` around lines 162 - 208, The two helper
builders createMockIterationResult and createMockRunResult are duplicated in
multiple describe blocks; move both functions out of the inner describe into
file scope (top-level within the test file) so all tests share the single
implementations, then update/remove the inner copies so tests call the top-level
createMockIterationResult and createMockRunResult; ensure the signatures and
returned types (IterationResult, EvalRunResult) remain unchanged so all usages
compile.
sdk/skills/create-mcp-eval/SKILL.md (2)

139-144: Add language specifier to fenced code block.

The .gitignore additions block lacks a language identifier, which aids syntax highlighting and accessibility tooling.

📝 Suggested fix
 ### .gitignore additions
-```
+```gitignore
 node_modules/
 dist/
 .env
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @sdk/skills/create-mcp-eval/SKILL.md around lines 139 - 144, Update the
fenced code block under the "### .gitignore additions" section so the opening
triple backticks include the language specifier gitignore (i.e., change the
block opener from togitignore) to enable proper syntax highlighting and
accessibility tooling for the listed entries.


</details>

---

`614-620`: **Minor: Emphasis style inconsistency.**

Static analysis prefers underscores for emphasis. The word "unique" uses asterisks.


<details>
<summary>📝 Suggested fix</summary>

```diff
-4. **Write unambiguous prompts for similar tools** — when a server has tools with overlapping descriptions (e.g., `create_view` vs `export_to_excalidraw`), prompts must reference the tool's *unique* action.
+4. **Write unambiguous prompts for similar tools** — when a server has tools with overlapping descriptions (e.g., `create_view` vs `export_to_excalidraw`), prompts must reference the tool's _unique_ action.
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@sdk/skills/create-mcp-eval/SKILL.md` around lines 614 - 620, The emphasis
style is inconsistent: the word "unique" in item 4 uses asterisks instead of the
preferred underscores; update the markdown to use underscores for emphasis
(e.g., change *unique* to _unique_) in the section discussing tool descriptions
(item 4, the sentence that currently says "prompts must reference the tool's
*unique* action") so emphasis style matches the rest of the document.
```

</details>

</blockquote></details>
<details>
<summary>mcpjam-inspector/client/src/lib/ci-evals-router.ts (1)</summary><blockquote>

`151-151`: **Import placed after usage.**

The `React` import resides at the file's end, after all references to `React.useState` and `React.useEffect`. While hoisting ensures correctness, conventional placement at the top aids readability.


<details>
<summary>✨ Proposed relocation</summary>

```diff
+import React from "react";
+
 /**
  * CI Evals Router - Hash-based routing for CI evals tab
```

Then remove line 151.
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/lib/ci-evals-router.ts` at line 151, The React
default import is located after its usages (React.useState, React.useEffect)
which reduces readability; move the line importing React (import React from
"react";) to the top of the file alongside other imports so React is declared
before any references, and remove the trailing duplicate import at the end of
the file (the current import at line with "import React from \"react\";") to
avoid redundancy.
```

</details>

</blockquote></details>
<details>
<summary>sdk/src/eval-result-mapping.ts (1)</summary><blockquote>

`163-214`: **Consider consolidating duplicated aggregation logic.**

`iterationsToEvalResultInputs` replicates the prompt-aggregation pattern from `iterationToEvalResult`. The repeated `flatMap` chains for tool calls, traces, and snapshots could be extracted into a shared helper, reducing surface area for drift.

This is a low-priority refactor—the current implementation is correct.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/eval-result-mapping.ts` around lines 163 - 214, The
iterationsToEvalResultInputs function duplicates the prompt-aggregation logic
found in iterationToEvalResult; refactor by extracting a helper (e.g.,
collectPromptAggregates or aggregatePrompts) that accepts an IterationResult or
prompts array and returns actualToolCalls, traceMessages (or trace object), and
widgetSnapshots; update iterationsToEvalResultInputs and iterationToEvalResult
to call that helper so the flatMap chains for
getToolCalls/getMessages/getWidgetSnapshots are centralized and reused.
```

</details>

</blockquote></details>
<details>
<summary>sdk/src/report-eval-results.ts (4)</summary><blockquote>

`192-203`: **Regex-based retry heuristic is fragile.**

Matching `/network|fetch|timeout|429|5\d\d/i` against `error.message` may trigger on unrelated errors containing these substrings (e.g., an error message like "Failed to fetch network configuration"). Consider tightening the pattern or prioritizing explicit error types.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/report-eval-results.ts` around lines 192 - 203, The retry heuristic
in the shouldRetry logic is too broad; update it to prefer explicit error-type
and status checks and tighten the regex: keep the existing isAbortError check,
then check for known error shapes (e.g., error.name === "FetchError" or (error
as any).type === "system" for node fetch), and check numeric HTTP statuses
explicitly via (error as any).status === 429 || ((error as any).status >= 500 &&
(error as any).status < 600); only if none of those shape checks exist fall back
to a much stricter regex (use word boundaries like /\b(timeout|timeouted|fetch
failed)\b/i or /\b5\d{2}\b/ to avoid accidental matches). Update the shouldRetry
variable construction (the block that sets shouldRetry) and keep the existing
retry loop using config.retryDelaysMs, sleep and jitter unchanged.
```

</details>

---

`471-484`: **Type assertions on server response assume constrained values.**

Casting `start.status as "completed" | "failed"` and `start.result as "passed" | "failed"` trusts the server. If the backend evolves (e.g., adds `"pending"`), this could silently propagate invalid states.

Consider a runtime guard or, at minimum, document the assumption.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/report-eval-results.ts` around lines 471 - 484, The current return
block trusts server values by casting start.status and start.result to tight
literal unions; add a runtime guard instead: implement small validators (e.g.,
isValidStatus(status) and isValidResult(result)) and use them before returning
from the block that reads start (the local variable named start in
report-eval-results.ts) — if the validators pass, return the object with the
validated values, otherwise handle the unexpected case (throw a descriptive
error, log and fall back to a safe default, or mark as "failed") so
invalid/unknown server values are not silently asserted into the narrowed types.
```

</details>

---

`270-326`: **`uploadBlobToConvex` lacks timeout/abort handling.**

Unlike `requestWithRetry`, this function has no `AbortController` timeout. Large uploads or unresponsive servers could hang indefinitely.


<details>
<summary>Add timeout parity with requestWithRetry</summary>

```diff
 async function uploadBlobToConvex(
   config: RuntimeConfig,
   uploadUrl: string,
   body: string,
   contentType: string
 ): Promise<string> {
   let lastError: unknown;

   for (let attempt = 0; attempt <= config.retryDelaysMs.length; attempt++) {
+    const controller = new AbortController();
+    const timeoutHandle = setTimeout(() => controller.abort(), config.timeoutMs);
     try {
       const response = await fetch(uploadUrl, {
         method: "POST",
         headers: {
           "Content-Type": contentType,
         },
         body,
+        signal: controller.signal,
       });
+      clearTimeout(timeoutHandle);
       // ... rest unchanged
     } catch (error) {
+      clearTimeout(timeoutHandle);
       lastError = error;
       // ... existing retry logic
     }
   }
   // ...
 }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/report-eval-results.ts` around lines 270 - 326, The
uploadBlobToConvex function lacks request timeout/abort handling; update
uploadBlobToConvex to use an AbortController (like requestWithRetry) and abort
the fetch after the configured timeout (e.g., config.requestTimeoutMs) so large
uploads or hung servers don't block indefinitely: create an AbortController
before fetch, pass its signal to fetch, set a timer to call controller.abort()
after the timeout (and clear it on success), and ensure abort errors are treated
as retryable in the existing catch logic so behavior matches requestWithRetry.
```

</details>

---

`88-119`: **Chunking logic handles count and byte limits well.**

One edge case: a single `EvalResultInput` exceeding `maxBytes` will still be pushed as its own chunk. If the backend enforces a hard payload limit, this could fail. Consider logging a warning when a single-item chunk exceeds the target.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/report-eval-results.ts` around lines 88 - 119, In
chunkResultsForUpload detect when a single EvalResultInput alone exceeds
maxBytes and emit a warning; specifically, where candidateBytes is computed for
candidate = [...currentChunk, result], if candidateBytes > maxBytes and
currentChunk.length === 0 (meaning this single result already exceeds maxBytes)
call a logger (e.g., console.warn or the module logger) with identifying info
from the result (such as an id or prompt hash) and the computed candidateBytes
and maxBytes so the oversized item is visible for debugging; leave the rest of
the splitting logic intact so the oversized item is still pushed as its own
chunk.
```

</details>

</blockquote></details>
<details>
<summary>sdk/src/eval-reporting-types.ts (1)</summary><blockquote>

`31-36`: **`Record<string, never>` as a presence flag is unconventional.**

This pattern forces each permission to be `{}` when present. While valid, it's less intuitive than a simple `boolean` flag. If backward compatibility permits, consider:

```typescript
camera?: boolean;
microphone?: boolean;
// ...
```

Otherwise, a brief comment clarifying the intent would aid maintainability.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/eval-reporting-types.ts` around lines 31 - 36, The
EvalWidgetPermissions type uses Record<string, never> for camera, microphone,
geolocation, and clipboardWrite which is unconventional; update the type so each
permission is an optional boolean (camera?: boolean; microphone?: boolean;
geolocation?: boolean; clipboardWrite?: boolean) to act as a presence flag, or
if you must keep the current shape for compatibility, add a brief comment above
EvalWidgetPermissions explaining that Record<string, never> is intentionally
used to represent an empty-object presence flag and why it cannot be changed.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx:

  • Around line 130-141: The fallbacks use rawOutput instead of the normalized
    resolvedToolOutput, so when callers only pass toolOutput the widget metadata and
    serverId can be missed; update effectiveToolMeta to call
    readToolResultMeta(resolvedToolOutput) instead of readToolResultMeta(rawOutput),
    update uiType to call detectUIType(effectiveToolMeta, resolvedToolOutput), and
    update the serverId fallback to call readToolResultServerId(resolvedToolOutput)
    so all metadata extraction uses the resolvedToolOutput variable.

In
@mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer-adapter.test.ts:

  • Around line 228-229: The test currently works around a mutation by
    deep-cloning traces in makeTrace but the real fix is to make
    adaptTraceToUiMessages pure and update the test to assert no mutation: change
    adaptTraceToUiMessages so it never writes back into caller-owned objects (create
    new part/message objects when you need to add toolCallId or other synthetic
    fields instead of mutating the input), and then remove the deep-clone in the
    test (makeTrace) and add an assertion that the original TraceEnvelope and its
    parts remain unchanged after calling adaptTraceToUiMessages.

In @sdk/src/artifact-parsers/jest-json.ts:

  • Around line 45-46: The parser currently collapses any non-"passed"
    assertion.status into "failed" (see variables status and passed in
    sdk/src/artifact-parsers/jest-json.ts), which misclassifies
    skipped/pending/todo/disabled tests; update the logic to preserve explicit Jest
    statuses instead of defaulting to "failed": keep status = assertion.status (no
    fallback to "failed"), compute passed as status === "passed", and map other
    statuses (e.g., "skipped","pending","todo","disabled") to a
    non-failed/non-passed outcome (or a dedicated "skipped"/"pending" state) so
    suite pass-rate calculations consume the correct status values. Ensure any
    downstream code that reads passed uses the new status values consistently
    (adjust any use of the passed variable or status mapping in the same module).

In @sdk/src/artifact-parsers/junit-xml.ts:

  • Around line 3-11: The attribute parser parseAttributes only matches
    double-quoted attributes and the failure/error/skipped handling ignores
    self-closing elements and single-quoted attributes; update parseAttributes to
    accept either double or single quotes (e.g., allow quote capture and use it to
    extract value) and ensure callers that parse , , and
    elements also check for a message attribute (message='...' or message="...") on
    self-closing tags as well as for inner text in paired tags (use the same
    attribute parser to read message and type), so tests get proper titles and
    concrete messages instead of defaults; update any parsing logic that reads body
    text (the code around the existing failure handling) to prefer message attribute
    if present, otherwise fall back to inner text.

In @sdk/src/artifact-parsers/vitest-json.ts:

  • Around line 42-43: The parser currently treats any non-"pass" state as a
    failure; change the logic in vitest-json.ts around where you compute
    state/passed (using task.result?.state) to explicitly detect "skipped" and
    "todo" (and also check task.mode === "skip" or "todo") and mark those tests as
    skipped/todo instead of failed; only set passed = true when state === "pass",
    set a new/skipped flag when state === "skipped" || state === "todo" || task.mode
    === "skip" || task.mode === "todo", and ensure downstream scoring/aggregation
    (the code that consumes passed/failure counts around the same parser functions)
    excludes skipped/todo entries from failure and pass-rate calculations rather
    than counting them as failures.

In @sdk/src/eval-run-reporter.ts:

  • Around line 129-141: The reporter mutates shared state (buffered, addedCount,
    etc.) from record(), flush(), finalize(), and startEvalRun() without
    serialization, allowing concurrent calls to race; fix by serializing all
    mutations with a single in-flight guard (e.g. a private mutex/promise queue)
    around add(), record(), flush(), finalize(), and startEvalRun() so callers await
    the lock before mutating or uploading, ensure flush() clears the buffer only
    under the lock and that record() awaits any ongoing flush/upload to avoid
    double-appending, and update methods (record, flush, finalize, add,
    startEvalRun) to acquire/release the guard to enforce sequential mutation.
  • Around line 202-299: The incremental path lets network errors escape once
    runId exists, so for non-strict mode you must catch failures from startEvalRun,
    appendEvalRunIterations, and finalizeEvalRun and fall back to
    buildLocalFallbackResult() instead of letting errors propagate. Modify flush()
    to wrap the startEvalRun/appendEvalRunIterations loop in try/catch: on error, if
    this.input.strict is true rethrow, otherwise set this.completedResult =
    this.buildLocalFallbackResult(), this.finalized = true, this.buffered = [], and
    return. Do the same in finalize(): wrap await this.flush() and
    finalizeEvalRun(...) in try/catch and apply the same strict vs non-strict
    handling to produce the local fallback result and mark
    completedResult/finalized. Ensure you reference these symbols (flush, finalize,
    startEvalRun, appendEvalRunIterations, finalizeEvalRun,
    buildLocalFallbackResult, input.strict, completedResult, finalized, buffered)
    when making the changes.

In @sdk/tests/asana-sdk-classes.test.ts:

  • Around line 26-48: Remove the hardcoded credential fallbacks and require
    environment variables only: delete literal default strings for MCPJAM_API_KEY,
    MCPJAM_BASE_URL, ASANA_REFRESH_TOKEN, ASANA_CLIENT_ID, OPENROUTER_API_KEY (and
    any other secret defaults) so those constants read directly from process.env;
    keep the existing gating logic (RUN_INGESTION_TESTS / RUN_LLM_INGESTION) so
    tests skip when env vars are absent and update any test setup that relied on the
    defaults to handle missing envs gracefully.

Minor comments:
In @docs/sdk/reference/eval-reporting.mdx:

  • Around line 49-50: Update the inline comment after the console.log sample so
    it doesn't claim abc123 is the created output.runId; clarify that abc123
    in the example is the commit SHA shown earlier (not the generated runId) unless
    an externalRunId was explicitly provided. Adjust the comment next to the
    console.log(Run ${output.runId}: ${output.result}) line to state that the
    displayed identifier will be the commit SHA from the example above unless
    externalRunId is set, and leave references to output.runId and
    externalRunId as-is.
  • Around line 11-15: The table entry incorrectly marks MCPJAM_API_KEY as
    required; update the table so the MCPJAM_API_KEY variable is listed as
    optional (Required = No or Optional) and adjust its Description to state that
    authentication can alternatively be provided via the SDK input apiKey
    (documented later), ensuring consistency with the apiKey override mentioned in
    the apiKey documentation; also scan for and reconcile any other places that
    claim the env var is mandatory to avoid conflicting guidance.

In @mcpjam-inspector/client/src/components/CiEvalsTab.tsx:

  • Around line 259-274: The empty-state branch renders when sdkSuites is empty
    even while data is loading; in CiEvalsTab change the conditional so you check
    queries.isOverviewLoading first and render a loading indicator instead of the
    "No CI runs yet" block. Locate the JSX that tests sdkSuites.length === 0 and
    move or augment that branch to short-circuit on queries.isOverviewLoading (use
    the existing queries.isOverviewLoading flag), otherwise fall through to the
    existing sdkSuites/route.type/selectedSuite logic.

In @mcpjam-inspector/client/src/components/evals/ci-metadata-display.tsx:

  • Around line 47-57: The branchUrl and commitUrl currently fall back to runUrl
    when getGitHubRepoBaseUrl(runUrl) returns falsy, causing branch/commit badges to
    erroneously link to the pipeline page; change branchUrl and commitUrl so they
    are undefined unless repoBaseUrl is truthy (i.e., remove the runUrl fallback) —
    update the logic in the block that computes repoBaseUrl, branchUrl, and
    commitUrl (referencing getGitHubRepoBaseUrl, repoBaseUrl, branch, shortSha,
    fullSha, runUrl) so only the Pipeline link uses runUrl while branch and commit
    badges remain plain when repoBaseUrl cannot be derived.

In @mcpjam-inspector/client/src/components/evals/ci-suite-detail.tsx:

  • Around line 200-205: When the "test-detail" branch finds no matching
    selectedCase (selectedCase is null) you must handle stale/deleted deep links
    instead of returning null: detect the missing case and either navigate back to
    the suite overview route (e.g. call navigate('/suite-overview') or invoke the
    existing setViewMode('suite-overview')) or render a small empty-state component
    with a message and a button to go back; apply the same fix for the analogous
    "run-detail" branch that checks selectedRun/runId (the code around lines
    284-298) so missing runs redirect or show the empty state as well. Ensure you
    reference the existing viewMode, selectedTestId/selectedRunId checks and replace
    the early return null with the redirect/state update or empty-state render.

In @mcpjam-inspector/client/src/components/evals/test-cases-overview.tsx:

  • Line 149: The JSX branch currently only renders the chart when
    modelStats.length > 1, which hides valid data for single-model suites; update
    the condition in the test-cases-overview component so the chart renders for any
    non-empty modelStats (e.g., modelStats.length > 0 or modelStats.length >= 1)
    instead of > 1, ensuring the chart-rendering JSX (the branch using modelStats)
    displays when there is exactly one model.

In @mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts:

  • Around line 567-569: The code mutates the input object by assigning to
    part.toolCallId which can leak state; instead avoid mutating the original part
    by creating a shallow clone before setting the property (e.g., make a newPart =
    { ...part } or use Object.assign({}, part)), set newPart.toolCallId = toolCallId
    only when missing, and use the cloned object in the returned data/array so
    callers retain immutable input; update the logic around part and toolCallId to
    operate on the clone rather than mutating part directly.

In @mcpjam-inspector/client/src/lib/ci-evals-router.ts:

  • Around line 122-124: The replaceState call is building a path with an extra
    leading slash, causing URLs like //#/ci-evals; change the history.replaceState
    invocation so it uses the provided hash string without adding a leading "/"
    (i.e., pass the raw hash value rather than "/" + hash) and still dispatch the
    HashChangeEvent as before (refer to history.replaceState and the subsequent new
    HashChangeEvent("hashchange") in the options?.replace block).

In @sdk/src/eval-result-mapping.ts:

  • Around line 145-154: The inner const testResults declared inside the for-loop
    shadows the function parameter testResults; rename the inner variable (e.g., to
    caseResults or evalResults) returned from runToEvalResults to avoid shadowing
    and confusion, update its usages in the loop where results.push(...testResults)
    is called, and ensure references to runToEvalResults, the loop binding
    ([testName, testRun]), options, and the results array are updated accordingly.

In @sdk/src/mcp-client-manager/MCPClientManager.ts:

  • Around line 431-440: getToolMetadata currently returns a shallow copy via
    spread which leaves nested objects/arrays aliased to toolsMetadataCache; change
    getToolMetadata to return a deep-cloned or readonly structure to prevent
    accidental mutation of the cache: update the implementation in getToolMetadata
    to deep clone the metadata (e.g., use structuredClone(metadata) with a safe
    fallback like JSON.parse(JSON.stringify(metadata)) or lodash/cloneDeep) and
    optionally change the signature to return Readonly<Record<string, unknown>> so
    callers cannot mutate the result; ensure metadataMap?.get(toolName) is passed
    through the deep-clone/readonly conversion before returning.

In @sdk/src/McpAppsOpenAICompatibleRuntime.bundled.ts:

  • Around line 1-4: The SDK copy McpAppsOpenAICompatibleRuntime.bundled.ts is
    manually synchronized from
    mcpjam-inspector/server/routes/apps/mcp-apps/McpAppsOpenAICompatibleRuntime.ts
    which causes drift; add an automated sync and verification: implement a
    build/prepublish script (e.g., npm script or Makefile target) that copies or
    generates McpAppsOpenAICompatibleRuntime.bundled.ts from the source file and a
    CI check that compares checksums (or runs the same copy) to fail the build if
    out-of-sync, and wire this into the repository CI so changes to the source
    automatically update or flag the SDK copy.

In @sdk/src/report-eval-results.ts:

  • Around line 395-415: In shouldUseOneShotUpload the
    condition(config.baseUrl.length >= 0) is tautological; remove it or change it to
    a meaningful check (e.g., config.baseUrl.length > 0) so the function actually
    verifies baseUrl is non-empty; update the return expression in
    shouldUseOneShotUpload to only depend on bytes <= CHUNK_TARGET_BYTES (and
    optionally config.baseUrl.length > 0) and remove the dead >= 0 check referencing
    config.baseUrl.
  • Around line 161-174: The current logic in report-eval-results.ts silently
    falls back to returning (responseBody ?? {}) as T when response.json() fails,
    which can break callers expecting required fields; update the handling around
    response.json() so that if parsing throws and response.ok is true you throw an
    explicit Error (including status and parsing failure context) instead of
    returning an empty object; adjust the code that checks responseBody/response.ok
    and the final return (responseBody ?? {}) as T to rely on a guaranteed parsed
    body or raise, referencing the response.json() call, the responseBody variable,
    and the response.ok branch to locate where to change the behavior.

Nitpick comments:
In
@mcpjam-inspector/client/src/components/evals/__tests__/iteration-details-ui.test.tsx:

  • Around line 27-41: The test file defines and resets mockGetBlob but never uses
    it; either remove the unused mock and its reset or exercise it in a test: delete
    the vi.mock("convex/react", ...) block (and remove mockGetBlob and its reset in
    beforeEach) if blob-fetching isn’t relevant, or keep the mock and add a test
    that calls the component/code path that invokes useAction (returning
    mockGetBlob) and assert mockGetBlob was called; reference mockGetBlob, the
    useAction mock in the "convex/react" vi.mock call, and the beforeEach reset to
    locate and update the code.

In
@mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer.test.tsx:

  • Around line 9-50: Tests currently define inline mocks for usePreferencesStore,
    getProviderLogo, JsonEditor, and MessageView (which calls mockMessageView);
    replace these ad-hoc mocks by importing and applying the shared mock presets
    (e.g., storePresets and mcpApiPresets) used across the test suite so the same
    mocked preferences/provider/logo/UI behavior is reused; update the test setup to
    use the presets before rendering (ensuring usePreferencesStore, getProviderLogo,
    JsonEditor, and MessageView behaviors are provided by the presets) and remove
    the corresponding inline vi.mock blocks.

In @mcpjam-inspector/client/src/components/evals/ci-suite-list-sidebar.tsx:

  • Around line 98-136: Add an accessibility marker to the selected suite button:
    when rendering the button in the CI suite list (the that uses
    onSelectSuite and key={entry.suite._id}), include aria-current="true" when
    selectedSuiteId === entry.suite._id (and omit or set to false otherwise) so
    screen readers can identify the current/active suite; update the conditional
    props on that button element accordingly.

In @mcpjam-inspector/client/src/lib/ci-evals-router.ts:

  • Line 151: The React default import is located after its usages
    (React.useState, React.useEffect) which reduces readability; move the line
    importing React (import React from "react";) to the top of the file alongside
    other imports so React is declared before any references, and remove the
    trailing duplicate import at the end of the file (the current import at line
    with "import React from "react";") to avoid redundancy.

In @sdk/skills/create-mcp-eval/SKILL.md:

  • Around line 139-144: Update the fenced code block under the "### .gitignore
    additions" section so the opening triple backticks include the language
    specifier gitignore (i.e., change the block opener from togitignore)
    to enable proper syntax highlighting and accessibility tooling for the listed
    entries.
  • Around line 614-620: The emphasis style is inconsistent: the word "unique" in
    item 4 uses asterisks instead of the preferred underscores; update the markdown
    to use underscores for emphasis (e.g., change unique to unique) in the
    section discussing tool descriptions (item 4, the sentence that currently says
    "prompts must reference the tool's unique action") so emphasis style matches
    the rest of the document.

In @sdk/src/eval-reporting-types.ts:

  • Around line 31-36: The EvalWidgetPermissions type uses Record<string, never>
    for camera, microphone, geolocation, and clipboardWrite which is unconventional;
    update the type so each permission is an optional boolean (camera?: boolean;
    microphone?: boolean; geolocation?: boolean; clipboardWrite?: boolean) to act as
    a presence flag, or if you must keep the current shape for compatibility, add a
    brief comment above EvalWidgetPermissions explaining that Record<string, never>
    is intentionally used to represent an empty-object presence flag and why it
    cannot be changed.

In @sdk/src/eval-result-mapping.ts:

  • Around line 163-214: The iterationsToEvalResultInputs function duplicates the
    prompt-aggregation logic found in iterationToEvalResult; refactor by extracting
    a helper (e.g., collectPromptAggregates or aggregatePrompts) that accepts an
    IterationResult or prompts array and returns actualToolCalls, traceMessages (or
    trace object), and widgetSnapshots; update iterationsToEvalResultInputs and
    iterationToEvalResult to call that helper so the flatMap chains for
    getToolCalls/getMessages/getWidgetSnapshots are centralized and reused.

In @sdk/src/report-eval-results.ts:

  • Around line 192-203: The retry heuristic in the shouldRetry logic is too
    broad; update it to prefer explicit error-type and status checks and tighten the
    regex: keep the existing isAbortError check, then check for known error shapes
    (e.g., error.name === "FetchError" or (error as any).type === "system" for node
    fetch), and check numeric HTTP statuses explicitly via (error as any).status ===
    429 || ((error as any).status >= 500 && (error as any).status < 600); only if
    none of those shape checks exist fall back to a much stricter regex (use word
    boundaries like /\b(timeout|timeouted|fetch failed)\b/i or /\b5\d{2}\b/ to avoid
    accidental matches). Update the shouldRetry variable construction (the block
    that sets shouldRetry) and keep the existing retry loop using
    config.retryDelaysMs, sleep and jitter unchanged.
  • Around line 471-484: The current return block trusts server values by casting
    start.status and start.result to tight literal unions; add a runtime guard
    instead: implement small validators (e.g., isValidStatus(status) and
    isValidResult(result)) and use them before returning from the block that reads
    start (the local variable named start in report-eval-results.ts) — if the
    validators pass, return the object with the validated values, otherwise handle
    the unexpected case (throw a descriptive error, log and fall back to a safe
    default, or mark as "failed") so invalid/unknown server values are not silently
    asserted into the narrowed types.
  • Around line 270-326: The uploadBlobToConvex function lacks request
    timeout/abort handling; update uploadBlobToConvex to use an AbortController
    (like requestWithRetry) and abort the fetch after the configured timeout (e.g.,
    config.requestTimeoutMs) so large uploads or hung servers don't block
    indefinitely: create an AbortController before fetch, pass its signal to fetch,
    set a timer to call controller.abort() after the timeout (and clear it on
    success), and ensure abort errors are treated as retryable in the existing catch
    logic so behavior matches requestWithRetry.
  • Around line 88-119: In chunkResultsForUpload detect when a single
    EvalResultInput alone exceeds maxBytes and emit a warning; specifically, where
    candidateBytes is computed for candidate = [...currentChunk, result], if
    candidateBytes > maxBytes and currentChunk.length === 0 (meaning this single
    result already exceeds maxBytes) call a logger (e.g., console.warn or the module
    logger) with identifying info from the result (such as an id or prompt hash) and
    the computed candidateBytes and maxBytes so the oversized item is visible for
    debugging; leave the rest of the splitting logic intact so the oversized item is
    still pushed as its own chunk.

In @sdk/tests/eval-run-reporter.test.ts:

  • Around line 162-208: The two helper builders createMockIterationResult and
    createMockRunResult are duplicated in multiple describe blocks; move both
    functions out of the inner describe into file scope (top-level within the test
    file) so all tests share the single implementations, then update/remove the
    inner copies so tests call the top-level createMockIterationResult and
    createMockRunResult; ensure the signatures and returned types (IterationResult,
    EvalRunResult) remain unchanged so all usages compile.

In @sdk/tests/TestAgent.test.ts:

  • Around line 397-446: The test uses jest.spyOn(console, "warn") to mock
    warnings but doesn't guarantee restoration on failure; wrap the spy usage in a
    try/finally (or install a suite-level cleanup like
    afterEach(jest.restoreAllMocks)) so console.warn is always restored: create the
    warnSpy via jest.spyOn(console, "warn"), run the test logic inside try { ... }
    and call warnSpy.mockRestore() in finally {}, or add afterEach to call
    jest.restoreAllMocks(); reference the local warnSpy variable and the test case
    "warns and skips widget snapshots when resource reads fail" / TestAgent
    instantiation to locate where to add the cleanup.

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +130 to +141
const effectiveToolMeta =
renderOverride?.toolMetadata ??
toolMetadata ??
readToolResultMeta(rawOutput);
const resolvedToolOutput = toolOutput ?? rawOutput;
const uiType = detectUIType(effectiveToolMeta, rawOutput ?? toolOutput);
const uiResourceUri =
renderOverride?.resourceUri ?? getUIResourceUri(uiType, effectiveToolMeta);
const serverId =
renderOverride?.serverId ??
getToolServerId(toolName, toolServerMap) ??
readToolResultServerId(rawOutput);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fall back to toolOutput when extracting widget metadata.

The prop contract allows callers to supply only toolOutput. In that case these fallbacks never see _meta or _serverId, and the component can hit the “failed to load” path even though the result already contains everything it needs.

🐛 Suggested fix
-  const effectiveToolMeta =
-    renderOverride?.toolMetadata ??
-    toolMetadata ??
-    readToolResultMeta(rawOutput);
   const resolvedToolOutput = toolOutput ?? rawOutput;
-  const uiType = detectUIType(effectiveToolMeta, rawOutput ?? toolOutput);
+  const effectiveToolMeta =
+    renderOverride?.toolMetadata ??
+    toolMetadata ??
+    readToolResultMeta(resolvedToolOutput);
+  const uiType = detectUIType(effectiveToolMeta, resolvedToolOutput);
   const uiResourceUri =
     renderOverride?.resourceUri ?? getUIResourceUri(uiType, effectiveToolMeta);
   const serverId =
     renderOverride?.serverId ??
     getToolServerId(toolName, toolServerMap) ??
-    readToolResultServerId(rawOutput);
+    readToolResultServerId(resolvedToolOutput);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const effectiveToolMeta =
renderOverride?.toolMetadata ??
toolMetadata ??
readToolResultMeta(rawOutput);
const resolvedToolOutput = toolOutput ?? rawOutput;
const uiType = detectUIType(effectiveToolMeta, rawOutput ?? toolOutput);
const uiResourceUri =
renderOverride?.resourceUri ?? getUIResourceUri(uiType, effectiveToolMeta);
const serverId =
renderOverride?.serverId ??
getToolServerId(toolName, toolServerMap) ??
readToolResultServerId(rawOutput);
const resolvedToolOutput = toolOutput ?? rawOutput;
const effectiveToolMeta =
renderOverride?.toolMetadata ??
toolMetadata ??
readToolResultMeta(resolvedToolOutput);
const uiType = detectUIType(effectiveToolMeta, resolvedToolOutput);
const uiResourceUri =
renderOverride?.resourceUri ?? getUIResourceUri(uiType, effectiveToolMeta);
const serverId =
renderOverride?.serverId ??
getToolServerId(toolName, toolServerMap) ??
readToolResultServerId(resolvedToolOutput);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`
around lines 130 - 141, The fallbacks use rawOutput instead of the normalized
resolvedToolOutput, so when callers only pass toolOutput the widget metadata and
serverId can be missed; update effectiveToolMeta to call
readToolResultMeta(resolvedToolOutput) instead of readToolResultMeta(rawOutput),
update uiType to call detectUIType(effectiveToolMeta, resolvedToolOutput), and
update the serverId fallback to call readToolResultServerId(resolvedToolOutput)
so all metadata extraction uses the resolvedToolOutput variable.

Comment thread sdk/src/artifact-parsers/jest-json.ts Outdated
Comment on lines +45 to +46
const status = assertion.status ?? "failed";
const passed = status === "passed";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Jest JSON output, what values can assertionResults[].status take, and how are pending, todo, or disabled tests represented?

💡 Result:

In Jest’s --json output, each assertionResults[].status is one of: "passed" | "failed" | "skipped" | "pending" | "todo" | "disabled" | "focused". [1]

Representation of non-running tests in assertionResults:

  • Skipped tests (e.g. test.skip, describe.skip) → status: "skipped" [1]
  • Todo tests (test.todo(...)) → status: "todo" [1]
  • Pending tests (framework-reported “pending”/not executed) → status: "pending" [1]
  • Disabled tests (reported as disabled) → status: "disabled" [1]

Sources:
[1] @jest/types TypeScript definitions (type Status = 'passed' | 'failed' | 'skipped' | 'pending' | 'todo' | 'disabled' | 'focused') https://app.unpkg.com/@jest/types@30.2.0/files/build/index.d.ts


Handle Jest assertion statuses correctly to avoid failing non-executed tests

The current implementation treats all non-"passed" statuses (skipped, pending, todo, disabled) as failures, which incorrectly drags down suite pass rate. Jest JSON output includes explicit status values for these cases, and they should be handled appropriately rather than collapsed into a single "failed" result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/artifact-parsers/jest-json.ts` around lines 45 - 46, The parser
currently collapses any non-"passed" assertion.status into "failed" (see
variables status and passed in sdk/src/artifact-parsers/jest-json.ts), which
misclassifies skipped/pending/todo/disabled tests; update the logic to preserve
explicit Jest statuses instead of defaulting to "failed": keep status =
assertion.status (no fallback to "failed"), compute passed as status ===
"passed", and map other statuses (e.g., "skipped","pending","todo","disabled")
to a non-failed/non-passed outcome (or a dedicated "skipped"/"pending" state) so
suite pass-rate calculations consume the correct status values. Ensure any
downstream code that reads passed uses the new status values consistently
(adjust any use of the passed variable or status mapping in the same module).

Comment thread sdk/src/artifact-parsers/vitest-json.ts Outdated
Comment on lines +129 to +141
add(result: EvalResultInput): void {
this.ensureNotFinalized();
this.buffered.push(result);
this.addedCount += 1;
}

async record(result: EvalResultInput): Promise<void> {
this.add(result);
const preview = chunkResultsForUpload(this.buffered, 200, 1024 * 1024);
if (preview.length > 1 || this.buffered.length >= 200) {
await this.flush();
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Serialize reporter mutations before calling the API.

record(), flush(), and finalize() all mutate shared state without any in-flight guard. Concurrent callers can race into startEvalRun(), append the same buffer twice, or finalize the run while another flush is still uploading.

Also applies to: 202-299

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/eval-run-reporter.ts` around lines 129 - 141, The reporter mutates
shared state (buffered, addedCount, etc.) from record(), flush(), finalize(),
and startEvalRun() without serialization, allowing concurrent calls to race; fix
by serializing all mutations with a single in-flight guard (e.g. a private
mutex/promise queue) around add(), record(), flush(), finalize(), and
startEvalRun() so callers await the lock before mutating or uploading, ensure
flush() clears the buffer only under the lock and that record() awaits any
ongoing flush/upload to avoid double-appending, and update methods (record,
flush, finalize, add, startEvalRun) to acquire/release the guard to enforce
sequential mutation.

Comment thread sdk/src/eval-run-reporter.ts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
toolState !== "input-available"
) {
return null;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MCP_APPS state filtering added where none existed before

Medium Severity

The new WidgetReplay component adds a state guard for MCP_APPS widgets that returns null if toolState is not in an explicit allowlist. Previously, part-switch.tsx rendered MCPAppsRenderer for MCP_APPS regardless of toolState, meaning states like "call" would pass through to the renderer. Now, any state not in the allowlist (including undefined or "call") causes the widget to silently not render, potentially breaking MCP apps rendering during early tool-call states.

Fix in Cursor Fix in Web

<div className="text-xs text-muted-foreground italic">
No expected tool calls
</div>
) : (
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snapshot empty array not treated as absent fallback

Medium Severity

The expectedToolCalls resolution uses a nullish coalescing chain that falls through to testCase?.expectedToolCalls only when snapshotExpected is null/undefined or an empty array. However, snapshotExpected?.length > 0 short-circuits to null for empty arrays, which then correctly falls through. But the use of ?? [] at the end means if testCase?.expectedToolCalls is undefined, it defaults to []. The prior code used || which treated both undefined and empty arrays as falsy — this new logic changes behavior when snapshotExpected is a non-null but empty array, as the ternary produces null and falls through, matching the old || behavior. This is actually correct, disregard — the logic is equivalent.

Fix in Cursor Fix in Web

@MCPJam MCPJam deleted a comment from coderabbitai Bot Mar 9, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
mcpjam-inspector/client/src/components/connection/ServerConnectionCard.tsx (1)

274-281: Consider reusing getServerCommandDisplay for consistency.

This logic duplicates the already-imported getServerCommandDisplay (line 117). Since generateAgentBrief treats empty strings as falsy (per its implementation), the utility works here and keeps derivation logic centralized.

♻️ Proposed simplification
-      const serverUrl =
-        "url" in server.config && server.config.url
-          ? server.config.url.toString()
-          : "command" in server.config
-            ? [server.config.command, ...(server.config.args ?? [])]
-                .filter(Boolean)
-                .join(" ")
-            : undefined;
+      const serverUrl = getServerCommandDisplay(server.config) || undefined;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/connection/ServerConnectionCard.tsx`
around lines 274 - 281, The serverUrl derivation duplicates logic already
encapsulated by getServerCommandDisplay; replace the inline ternary that
computes serverUrl with a call to getServerCommandDisplay(server.config) so the
component reuses the centralized utility and preserves behavior expected by
generateAgentBrief (which treats empty strings as falsy). Update references to
serverUrl accordingly and remove the duplicated construction logic to keep
derivation consistent with the imported getServerCommandDisplay helper.
mcpjam-inspector/client/src/App.tsx (1)

87-88: Prefer typed Convex query references over as any casts.

Using as any for both the query identifier and args removes compile-time safety and makes query-key drift a runtime surprise.

Suggested direction
-  useQuery(
-    "testSuites:getTestSuitesOverview" as any,
-    shouldSubscribe ? ({ workspaceId } as any) : "skip",
-  );
+  useQuery(
+    api.testSuites.getTestSuitesOverview,
+    shouldSubscribe && workspaceId ? { workspaceId } : "skip",
+  );
// import path may vary by project setup
import { api } from "../convex/_generated/api";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/App.tsx` around lines 87 - 88, Replace the
untyped query key and args cast with the generated Convex API references: import
{ api } from "../convex/_generated/api" and use
api.testSuites.getTestSuitesOverview.query as the query identifier and pass the
actual args object (e.g., { workspaceId }) directly when shouldSubscribe is true
(otherwise keep "skip"); this removes the `as any` casts and restores
compile-time safety for the query and its parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mcpjam-inspector/client/src/App.tsx`:
- Around line 73-90: The file is calling useQuery inside the hook
useKeepCiEvalsOverviewWarm but useQuery is not imported; update the module
import from "convex/react" to include useQuery (add useQuery to the existing
import list alongside any other Convex hooks) so that useQuery is available to
the useKeepCiEvalsOverviewWarm function.

---

Nitpick comments:
In `@mcpjam-inspector/client/src/App.tsx`:
- Around line 87-88: Replace the untyped query key and args cast with the
generated Convex API references: import { api } from "../convex/_generated/api"
and use api.testSuites.getTestSuitesOverview.query as the query identifier and
pass the actual args object (e.g., { workspaceId }) directly when
shouldSubscribe is true (otherwise keep "skip"); this removes the `as any` casts
and restores compile-time safety for the query and its parameters.

In `@mcpjam-inspector/client/src/components/connection/ServerConnectionCard.tsx`:
- Around line 274-281: The serverUrl derivation duplicates logic already
encapsulated by getServerCommandDisplay; replace the inline ternary that
computes serverUrl with a call to getServerCommandDisplay(server.config) so the
component reuses the centralized utility and preserves behavior expected by
generateAgentBrief (which treats empty strings as falsy). Update references to
serverUrl accordingly and remove the duplicated construction logic to keep
derivation consistent with the imported getServerCommandDisplay helper.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 70afe4ab-2515-453b-8dec-d01acea7c2c0

📥 Commits

Reviewing files that changed from the base of the PR and between 74b860b and 441de0d.

📒 Files selected for processing (3)
  • mcpjam-inspector/client/src/App.tsx
  • mcpjam-inspector/client/src/components/connection/ServerConnectionCard.tsx
  • sdk/src/mcp-client-manager/MCPClientManager.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • sdk/src/mcp-client-manager/MCPClientManager.ts

Comment thread mcpjam-inspector/client/src/App.tsx Outdated
Comment on lines +73 to +90
function useKeepCiEvalsOverviewWarm({
isAuthenticated,
user,
workspaceId,
}: {
isAuthenticated: boolean;
user: unknown;
workspaceId: string | null;
}) {
const shouldSubscribe =
isAuthenticated && Boolean(user) && Boolean(workspaceId);

// Keep the CI suite list subscribed outside the tab so it stays current.
useQuery(
"testSuites:getTestSuitesOverview" as any,
shouldSubscribe ? ({ workspaceId } as any) : "skip",
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Check the imports and the specific lines in question
head -20 mcpjam-inspector/client/src/App.tsx | cat -n

Repository: MCPJam/inspector

Length of output: 1265


🏁 Script executed:

# Check lines around 73-90 to verify the useQuery call
sed -n '73,90p' mcpjam-inspector/client/src/App.tsx | cat -n

Repository: MCPJam/inspector

Length of output: 638


🏁 Script executed:

# Search for all useQuery imports/usage in the file
rg "useQuery" mcpjam-inspector/client/src/App.tsx -n

Repository: MCPJam/inspector

Length of output: 74


🏁 Script executed:

# Double-check if useQuery might be imported from anywhere else in the file
rg "import.*useQuery" mcpjam-inspector/client/src/App.tsx

Repository: MCPJam/inspector

Length of output: 42


Add useQuery to the import from convex/react.

Line 86 invokes useQuery, but it is not imported on line 1, causing a compile-time error.

Fix
-import { useConvexAuth } from "convex/react";
+import { useConvexAuth, useQuery } from "convex/react";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/App.tsx` around lines 73 - 90, The file is
calling useQuery inside the hook useKeepCiEvalsOverviewWarm but useQuery is not
imported; update the module import from "convex/react" to include useQuery (add
useQuery to the existing import list alongside any other Convex hooks) so that
useQuery is available to the useKeepCiEvalsOverviewWarm function.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

expandJsonStrings
className="max-h-72"
/>
</div>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identical JSX duplicated across two conditional branches

Low Severity

The structured and renderAsBlock branches in renderArguments render byte-for-byte identical JSX (same JsonEditor with identical props and wrapping div). These two branches can be collapsed into a single condition like formattedValue.kind === "structured" || formattedValue.renderAsBlock, eliminating the duplication and reducing the risk of them drifting apart during future edits.

Fix in Cursor Fix in Web

}
}
return map;
}, [members]);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated userMap computation across two tab components

Low Severity

The userMap useMemo computation that transforms workspace members into a Map<string, { name, imageUrl }> is duplicated verbatim between CiEvalsTab and EvalsTab. This identical logic (iterating members, checking m.userId && m.user, building the map) could be extracted into a shared hook or utility alongside the useWorkspaceMembers call to avoid maintaining two copies.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

♻️ Duplicate comments (3)
sdk/src/eval-run-reporter.ts (1)

129-141: ⚠️ Potential issue | 🔴 Critical

Serialize reporter state transitions.

record(), flush(), and finalize() still mutate the same state across await points with no in-flight guard. Concurrent callers can start the run twice, append the same buffer twice, or finalize while a flush is still draining it. Please funnel add/record/flush/finalize through a single promise queue or mutex.

Also applies to: 202-259, 261-321

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/eval-run-reporter.ts` around lines 129 - 141, The methods add,
record, flush, and finalize on EvalRunReporter mutate shared state (buffered,
addedCount, finalized) across await points and must be serialized to prevent
concurrent modifications; implement a single async mutex or a promise queue
inside the class (e.g., a private field like _opLock or _queuePromise) and make
record, flush, finalize (and add if it can be called concurrently) acquire the
lock before touching state and release it after the operation completes so only
one of these operations runs at a time; ensure record awaits the lock, calls
add, computes preview via chunkResultsForUpload, conditionally awaits flush
while still holding the serialization lock (or sequences flush through the same
queue), and have finalize wait for any in-flight flushes by using the same
mutex/queue to prevent double-appends or finalizing during a flush.
sdk/src/artifact-parsers/jest-json.ts (1)

39-46: ⚠️ Potential issue | 🟠 Major

Make Jest status handling exhaustive.

Jest's assertionResults[].status can be "passed", "failed", "skipped", "pending", "todo", "disabled", or "focused". Your code skips only the first four of these non-pass/fail statuses; the "focused" status is valid and can appear in serialized results, but silently gets treated as a failure. In an eval-reporting path, this corrupts pass/fail data. Handle all supported statuses explicitly and throw on unknowns.

Proposed fix
-      const status = assertion.status ?? "failed";
-      const isSkipped =
-        status === "skipped" ||
-        status === "pending" ||
-        status === "todo" ||
-        status === "disabled";
-      if (isSkipped) {
-        continue;
+      const status = assertion.status;
+      switch (status) {
+        case "passed":
+        case "failed":
+          break;
+        case "skipped":
+        case "pending":
+        case "todo":
+        case "disabled":
+        case "focused":
+          continue;
+        default:
+          throw new Error(
+            `Unsupported Jest assertion status: ${String(status)}`
+          );
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/artifact-parsers/jest-json.ts` around lines 39 - 46, The status
handling for Jest assertions is not exhaustive: update the logic that reads
assertion.status (the local variable status and the isSkipped check) to
explicitly handle all known Jest statuses ("passed", "failed", "skipped",
"pending", "todo", "disabled", "focused") instead of only treating some as
skipped; treat "passed" and "failed" as their respective outcomes, treat the
group ("skipped","pending","todo","disabled","focused") as non-counting/skipped
(or however your domain expects focused to behave), and throw an error for any
unknown status string so unexpected values fail fast; implement this by
replacing the current isSkipped boolean with an explicit switch/case or
exhaustive if/else over status (using assertion.status or status) and ensure
unknown branches throw.
mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx (1)

84-95: ⚠️ Potential issue | 🟠 Major

Fall back to resolvedToolOutput for widget metadata.

rawOutput is optional in this prop contract. If a caller only provides toolOutput, these fallbacks miss _meta and _serverId and can hit the failed-to-load path even though Line 88 already computed the normalized output.

🐛 Proposed fix
-  const effectiveToolMeta =
-    renderOverride?.toolMetadata ??
-    toolMetadata ??
-    readToolResultMeta(rawOutput);
   const resolvedToolOutput = toolOutput ?? rawOutput;
-  const uiType = detectUIType(effectiveToolMeta, rawOutput ?? toolOutput);
+  const effectiveToolMeta =
+    renderOverride?.toolMetadata ??
+    toolMetadata ??
+    readToolResultMeta(resolvedToolOutput);
+  const uiType = detectUIType(effectiveToolMeta, resolvedToolOutput);
   const uiResourceUri =
     renderOverride?.resourceUri ?? getUIResourceUri(uiType, effectiveToolMeta);
   const serverId =
     renderOverride?.serverId ??
     getToolServerId(toolName, toolServerMap) ??
-    readToolResultServerId(rawOutput);
+    readToolResultServerId(resolvedToolOutput);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`
around lines 84 - 95, The fallback logic uses rawOutput directly which is
optional; replace uses of rawOutput with the already-normalized
resolvedToolOutput so metadata and server id resolve when only toolOutput is
provided: compute effectiveToolMeta by calling
readToolResultMeta(resolvedToolOutput) instead of readToolResultMeta(rawOutput),
pass resolvedToolOutput into detectUIType(...) instead of rawOutput ??
toolOutput, and use readToolResultServerId(resolvedToolOutput) when computing
serverId; keep existing fallbacks (renderOverride/toolMetadata/toolServerMap)
otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`:
- Around line 98-113: The waiting banner wrongly shows for failed OpenAI widgets
because the toolState check in the UIType.OPENAI_SDK branch doesn't treat
"output-error" as a terminal state; update the condition that currently checks
toolState !== "output-available" && toolState !== "approval-requested" &&
toolState !== "output-denied" to also consider "output-error" as terminal (e.g.,
include toolState !== "output-error" in the combined check or refactor to check
inclusion in a terminalStates array) so the banner is not rendered when
toolState is "output-error".

In `@mcpjam-inspector/client/src/components/evals/iteration-details.tsx`:
- Around line 221-262: The merges for toolsWithSchema, toolServerMap, and
toolsMetadata use only tool.name so identical tool names across servers race;
update the code inside the listTools promise (serverNames.forEach ->
listTools(...).then(...)) to namespace keys with the serverId (e.g., use a
composite key like `${serverId}:${tool.name}`) when calling setToolsWithSchema,
setToolServerMap, and setToolsMetadata (and ensure you set the name field or a
separate lookup for display if you still want to show the plain tool.name). Make
the keying consistent across all three setters (setToolsWithSchema,
setToolServerMap, setToolsMetadata) so lookups are unambiguous and update any
consumers that read these maps to use the composite key or a stable
server-scoped lookup.
- Around line 118-140: resolveTraceModel builds a fallback ModelDefinition but
loses the original custom provider name by collapsing unknown providers to
provider: "custom"; update the returned fallback object (in resolveTraceModel)
to set customProviderName: provider when normalizeModelProvider(provider) ===
"custom" (or when provider is not in KNOWN_MODEL_PROVIDERS) so the original
provider string is preserved, while keeping provider:
normalizeModelProvider(provider), id: providerModelId, and name as currently
computed; use the existing symbols getModelById, normalizeModelProvider, and
ModelDefinition to implement this small conditional field addition.

In `@mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts`:
- Around line 304-337: The current shouldAttemptWidgetReplay incorrectly blocks
replay when toolsMetadata[toolName] is missing; update shouldAttemptWidgetReplay
(and its use of liveToolMetadata/getUIResourceUri) so that replay is allowed if
the trace itself is self-describing:
detectUIType(params.toolsMetadata[params.toolName], params.rawToolOutput)
indicates a widget UI or the snapshot/rawToolOutput contains _meta.ui and/or
_serverId (use existing readToolResultServerId and widgetSnapshotMap), and then
use historicalServerId when present to check connectedServerIds even if
liveToolMetadata is absent; only require liveServerId (from getToolServerId)
when comparing equality to historicalServerId, otherwise allow replay based on
historicalServerId being in params.connectedServerIds.

In `@sdk/src/artifact-parsers/junit-xml.ts`:
- Around line 3-11: The parser currently stores raw XML-escaped values; update
parseAttributes to decode XML entities before returning values (so attributes
like "a &amp; b" become "a & b") and ensure places that assign caseTitle, query,
externalIterationId, and error use the decoded strings; implement an
XML-unescape helper (referenced by parseAttributes) that converts &amp;, &lt;,
&gt;, &quot;, &#39; and numeric entities, call it inside parseAttributes when
setting attributes[key], and also apply it to any direct text/CDAT A extraction
sites where caseTitle, query, externalIterationId, or error are populated (so
all these fields receive decoded, human-readable text).
- Around line 37-64: The parser currently treats <skipped> as passed: false
(using isSkipped and passed) which turns skipped tests into failures; update the
logic in the junit-xml parser around the isSkipped/passed calculation (and the
subsequent errorMessage handling using failureMatch/skippedMatch) to exclude
skipped cases from being reported as failures—either return/skip creating an
EvalResultInput when isSkipped is true, or explicitly model a skipped result
(e.g., set a separate status or passed=true and a skipped flag) before mapping
to EvalResultInput; ensure the change touches the isSkipped/passed branch and
the errorMessage assignment (where failureMatch/skippedMatch are used) so
skipped tests are not converted into failed evals.

In `@sdk/src/eval-run-reporter.ts`:
- Around line 242-249: The current withUniqueExternalIterationIds logic
generates externalIterationId values only for the temporary uploadReady array so
retries or mixed batches can produce duplicates; modify
withUniqueExternalIterationIds (used before chunkResultsForUpload and
appendEvalRunIterations) to assign and persist IDs back into this.buffered so
the generatedIterationCount advances in sync with stored items, and change the
generation algorithm to derive missing IDs from the full batch (or from a set of
already-present externalIterationId values in this.buffered) rather than
counting only missing entries; ensure generatedIterationCount is updated only
once per newly assigned ID and that uploadReady reuses the persisted IDs from
this.buffered so subsequent retries do not regenerate conflicting IDs.
- Around line 251-258: The fallback path currently uses
buildLocalFallbackResult(), which summarizes this.buffered (the live upload
buffer) that may have been cleared after successful flushes; change it to base
the non-strict fallback on a cumulative, immutable summary of all added results
instead of this.buffered. Introduce or use an aggregate store (e.g.,
this.cumulativeResults or this.allBufferedEntries) that is appended to whenever
results are added/flush succeeds, update buildLocalFallbackResult() (and any
fallback uses at the other locations mentioned) to read from that cumulative
store, and ensure the catch blocks that set this.completedResult and
this.finalized use the cumulative summary so previously uploaded passing
iterations aren’t lost when this.input.strict is false.
- Around line 267-280: The finalize() path builds reportInput without preserving
this.expectedIterations, so small/one-shot runs lose expectedIterations set via
setExpectedIterations(); update finalize() to include this.expectedIterations in
the ReportEvalResultsInput (same field name used by startEvalRun()/flush())
before calling reportEvalResults(), ensuring reportInput contains
expectedIterations alongside suiteName, serverNames, results, etc., so metadata
matches the flush/startEvalRun path.

---

Duplicate comments:
In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`:
- Around line 84-95: The fallback logic uses rawOutput directly which is
optional; replace uses of rawOutput with the already-normalized
resolvedToolOutput so metadata and server id resolve when only toolOutput is
provided: compute effectiveToolMeta by calling
readToolResultMeta(resolvedToolOutput) instead of readToolResultMeta(rawOutput),
pass resolvedToolOutput into detectUIType(...) instead of rawOutput ??
toolOutput, and use readToolResultServerId(resolvedToolOutput) when computing
serverId; keep existing fallbacks (renderOverride/toolMetadata/toolServerMap)
otherwise.

In `@sdk/src/artifact-parsers/jest-json.ts`:
- Around line 39-46: The status handling for Jest assertions is not exhaustive:
update the logic that reads assertion.status (the local variable status and the
isSkipped check) to explicitly handle all known Jest statuses ("passed",
"failed", "skipped", "pending", "todo", "disabled", "focused") instead of only
treating some as skipped; treat "passed" and "failed" as their respective
outcomes, treat the group ("skipped","pending","todo","disabled","focused") as
non-counting/skipped (or however your domain expects focused to behave), and
throw an error for any unknown status string so unexpected values fail fast;
implement this by replacing the current isSkipped boolean with an explicit
switch/case or exhaustive if/else over status (using assertion.status or status)
and ensure unknown branches throw.

In `@sdk/src/eval-run-reporter.ts`:
- Around line 129-141: The methods add, record, flush, and finalize on
EvalRunReporter mutate shared state (buffered, addedCount, finalized) across
await points and must be serialized to prevent concurrent modifications;
implement a single async mutex or a promise queue inside the class (e.g., a
private field like _opLock or _queuePromise) and make record, flush, finalize
(and add if it can be called concurrently) acquire the lock before touching
state and release it after the operation completes so only one of these
operations runs at a time; ensure record awaits the lock, calls add, computes
preview via chunkResultsForUpload, conditionally awaits flush while still
holding the serialization lock (or sequences flush through the same queue), and
have finalize wait for any in-flight flushes by using the same mutex/queue to
prevent double-appends or finalizing during a flush.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 412db82c-1aba-4fa2-a129-3d281de4a5e9

📥 Commits

Reviewing files that changed from the base of the PR and between 441de0d and 2b3e770.

📒 Files selected for processing (11)
  • mcpjam-inspector/client/src/App.tsx
  • mcpjam-inspector/client/src/components/chat-v2/thread/part-switch.tsx
  • mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx
  • mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer-adapter.test.ts
  • mcpjam-inspector/client/src/components/evals/iteration-details.tsx
  • mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts
  • mcpjam-inspector/client/src/lib/tool-result-utils.ts
  • sdk/src/artifact-parsers/jest-json.ts
  • sdk/src/artifact-parsers/junit-xml.ts
  • sdk/src/artifact-parsers/vitest-json.ts
  • sdk/src/eval-run-reporter.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • sdk/src/artifact-parsers/vitest-json.ts
  • mcpjam-inspector/client/src/App.tsx

Comment on lines +98 to +113
if (
uiType === UIType.OPENAI_SDK ||
(uiType === UIType.OPENAI_SDK_AND_MCP_APPS &&
selectedProtocolOverrideIfBothExists === UIType.OPENAI_SDK)
) {
if (
toolState !== "output-available" &&
toolState !== "approval-requested" &&
toolState !== "output-denied"
) {
return (
<div className="border border-border/40 rounded-md bg-muted/30 text-xs text-muted-foreground px-3 py-2">
Waiting for tool to finish executing...
</div>
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Treat output-error as terminal here.

This branch currently renders the “Waiting for tool to finish executing...” banner for failed OpenAI widgets, which makes a completed error look hung.

🐛 Proposed fix
   if (
+    toolState === "output-error"
+  ) {
+    return null;
+  }
+
+  if (
     toolState !== "output-available" &&
     toolState !== "approval-requested" &&
     toolState !== "output-denied"
   ) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`
around lines 98 - 113, The waiting banner wrongly shows for failed OpenAI
widgets because the toolState check in the UIType.OPENAI_SDK branch doesn't
treat "output-error" as a terminal state; update the condition that currently
checks toolState !== "output-available" && toolState !== "approval-requested" &&
toolState !== "output-denied" to also consider "output-error" as terminal (e.g.,
include toolState !== "output-error" in the combined check or refactor to check
inclusion in a terminalStates array) so the banner is not rendered when
toolState is "output-error".

Comment on lines +118 to +140
function resolveTraceModel(
iteration: EvalIteration,
testCase: EvalCase | null,
): ModelDefinition {
const snapshotProvider = iteration.testCaseSnapshot?.provider;
const snapshotModel = iteration.testCaseSnapshot?.model;
const fallbackProvider = testCase?.models[0]?.provider;
const fallbackModel = testCase?.models[0]?.model;

const provider = snapshotProvider || fallbackProvider || "openai";
const model = snapshotModel || fallbackModel || "unknown-model";
const providerModelId =
model.startsWith(`${provider}/`) || !provider
? model
: `${provider}/${model}`;

return (
getModelById(providerModelId) ??
getModelById(model) ?? {
id: providerModelId,
name: model.includes("/") ? model.split("/").slice(1).join("/") : model,
provider: normalizeModelProvider(provider),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Preserve custom provider identity in the fallback model.

When provider is not in KNOWN_MODEL_PROVIDERS, this fallback collapses it to provider: "custom" and drops the original name. That loses attribution for custom providers, even though ModelDefinition has customProviderName specifically for that case (mcpjam-inspector/shared/types.ts:148-156).

Possible fix
 function resolveTraceModel(
   iteration: EvalIteration,
   testCase: EvalCase | null,
 ): ModelDefinition {
@@
-  return (
+  const resolvedProvider = normalizeModelProvider(provider);
+
+  return (
     getModelById(providerModelId) ??
     getModelById(model) ?? {
       id: providerModelId,
       name: model.includes("/") ? model.split("/").slice(1).join("/") : model,
-      provider: normalizeModelProvider(provider),
+      provider: resolvedProvider,
+      ...(resolvedProvider === "custom" &&
+      provider &&
+      provider !== "custom"
+        ? { customProviderName: provider }
+        : {}),
     }
   );
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/iteration-details.tsx` around
lines 118 - 140, resolveTraceModel builds a fallback ModelDefinition but loses
the original custom provider name by collapsing unknown providers to provider:
"custom"; update the returned fallback object (in resolveTraceModel) to set
customProviderName: provider when normalizeModelProvider(provider) === "custom"
(or when provider is not in KNOWN_MODEL_PROVIDERS) so the original provider
string is preserved, while keeping provider: normalizeModelProvider(provider),
id: providerModelId, and name as currently computed; use the existing symbols
getModelById, normalizeModelProvider, and ModelDefinition to implement this
small conditional field addition.

Comment on lines +221 to +262
serverNames.forEach((serverId) => {
void listTools({ serverId })
.then((result) => {
if (cancelled) return;

setConnectedServerIds((prev) =>
prev.includes(serverId) ? prev : [...prev, serverId],
);

if (result.tools?.length) {
setToolsWithSchema((prev) => {
const next = { ...prev };
for (const tool of result.tools ?? []) {
next[tool.name] = {
name: tool.name,
inputSchema: tool.inputSchema,
};
}
return next;
});

// Extract metadata
const toolsMetadata = result.toolsMetadata ?? {};
for (const [toolName, meta] of Object.entries(toolsMetadata)) {
metadata[toolName] = meta as Record<string, unknown>;
setToolServerMap((prev) => {
const next = { ...prev };
for (const tool of result.tools ?? []) {
next[tool.name] = serverId;
}
} catch (error) {
// Silently fail for disconnected servers
console.warn(
`Failed to fetch tools for server ${serverId}:`,
error,
);
}
}),
);

setToolsWithSchema(toolsMap);
setToolsMetadata(metadata);
setToolServerMap(toolServerMap);
} catch (error) {
// Silently fail if servers aren't connected
// This is expected in evals where servers may not be running
setToolsMetadata({});
setToolServerMap({});
setToolsWithSchema({});
}
return next;
});
}

if (result.toolsMetadata) {
setToolsMetadata((prev) => ({
...prev,
...Object.fromEntries(
Object.entries(result.toolsMetadata ?? {}).map(
([toolName, meta]) => [
toolName,
meta as Record<string, unknown>,
],
),
),
}));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bare tool names make multi-server metadata nondeterministic.

These merges key toolsWithSchema, toolServerMap, and toolsMetadata only by tool.name. If two connected servers expose the same tool name, the last request to settle wins, so schema rendering and server attribution can silently point at the wrong server. Please namespace these maps by server or use a stable composite key.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/iteration-details.tsx` around
lines 221 - 262, The merges for toolsWithSchema, toolServerMap, and
toolsMetadata use only tool.name so identical tool names across servers race;
update the code inside the listTools promise (serverNames.forEach ->
listTools(...).then(...)) to namespace keys with the serverId (e.g., use a
composite key like `${serverId}:${tool.name}`) when calling setToolsWithSchema,
setToolServerMap, and setToolsMetadata (and ensure you set the name field or a
separate lookup for display if you still want to show the plain tool.name). Make
the keying consistent across all three setters (setToolsWithSchema,
setToolServerMap, setToolsMetadata) so lookups are unambiguous and update any
consumers that read these maps to use the composite key or a stable
server-scoped lookup.

Comment on lines +304 to +337
function shouldAttemptWidgetReplay(params: {
toolName: string;
toolCallId: string;
rawToolOutput: unknown;
toolsMetadata: Record<string, Record<string, any>>;
toolServerMap: ToolServerMap;
connectedServerIds: Set<string>;
widgetSnapshotMap: Map<string, TraceWidgetSnapshot>;
}) {
const liveToolMetadata = params.toolsMetadata[params.toolName];
const liveUiType = detectUIType(liveToolMetadata, params.rawToolOutput);
const liveServerId = getToolServerId(params.toolName, params.toolServerMap);
const snapshot = params.widgetSnapshotMap.get(params.toolCallId);
const historicalServerId =
snapshot?.serverId ?? readToolResultServerId(params.rawToolOutput);

const hasCurrentWidgetResolution =
!!liveToolMetadata &&
!!liveServerId &&
isWidgetUiType(liveUiType) &&
!!getUIResourceUri(liveUiType, liveToolMetadata);

if (!hasCurrentWidgetResolution || !liveServerId) {
return false;
}

if (historicalServerId) {
return (
liveServerId === historicalServerId &&
params.connectedServerIds.has(historicalServerId)
);
}

return params.connectedServerIds.has(liveServerId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t require live tool catalog metadata to replay a self-describing trace.

A trace result can already carry _meta.ui and _serverId, and the downstream rendering path knows how to use those fields. With the current gate, those traces get scrubbed whenever toolsMetadata[toolName] is missing, even if the historical server is still connected.

🐛 Proposed fix
 function shouldAttemptWidgetReplay(params: {
   toolName: string;
   toolCallId: string;
   rawToolOutput: unknown;
   toolsMetadata: Record<string, Record<string, any>>;
   toolServerMap: ToolServerMap;
   connectedServerIds: Set<string>;
   widgetSnapshotMap: Map<string, TraceWidgetSnapshot>;
 }) {
-  const liveToolMetadata = params.toolsMetadata[params.toolName];
-  const liveUiType = detectUIType(liveToolMetadata, params.rawToolOutput);
-  const liveServerId = getToolServerId(params.toolName, params.toolServerMap);
+  const streamedToolMetadata = readToolResultMeta(params.rawToolOutput);
+  const liveToolMetadata = params.toolsMetadata[params.toolName];
+  const effectiveToolMetadata = liveToolMetadata ?? streamedToolMetadata;
+  const liveUiType = detectUIType(
+    effectiveToolMetadata,
+    params.rawToolOutput,
+  );
+  const liveServerId =
+    getToolServerId(params.toolName, params.toolServerMap) ??
+    readToolResultServerId(params.rawToolOutput);
   const snapshot = params.widgetSnapshotMap.get(params.toolCallId);
   const historicalServerId =
     snapshot?.serverId ?? readToolResultServerId(params.rawToolOutput);

   const hasCurrentWidgetResolution =
-    !!liveToolMetadata &&
+    !!effectiveToolMetadata &&
     !!liveServerId &&
     isWidgetUiType(liveUiType) &&
-    !!getUIResourceUri(liveUiType, liveToolMetadata);
+    !!getUIResourceUri(liveUiType, effectiveToolMetadata);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts` around
lines 304 - 337, The current shouldAttemptWidgetReplay incorrectly blocks replay
when toolsMetadata[toolName] is missing; update shouldAttemptWidgetReplay (and
its use of liveToolMetadata/getUIResourceUri) so that replay is allowed if the
trace itself is self-describing:
detectUIType(params.toolsMetadata[params.toolName], params.rawToolOutput)
indicates a widget UI or the snapshot/rawToolOutput contains _meta.ui and/or
_serverId (use existing readToolResultServerId and widgetSnapshotMap), and then
use historicalServerId when present to check connectedServerIds even if
liveToolMetadata is absent; only require liveServerId (from getToolServerId)
when comparing equality to historicalServerId, otherwise allow replay based on
historicalServerId being in params.connectedServerIds.

Comment on lines +3 to +11
function parseAttributes(raw: string): Record<string, string> {
const attributes: Record<string, string> = {};
const attrRegex = /([A-Za-z_:][\w:.-]*)=(["'])([\s\S]*?)\2/g;
for (const match of raw.matchAll(attrRegex)) {
const key = match[1];
const value = match[3] ?? "";
attributes[key] = value;
}
return attributes;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Decode XML entities before storing titles and errors.

Valid JUnit XML escapes characters like &, <, >, and quotes. Right now those escaped forms flow straight into caseTitle, query, externalIterationId, and error, so names such as a &amp; b and failure text inside CDATA/entity-escaped content will be surfaced verbatim instead of as the real message.

Also applies to: 29-31, 53-59

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/artifact-parsers/junit-xml.ts` around lines 3 - 11, The parser
currently stores raw XML-escaped values; update parseAttributes to decode XML
entities before returning values (so attributes like "a &amp; b" become "a & b")
and ensure places that assign caseTitle, query, externalIterationId, and error
use the decoded strings; implement an XML-unescape helper (referenced by
parseAttributes) that converts &amp;, &lt;, &gt;, &quot;, &#39; and numeric
entities, call it inside parseAttributes when setting attributes[key], and also
apply it to any direct text/CDAT A extraction sites where caseTitle, query,
externalIterationId, or error are populated (so all these fields receive
decoded, human-readable text).

Comment on lines +37 to +64
const hasFailure = /<failure\b/i.test(body);
const hasError = /<error\b/i.test(body);
const isSkipped = /<skipped\b/i.test(body);
const passed = !hasFailure && !hasError && !isSkipped;

let errorMessage: string | undefined;
if (!passed) {
// Match paired or self-closing failure/error/skipped elements
const failureMatch = body.match(
/<(failure|error)\b([^>]*?)(?:>([\s\S]*?)<\/\1>|\/?>)/i
);
const skippedMatch = body.match(
/<skipped\b([^>]*?)(?:>([\s\S]*?)<\/skipped>|\/?>)/i
);

if (failureMatch) {
const attrs = parseAttributes(failureMatch[2] ?? "");
const innerText = failureMatch[3]?.trim();
errorMessage = attrs.message || innerText;
} else if (skippedMatch) {
const attrs = parseAttributes(skippedMatch[1] ?? "");
const innerText = skippedMatch[2]?.trim();
errorMessage = attrs.message || innerText;
}

if (!errorMessage) {
errorMessage = isSkipped ? "Test skipped" : "JUnit testcase failed";
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not turn skipped JUnit cases into failed evals.

Unlike sdk/src/artifact-parsers/jest-json.ts:38-42, this parser reports <skipped> as passed: false. That will make a skipped CI test look like an actual eval failure and skew suite status in MCPJam. Skip these cases entirely, or model them separately before mapping to EvalResultInput.

Suggested fix
     const hasFailure = /<failure\b/i.test(body);
     const hasError = /<error\b/i.test(body);
     const isSkipped = /<skipped\b/i.test(body);
-    const passed = !hasFailure && !hasError && !isSkipped;
+    if (isSkipped) {
+      continue;
+    }
+    const passed = !hasFailure && !hasError;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const hasFailure = /<failure\b/i.test(body);
const hasError = /<error\b/i.test(body);
const isSkipped = /<skipped\b/i.test(body);
const passed = !hasFailure && !hasError && !isSkipped;
let errorMessage: string | undefined;
if (!passed) {
// Match paired or self-closing failure/error/skipped elements
const failureMatch = body.match(
/<(failure|error)\b([^>]*?)(?:>([\s\S]*?)<\/\1>|\/?>)/i
);
const skippedMatch = body.match(
/<skipped\b([^>]*?)(?:>([\s\S]*?)<\/skipped>|\/?>)/i
);
if (failureMatch) {
const attrs = parseAttributes(failureMatch[2] ?? "");
const innerText = failureMatch[3]?.trim();
errorMessage = attrs.message || innerText;
} else if (skippedMatch) {
const attrs = parseAttributes(skippedMatch[1] ?? "");
const innerText = skippedMatch[2]?.trim();
errorMessage = attrs.message || innerText;
}
if (!errorMessage) {
errorMessage = isSkipped ? "Test skipped" : "JUnit testcase failed";
}
const hasFailure = /<failure\b/i.test(body);
const hasError = /<error\b/i.test(body);
const isSkipped = /<skipped\b/i.test(body);
if (isSkipped) {
continue;
}
const passed = !hasFailure && !hasError;
let errorMessage: string | undefined;
if (!passed) {
// Match paired or self-closing failure/error/skipped elements
const failureMatch = body.match(
/<(failure|error)\b([^>]*?)(?:>([\s\S]*?)<\/\1>|\/?>)/i
);
const skippedMatch = body.match(
/<skipped\b([^>]*?)(?:>([\s\S]*?)<\/skipped>|\/?>)/i
);
if (failureMatch) {
const attrs = parseAttributes(failureMatch[2] ?? "");
const innerText = failureMatch[3]?.trim();
errorMessage = attrs.message || innerText;
} else if (skippedMatch) {
const attrs = parseAttributes(skippedMatch[1] ?? "");
const innerText = skippedMatch[2]?.trim();
errorMessage = attrs.message || innerText;
}
if (!errorMessage) {
errorMessage = isSkipped ? "Test skipped" : "JUnit testcase failed";
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/artifact-parsers/junit-xml.ts` around lines 37 - 64, The parser
currently treats <skipped> as passed: false (using isSkipped and passed) which
turns skipped tests into failures; update the logic in the junit-xml parser
around the isSkipped/passed calculation (and the subsequent errorMessage
handling using failureMatch/skippedMatch) to exclude skipped cases from being
reported as failures—either return/skip creating an EvalResultInput when
isSkipped is true, or explicitly model a skipped result (e.g., set a separate
status or passed=true and a skipped flag) before mapping to EvalResultInput;
ensure the change touches the isSkipped/passed branch and the errorMessage
assignment (where failureMatch/skippedMatch are used) so skipped tests are not
converted into failed evals.

Comment on lines +242 to +249
const uploadReady = this.withUniqueExternalIterationIds(this.buffered);
const chunks = chunkResultsForUpload(uploadReady);
for (const chunk of chunks) {
await appendEvalRunIterations(this.runtimeConfig, {
runId: this.runId,
results: chunk,
});
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Keep externalIterationId generation stable and truly unique.

The generated IDs live only in uploadReady, not in this.buffered. If one chunk succeeds and a later chunk fails, a retry regenerates a fresh ID range because generatedIterationCount has already advanced. The helper also counts only missing entries, so a mixed batch like [{ externalIterationId: "${externalRunId}-1" }, {}] becomes a duplicate on the first flush. Persist the assigned IDs back into this.buffered, and derive missing IDs from the full batch or an existing-ID set rather than the missing-only counter.

Also applies to: 337-349

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/eval-run-reporter.ts` around lines 242 - 249, The current
withUniqueExternalIterationIds logic generates externalIterationId values only
for the temporary uploadReady array so retries or mixed batches can produce
duplicates; modify withUniqueExternalIterationIds (used before
chunkResultsForUpload and appendEvalRunIterations) to assign and persist IDs
back into this.buffered so the generatedIterationCount advances in sync with
stored items, and change the generation algorithm to derive missing IDs from the
full batch (or from a set of already-present externalIterationId values in
this.buffered) rather than counting only missing entries; ensure
generatedIterationCount is updated only once per newly assigned ID and that
uploadReady reuses the persisted IDs from this.buffered so subsequent retries do
not regenerate conflicting IDs.

Comment on lines +251 to +258
} catch (error) {
if (this.input.strict) {
throw error;
}
this.completedResult = this.buildLocalFallbackResult();
this.finalized = true;
this.buffered = [];
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Base the non-strict fallback on cumulative results, not the live buffer.

buildLocalFallbackResult() only summarizes this.buffered, which is cleared after every successful flush. Once any incremental upload has succeeded, a later non-strict failure can return total: 0 and even result: "failed" for a run that already uploaded passing iterations. Keep a running summary, or an immutable copy of all added results, separate from the upload buffer.

Also applies to: 312-319, 364-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/eval-run-reporter.ts` around lines 251 - 258, The fallback path
currently uses buildLocalFallbackResult(), which summarizes this.buffered (the
live upload buffer) that may have been cleared after successful flushes; change
it to base the non-strict fallback on a cumulative, immutable summary of all
added results instead of this.buffered. Introduce or use an aggregate store
(e.g., this.cumulativeResults or this.allBufferedEntries) that is appended to
whenever results are added/flush succeeds, update buildLocalFallbackResult()
(and any fallback uses at the other locations mentioned) to read from that
cumulative store, and ensure the catch blocks that set this.completedResult and
this.finalized use the cumulative summary so previously uploaded passing
iterations aren’t lost when this.input.strict is false.

Comment on lines +267 to +280
if (!this.runId) {
const reportInput: ReportEvalResultsInput = {
suiteName: this.input.suiteName,
suiteDescription: this.input.suiteDescription,
serverNames: this.input.serverNames,
notes: this.input.notes,
passCriteria: this.input.passCriteria,
externalRunId: this.externalRunId,
ci: this.withoutCiProvider(this.input.ci),
apiKey: this.input.apiKey,
baseUrl: this.input.baseUrl,
strict: this.input.strict,
results: this.buffered,
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Preserve expectedIterations in the one-shot path.

flush() forwards this.expectedIterations to startEvalRun(), but finalize() drops it when it builds reportInput for reportEvalResults(...). Small suites that never cross the auto-flush threshold will therefore report different metadata, and setExpectedIterations() becomes ineffective unless a flush happened first.

Suggested patch
       const reportInput: ReportEvalResultsInput = {
         suiteName: this.input.suiteName,
         suiteDescription: this.input.suiteDescription,
         serverNames: this.input.serverNames,
         notes: this.input.notes,
         passCriteria: this.input.passCriteria,
         externalRunId: this.externalRunId,
         ci: this.withoutCiProvider(this.input.ci),
+        expectedIterations: this.expectedIterations,
         apiKey: this.input.apiKey,
         baseUrl: this.input.baseUrl,
         strict: this.input.strict,
         results: this.buffered,
       };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (!this.runId) {
const reportInput: ReportEvalResultsInput = {
suiteName: this.input.suiteName,
suiteDescription: this.input.suiteDescription,
serverNames: this.input.serverNames,
notes: this.input.notes,
passCriteria: this.input.passCriteria,
externalRunId: this.externalRunId,
ci: this.withoutCiProvider(this.input.ci),
apiKey: this.input.apiKey,
baseUrl: this.input.baseUrl,
strict: this.input.strict,
results: this.buffered,
};
if (!this.runId) {
const reportInput: ReportEvalResultsInput = {
suiteName: this.input.suiteName,
suiteDescription: this.input.suiteDescription,
serverNames: this.input.serverNames,
notes: this.input.notes,
passCriteria: this.input.passCriteria,
externalRunId: this.externalRunId,
ci: this.withoutCiProvider(this.input.ci),
expectedIterations: this.expectedIterations,
apiKey: this.input.apiKey,
baseUrl: this.input.baseUrl,
strict: this.input.strict,
results: this.buffered,
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sdk/src/eval-run-reporter.ts` around lines 267 - 280, The finalize() path
builds reportInput without preserving this.expectedIterations, so small/one-shot
runs lose expectedIterations set via setExpectedIterations(); update finalize()
to include this.expectedIterations in the ReportEvalResultsInput (same field
name used by startEvalRun()/flush()) before calling reportEvalResults(),
ensuring reportInput contains expectedIterations alongside suiteName,
serverNames, results, etc., so metadata matches the flush/startEvalRun path.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Mar 9, 2026
@chelojimenez chelojimenez merged commit 211ffaf into main Mar 9, 2026
22 of 24 checks passed
@chelojimenez chelojimenez deleted the sdk-ci-cd branch March 9, 2026 04:37
chelojimenez added a commit that referenced this pull request Mar 9, 2026
- Restore "Evals" sidebar entry (#evals) alongside new "CI Evals" (#ci-evals)
- Add missing selectedIterationId and onSelectIteration props to RunDetailView
  in SuiteIterationsView (required after #1541 refactored RunDetailView)
- Add iteration auto-selection and URL-based routing for the original Evals tab,
  matching the behavior already implemented in CiSuiteDetail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
chelojimenez added a commit that referenced this pull request Mar 9, 2026
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation lgtm This PR has been approved by a maintainer size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants