✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.
Walkthrough
This change adds a full eval-reporting system to the SDK (reportEvalResults, reportEvalResultsSafely, createEvalRunReporter, uploadEvalArtifact), new eval-reporting types and artifact parsers (JUnit, Jest, Vitest), widget snapshot capture and OpenAI-compat injection, and EvalAgent/PromptOptions types. SDK exports and mapping utilities were extended. Docs and a skill reference were added. The inspector UI receives a CI Evals feature set (new routes, tabs, components, adapters, and trace/widget replay improvements), plus numerous UI, hook, and test updates to surface CI metadata and workspace-scoped API keys.
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (13)
sdk/src/mcp-client-manager/MCPClientManager.ts-431-440 (1)
431-440: ⚠️ Potential issue | 🟡 Minor
Consider deep cloning or a readonly return type for getToolMetadata() to prevent accidental cache mutation.
The spread operator only clones the first level. Nested objects and arrays inside _meta remain shared with toolsMetadataCache, so a caller mutating the returned value could inadvertently taint cached data. While current usage is read-only, a deep clone or readonly return type would be more defensive.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/mcp-client-manager/MCPClientManager.ts` around lines 431 - 440, getToolMetadata currently returns a shallow copy via spread which leaves nested objects/arrays aliased to toolsMetadataCache; change getToolMetadata to return a deep-cloned or readonly structure to prevent accidental mutation of the cache: update the implementation in getToolMetadata to deep clone the metadata (e.g., use structuredClone(metadata) with a safe fallback like JSON.parse(JSON.stringify(metadata)) or lodash/cloneDeep) and optionally change the signature to return Readonly<Record<string, unknown>> so callers cannot mutate the result; ensure metadataMap?.get(toolName) is passed through the deep-clone/readonly conversion before returning.
docs/sdk/reference/eval-reporting.mdx-49-50 (1)
49-50: ⚠️ Potential issue | 🟡 Minor
The sample output misidentifies runId.
abc123 in this example is the commit SHA from Line 45, not the created output.runId. Unless externalRunId is set explicitly, this comment is inaccurate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/sdk/reference/eval-reporting.mdx` around lines 49 - 50, Update the inline comment after the console.log sample so it doesn't claim `abc123` is the created `output.runId`; clarify that `abc123` in the example is the commit SHA shown earlier (not the generated runId) unless an `externalRunId` was explicitly provided. Adjust the comment next to the console.log(`Run ${output.runId}: ${output.result}`) line to state that the displayed identifier will be the commit SHA from the example above unless `externalRunId` is set, and leave references to `output.runId` and `externalRunId` as-is.
docs/sdk/reference/eval-reporting.mdx-11-15 (1)
11-15: ⚠️ Potential issue | 🟡 Minor
MCPJAM_API_KEY is not always required.
Line 13 says the env var is mandatory, but Lines 259-260 later document apiKey as a direct input override. That contradiction will confuse SDK users about whether they can authenticate purely through code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/sdk/reference/eval-reporting.mdx` around lines 11 - 15, The table entry incorrectly marks MCPJAM_API_KEY as required; update the table so the `MCPJAM_API_KEY` variable is listed as optional (Required = No or Optional) and adjust its Description to state that authentication can alternatively be provided via the SDK input `apiKey` (documented later), ensuring consistency with the `apiKey` override mentioned in the `apiKey` documentation; also scan for and reconcile any other places that claim the env var is mandatory to avoid conflicting guidance.
mcpjam-inspector/client/src/components/evals/test-cases-overview.tsx-149-149 (1)
149-149: ⚠️ Potential issue | 🟡 Minor
Single-model suites now render a false empty state.
When modelStats.length === 1, this branch suppresses the chart and the component falls through to "No model data available." That hides valid data for the common one-model case.
Suggested fix
```diff
- {modelStats.length > 1 ? (
+ {modelStats.length > 0 ? (
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/evals/test-cases-overview.tsx` at line 149, The JSX branch currently only renders the chart when modelStats.length > 1, which hides valid data for single-model suites; update the condition in the test-cases-overview component so the chart renders for any non-empty modelStats (e.g., modelStats.length > 0 or modelStats.length >= 1) instead of > 1, ensuring the chart-rendering JSX (the branch using modelStats) displays when there is exactly one model.
mcpjam-inspector/client/src/components/evals/ci-metadata-display.tsx-47-57 (1)
47-57: ⚠️ Potential issue | 🟡 Minor
Keep branch and commit badges plain when you can't derive repo URLs.
Falling back to runUrl makes those badges open the pipeline page for non-GitHub providers. That is misleading; only the Pipeline link should use runUrl.
🔧 Suggested fix
```diff
- const branchUrl = branch
-   ? repoBaseUrl
-     ? `${repoBaseUrl}/tree/${encodeURIComponent(branch)}`
-     : runUrl
-   : undefined;
- const commitUrl = shortSha
-   ? repoBaseUrl
-     ? `${repoBaseUrl}/commit/${encodeURIComponent(fullSha ?? shortSha)}`
-     : runUrl
-   : undefined;
+ const branchUrl = branch
+   ? repoBaseUrl
+     ? `${repoBaseUrl}/tree/${encodeURIComponent(branch)}`
+     : undefined
+   : undefined;
+ const commitUrl = fullSha
+   ? repoBaseUrl
+     ? `${repoBaseUrl}/commit/${encodeURIComponent(fullSha)}`
+     : undefined
+   : undefined;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/evals/ci-metadata-display.tsx` around lines 47 - 57, The branchUrl and commitUrl currently fall back to runUrl when getGitHubRepoBaseUrl(runUrl) returns falsy, causing branch/commit badges to erroneously link to the pipeline page; change branchUrl and commitUrl so they are undefined unless repoBaseUrl is truthy (i.e., remove the runUrl fallback) — update the logic in the block that computes repoBaseUrl, branchUrl, and commitUrl (referencing getGitHubRepoBaseUrl, repoBaseUrl, branch, shortSha, fullSha, runUrl) so only the Pipeline link uses runUrl while branch and commit badges remain plain when repoBaseUrl cannot be derived.
mcpjam-inspector/client/src/components/evals/ci-suite-detail.tsx-200-205 (1)
200-205: ⚠️ Potential issue | 🟡 Minor
Handle stale testId and runId routes explicitly.
If the selected case or run no longer exists, this branch returns null and leaves the detail pane blank. Redirecting back to suite-overview or showing an empty state would make deleted or stale deep links recoverable.
Also applies to: 284-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/evals/ci-suite-detail.tsx` around lines 200 - 205, When the "test-detail" branch finds no matching selectedCase (selectedCase is null) you must handle stale/deleted deep links instead of returning null: detect the missing case and either navigate back to the suite overview route (e.g. call navigate('/suite-overview') or invoke the existing setViewMode('suite-overview')) or render a small empty-state component with a message and a button to go back; apply the same fix for the analogous "run-detail" branch that checks selectedRun/runId (the code around lines 284-298) so missing runs redirect or show the empty state as well. Ensure you reference the existing viewMode, selectedTestId/selectedRunId checks and replace the early `return null` with the redirect/state update or empty-state render.
mcpjam-inspector/client/src/components/CiEvalsTab.tsx-259-274 (1)
259-274: ⚠️ Potential issue | 🟡 Minor
Show loading before the empty state.
While queries.isOverviewLoading is true, sdkSuites is still empty, so the right panel briefly renders "No CI runs yet." Checking the loading flag first avoids that false empty state.
💡 Minimal fix
```diff
- {sdkSuites.length === 0 ? (
+ {queries.isOverviewLoading ? (
+   <div className="flex h-full items-center justify-center">
+     <div className="text-center">
+       <div className="mx-auto h-8 w-8 animate-spin rounded-full border-b-2 border-primary" />
+       <p className="mt-4 text-muted-foreground">Loading suites...</p>
+     </div>
+   </div>
+ ) : sdkSuites.length === 0 ? (
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/CiEvalsTab.tsx` around lines 259 - 274, The empty-state branch renders when sdkSuites is empty even while data is loading; in CiEvalsTab change the conditional so you check queries.isOverviewLoading first and render a loading indicator instead of the "No CI runs yet" block. Locate the JSX that tests sdkSuites.length === 0 and move or augment that branch to short-circuit on queries.isOverviewLoading (use the existing queries.isOverviewLoading flag), otherwise fall through to the existing sdkSuites/route.type/selectedSuite logic.
mcpjam-inspector/client/src/lib/ci-evals-router.ts-122-124 (1)
122-124: ⚠️ Potential issue | 🟡 Minor
Errant leading slash in replaceState path.
The constructed URL /${hash} resolves to something like //#/ci-evals — an extra / before the hash. This may cause unexpected behavior or navigation issues.
🔧 Proposed fix
```diff
  if (options?.replace) {
-   history.replaceState({}, "", `/${hash}`);
+   history.replaceState({}, "", hash);
    window.dispatchEvent(new HashChangeEvent("hashchange"));
  } else {
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/lib/ci-evals-router.ts` around lines 122 - 124, The replaceState call is building a path with an extra leading slash, causing URLs like `//#/ci-evals`; change the history.replaceState invocation so it uses the provided hash string without adding a leading "/" (i.e., pass the raw hash value rather than `"/" + hash`) and still dispatch the HashChangeEvent as before (refer to history.replaceState and the subsequent new HashChangeEvent("hashchange") in the options?.replace block).
sdk/src/McpAppsOpenAICompatibleRuntime.bundled.ts-1-4 (1)
1-4: ⚠️ Potential issue | 🟡 Minor
This file requires manual synchronization with its source.
The SDK copy must be updated whenever the source runtime changes. A build step or automated sync would eliminate drift risk and prevent divergence between the mcpjam-inspector source and SDK copy.
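The suggested CI guard can be sketched as a small Node/TypeScript check that compares checksums of the two copies. The script shape and exit behavior are assumptions; the two paths come from the prompt below:

```typescript
// Minimal sketch of a drift check between a source file and its bundled copy.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Returns true when the SDK copy matches its source file byte-for-byte.
function isInSync(sourcePath: string, bundledPath: string): boolean {
  return (
    sha256(readFileSync(sourcePath, "utf8")) ===
    sha256(readFileSync(bundledPath, "utf8"))
  );
}

// In CI (illustrative wiring): fail the build when the copies diverge.
// if (!isInSync(
//   "mcpjam-inspector/server/routes/apps/mcp-apps/McpAppsOpenAICompatibleRuntime.ts",
//   "sdk/src/McpAppsOpenAICompatibleRuntime.bundled.ts",
// )) {
//   console.error("Bundled runtime is out of sync; run the sync script.");
//   process.exit(1);
// }
```

Running this as a prepublish or CI step turns silent drift into a hard failure.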
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/McpAppsOpenAICompatibleRuntime.bundled.ts` around lines 1 - 4, The SDK copy McpAppsOpenAICompatibleRuntime.bundled.ts is manually synchronized from mcpjam-inspector/server/routes/apps/mcp-apps/McpAppsOpenAICompatibleRuntime.ts which causes drift; add an automated sync and verification: implement a build/prepublish script (e.g., npm script or Makefile target) that copies or generates McpAppsOpenAICompatibleRuntime.bundled.ts from the source file and a CI check that compares checksums (or runs the same copy) to fail the build if out-of-sync, and wire this into the repository CI so changes to the source automatically update or flag the SDK copy.
sdk/src/eval-result-mapping.ts-145-154 (1)
145-154: ⚠️ Potential issue | 🟡 Minor
Variable testResults shadows the outer parameter.
Line 147 declares const testResults, which shadows the function parameter testResults: Map<string, EvalRunResult> on line 140. This compiles but breeds confusion.
Proposed fix
```diff
  for (const [testName, testRun] of testResults) {
    const expectedToolCalls = options.expectedToolCallsByTest?.[testName];
-   const testResults = runToEvalResults(testRun, {
+   const mappedResults = runToEvalResults(testRun, {
      casePrefix: `${options.casePrefix}-${testName}`,
      provider: options.provider,
      model: options.model,
      expectedToolCalls,
      promptSelector: options.promptSelector,
    });
-   results.push(...testResults);
+   results.push(...mappedResults);
  }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/eval-result-mapping.ts` around lines 145 - 154, The inner const testResults declared inside the for-loop shadows the function parameter testResults; rename the inner variable (e.g., to caseResults or evalResults) returned from runToEvalResults to avoid shadowing and confusion, update its usages in the loop where results.push(...testResults) is called, and ensure references to runToEvalResults, the loop binding ([testName, testRun]), options, and the results array are updated accordingly.
mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts-567-569 (1)
567-569: ⚠️ Potential issue | 🟡 Minor
Mutating part.toolCallId may cause side effects.
Line 568 assigns to part.toolCallId on the original content part. If the caller retains references to the trace data, this mutation could leak across invocations.
Proposed fix: avoid mutating input
```diff
  const matchedResult = claimMatchingResult(
    resultBuckets,
    toolCallId,
    !part.toolCallId,
  )?.part;
- if (!part.toolCallId) {
-   part.toolCallId = toolCallId;
- }
+ const effectiveToolCallId = part.toolCallId ?? toolCallId;
  parts.push(
    ...buildToolParts({
-     toolCall: part,
+     toolCall: { ...part, toolCallId: effectiveToolCallId },
      matchedResult,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts` around lines 567 - 569, The code mutates the input object by assigning to part.toolCallId which can leak state; instead avoid mutating the original part by creating a shallow clone before setting the property (e.g., make a newPart = { ...part } or use Object.assign({}, part)), set newPart.toolCallId = toolCallId only when missing, and use the cloned object in the returned data/array so callers retain immutable input; update the logic around part and toolCallId to operate on the clone rather than mutating part directly.
sdk/src/report-eval-results.ts-395-415 (1)
395-415: ⚠️ Potential issue | 🟡 Minor
Tautological condition: config.baseUrl.length >= 0 is always true.
String lengths are never negative. This appears to be vestigial or an incomplete check.
Remove the dead condition
```diff
  const bytes = getByteLength(JSON.stringify(body));
- return bytes <= CHUNK_TARGET_BYTES && config.baseUrl.length >= 0;
+ return bytes <= CHUNK_TARGET_BYTES;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/report-eval-results.ts` around lines 395 - 415, In shouldUseOneShotUpload the condition (config.baseUrl.length >= 0) is tautological; remove it or change it to a meaningful check (e.g., config.baseUrl.length > 0) so the function actually verifies baseUrl is non-empty; update the return expression in shouldUseOneShotUpload to only depend on bytes <= CHUNK_TARGET_BYTES (and optionally config.baseUrl.length > 0) and remove the dead >= 0 check referencing config.baseUrl.
sdk/src/report-eval-results.ts-161-174 (1)
161-174: ⚠️ Potential issue | 🟡 Minor
Silent fallback to empty object may violate the type contract.
When response.json() throws, responseBody becomes undefined, and line 173 returns {} as T. If T has required fields (e.g., StartRunResponse.suiteId), callers will encounter undefined values unexpectedly. Consider throwing when JSON parsing fails on a successful response, or at minimum logging a warning.
Proposed defensive handling
```diff
- let responseBody: BackendEnvelope<T> | undefined;
- try {
-   responseBody = (await response.json()) as BackendEnvelope<T>;
- } catch {
-   responseBody = undefined;
- }
+ let responseBody: BackendEnvelope<T>;
+ try {
+   responseBody = (await response.json()) as BackendEnvelope<T>;
+ } catch (parseError) {
+   if (response.ok) {
+     console.warn("[mcpjam/sdk] response.ok but JSON parse failed");
+   }
+   responseBody = {} as BackendEnvelope<T>;
+ }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/report-eval-results.ts` around lines 161 - 174, The current logic in report-eval-results.ts silently falls back to returning (responseBody ?? {}) as T when response.json() fails, which can break callers expecting required fields; update the handling around response.json() so that if parsing throws and response.ok is true you throw an explicit Error (including status and parsing failure context) instead of returning an empty object; adjust the code that checks responseBody/response.ok and the final return (responseBody ?? {}) as T to rely on a guaranteed parsed body or raise, referencing the response.json() call, the responseBody variable, and the response.ok branch to locate where to change the behavior.
🧹 Nitpick comments (14)
mcpjam-inspector/client/src/components/evals/__tests__/iteration-details-ui.test.tsx (1)
27-41: Unused mock: mockGetBlob is defined but never exercised.
The useAction mock returns mockGetBlob, which gets reset each test, yet neither test invokes nor asserts against it. If blob-fetching behavior isn't under test here, consider removing the mock entirely to reduce noise, or add a test that exercises it.
♻️ Optional: Remove unused mock if not needed
```diff
-const { mockGetBlob, mockJsonEditor } = vi.hoisted(() => ({
-  mockGetBlob: vi.fn(),
+const { mockJsonEditor } = vi.hoisted(() => ({
   mockJsonEditor: vi.fn((props: any) => (
     <div data-testid="json-editor">{JSON.stringify(props.value)}</div>
   )),
 }));
```
And update the convex/react mock:
```diff
 vi.mock("convex/react", () => ({
-  useAction: () => mockGetBlob,
+  useAction: () => vi.fn(),
 }));
```
Then remove the reset in beforeEach:
```diff
 beforeEach(() => {
-  mockGetBlob.mockReset();
   mockJsonEditor.mockClear();
 });
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/evals/__tests__/iteration-details-ui.test.tsx` around lines 27 - 41, The test file defines and resets mockGetBlob but never uses it; either remove the unused mock and its reset or exercise it in a test: delete the vi.mock("convex/react", ...) block (and remove mockGetBlob and its reset in beforeEach) if blob-fetching isn't relevant, or keep the mock and add a test that calls the component/code path that invokes useAction (returning mockGetBlob) and assert mockGetBlob was called; reference mockGetBlob, the useAction mock in the "convex/react" vi.mock call, and the beforeEach reset to locate and update the code.
mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer.test.tsx (1)
9-50: Consider using shared mock presets.
The inline mocks for preferences, provider logos, and UI components work correctly. However, if mcpApiPresets or storePresets in client/src/test/mocks/ offer equivalent utilities, leveraging them would improve consistency across the test suite.
As per coding guidelines: "Use mock presets from client/src/test/mocks/ including mcpApiPresets and storePresets in client tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer.test.tsx` around lines 9 - 50, Tests currently define inline mocks for usePreferencesStore, getProviderLogo, JsonEditor, and MessageView (which calls mockMessageView); replace these ad-hoc mocks by importing and applying the shared mock presets (e.g., storePresets and mcpApiPresets) used across the test suite so the same mocked preferences/provider/logo/UI behavior is reused; update the test setup to use the presets before rendering (ensuring usePreferencesStore, getProviderLogo, JsonEditor, and MessageView behaviors are provided by the presets) and remove the corresponding inline vi.mock blocks.
mcpjam-inspector/client/src/components/evals/ci-suite-list-sidebar.tsx (1)
98-136: Well-crafted suite list rendering.
The component elegantly handles loading, empty, and populated states. The sparkline visualization with bounded height calculation is a nice touch. Status indicators and relative timestamps create an informative at-a-glance view.
Consider adding aria-current="true" to the selected suite button for enhanced screen reader navigation:
♿ Optional accessibility enhancement
```diff
 <button
   key={entry.suite._id}
   onClick={() => onSelectSuite(entry.suite._id)}
+  aria-current={selectedSuiteId === entry.suite._id ? "true" : undefined}
   className={cn(
     "w-full px-4 py-2.5 text-left transition-colors hover:bg-accent/50",
     selectedSuiteId === entry.suite._id && "bg-accent",
   )}
 >
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/evals/ci-suite-list-sidebar.tsx` around lines 98 - 136, Add an accessibility marker to the selected suite button: when rendering the button in the CI suite list (the <button> that uses onSelectSuite and key={entry.suite._id}), include aria-current="true" when selectedSuiteId === entry.suite._id (and omit or set to false otherwise) so screen readers can identify the current/active suite; update the conditional props on that button element accordingly.
sdk/tests/TestAgent.test.ts (1)
397-446: Restore the console.warn spy even on failure.
If this case fails before the last line, console.warn stays mocked and can skew later assertions. A try/finally or afterEach(jest.restoreAllMocks) keeps the suite isolated.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/tests/TestAgent.test.ts` around lines 397 - 446, The test uses jest.spyOn(console, "warn") to mock warnings but doesn't guarantee restoration on failure; wrap the spy usage in a try/finally (or install a suite-level cleanup like afterEach(jest.restoreAllMocks)) so console.warn is always restored: create the warnSpy via jest.spyOn(console, "warn"), run the test logic inside try { ... } and call warnSpy.mockRestore() in finally {}, or add afterEach to call jest.restoreAllMocks(); reference the local warnSpy variable and the test case "warns and skips widget snapshots when resource reads fail" / TestAgent instantiation to locate where to add the cleanup.
sdk/tests/eval-run-reporter.test.ts (1)
162-208: Deduplicate the mock run builders.
createMockIterationResult and createMockRunResult are copied verbatim into both blocks. Pulling them to file scope will make future result-shape changes land in one place instead of two.
Also applies to: 240-286
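The file-scope builder this comment asks for can be sketched as a defaults-plus-overrides factory. The MockRunResult shape below is illustrative, not the SDK's actual EvalRunResult type:

```typescript
// Illustrative sketch: one shared mock builder at file scope, customized
// per test via partial overrides instead of duplicated per describe block.
interface MockRunResult {
  passed: boolean;
  durationMs: number;
  toolCalls: string[];
}

const defaultRunResult: MockRunResult = {
  passed: true,
  durationMs: 10,
  toolCalls: [],
};

// A single definition; each describe block supplies only what differs.
function createMockRunResult(
  overrides: Partial<MockRunResult> = {},
): MockRunResult {
  return { ...defaultRunResult, ...overrides };
}

const failing = createMockRunResult({ passed: false, toolCalls: ["search"] });
console.log(failing.passed); // false
console.log(failing.durationMs); // 10 (default retained)
```

When the result shape changes, only the shared defaults and interface need updating.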
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/tests/eval-run-reporter.test.ts` around lines 162 - 208, The two helper builders createMockIterationResult and createMockRunResult are duplicated in multiple describe blocks; move both functions out of the inner describe into file scope (top-level within the test file) so all tests share the single implementations, then update/remove the inner copies so tests call the top-level createMockIterationResult and createMockRunResult; ensure the signatures and returned types (IterationResult, EvalRunResult) remain unchanged so all usages compile.
sdk/skills/create-mcp-eval/SKILL.md (2)
139-144: Add a language specifier to the fenced code block.
The ".gitignore additions" block lacks a language identifier, which aids syntax highlighting and accessibility tooling.
📝 Suggested fix
````diff
 ### .gitignore additions
-```
+```gitignore
 node_modules/
 dist/
 .env
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/skills/create-mcp-eval/SKILL.md` around lines 139 - 144, Update the fenced code block under the "### .gitignore additions" section so the opening triple backticks include the language specifier gitignore (i.e., change the bare fence opener to a gitignore-tagged fence) to enable proper syntax highlighting and accessibility tooling for the listed entries.
614-620: Minor: Emphasis style inconsistency.
Static analysis prefers underscores for emphasis. The word "unique" uses asterisks.
📝 Suggested fix
```diff
-4. **Write unambiguous prompts for similar tools** — when a server has tools with overlapping descriptions (e.g., `create_view` vs `export_to_excalidraw`), prompts must reference the tool's *unique* action.
+4. **Write unambiguous prompts for similar tools** — when a server has tools with overlapping descriptions (e.g., `create_view` vs `export_to_excalidraw`), prompts must reference the tool's _unique_ action.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/skills/create-mcp-eval/SKILL.md` around lines 614 - 620, The emphasis style is inconsistent: the word "unique" in item 4 uses asterisks instead of the preferred underscores; update the markdown to use underscores for emphasis (e.g., change *unique* to _unique_) in the section discussing tool descriptions (item 4, the sentence that currently says "prompts must reference the tool's *unique* action") so emphasis style matches the rest of the document.
mcpjam-inspector/client/src/lib/ci-evals-router.ts (1)
151-151: Import placed after usage.
The React import resides at the file's end, after all references to React.useState and React.useEffect. While hoisting ensures correctness, conventional placement at the top aids readability.
✨ Proposed relocation
```diff
+import React from "react";
+
 /**
  * CI Evals Router - Hash-based routing for CI evals tab
```
Then remove line 151.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mcpjam-inspector/client/src/lib/ci-evals-router.ts` at line 151, The React default import is located after its usages (React.useState, React.useEffect) which reduces readability; move the line importing React (import React from "react";) to the top of the file alongside other imports so React is declared before any references, and remove the trailing duplicate import at the end of the file (the current import at the line with import React from "react";) to avoid redundancy.
sdk/src/eval-result-mapping.ts (1)
163-214: Consider consolidating duplicated aggregation logic.
iterationsToEvalResultInputs replicates the prompt-aggregation pattern from iterationToEvalResult. The repeated flatMap chains for tool calls, traces, and snapshots could be extracted into a shared helper, reducing surface area for drift. This is a low-priority refactor; the current implementation is correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/eval-result-mapping.ts` around lines 163 - 214, The iterationsToEvalResultInputs function duplicates the prompt-aggregation logic found in iterationToEvalResult; refactor by extracting a helper (e.g., collectPromptAggregates or aggregatePrompts) that accepts an IterationResult or prompts array and returns actualToolCalls, traceMessages (or trace object), and widgetSnapshots; update iterationsToEvalResultInputs and iterationToEvalResult to call that helper so the flatMap chains for getToolCalls/getMessages/getWidgetSnapshots are centralized and reused.
sdk/src/report-eval-results.ts (4)
192-203: Regex-based retry heuristic is fragile.
Matching /network|fetch|timeout|429|5\d\d/i against error.message may trigger on unrelated errors containing these substrings (e.g., an error message like "Failed to fetch network configuration"). Consider tightening the pattern or prioritizing explicit error types.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/report-eval-results.ts` around lines 192 - 203, The retry heuristic in the shouldRetry logic is too broad; update it to prefer explicit error-type and status checks and tighten the regex: keep the existing isAbortError check, then check for known error shapes (e.g., error.name === "FetchError" or (error as any).type === "system" for node fetch), and check numeric HTTP statuses explicitly via (error as any).status === 429 || ((error as any).status >= 500 && (error as any).status < 600); only if none of those shape checks exist fall back to a much stricter regex (use word boundaries like /\b(timeout|timeouted|fetch failed)\b/i or /\b5\d{2}\b/ to avoid accidental matches). Update the shouldRetry variable construction (the block that sets shouldRetry) and keep the existing retry loop using config.retryDelaysMs, sleep and jitter unchanged.
471-484: Type assertions on server response assume constrained values.
Casting start.status as "completed" | "failed" and start.result as "passed" | "failed" trusts the server. If the backend evolves (e.g., adds "pending"), this could silently propagate invalid states. Consider a runtime guard or, at minimum, document the assumption.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/report-eval-results.ts` around lines 471 - 484, The current return block trusts server values by casting start.status and start.result to tight literal unions; add a runtime guard instead: implement small validators (e.g., isValidStatus(status) and isValidResult(result)) and use them before returning from the block that reads start (the local variable named start in report-eval-results.ts) — if the validators pass, return the object with the validated values, otherwise handle the unexpected case (throw a descriptive error, log and fall back to a safe default, or mark as "failed") so invalid/unknown server values are not silently asserted into the narrowed types.
270-326: uploadBlobToConvex lacks timeout/abort handling.
Unlike requestWithRetry, this function has no AbortController timeout. Large uploads or unresponsive servers could hang indefinitely.
Add timeout parity with requestWithRetry
```diff
 async function uploadBlobToConvex(
   config: RuntimeConfig,
   uploadUrl: string,
   body: string,
   contentType: string
 ): Promise<string> {
   let lastError: unknown;
   for (let attempt = 0; attempt <= config.retryDelaysMs.length; attempt++) {
+    const controller = new AbortController();
+    const timeoutHandle = setTimeout(() => controller.abort(), config.timeoutMs);
     try {
       const response = await fetch(uploadUrl, {
         method: "POST",
         headers: {
           "Content-Type": contentType,
         },
         body,
+        signal: controller.signal,
       });
+      clearTimeout(timeoutHandle);
       // ... rest unchanged
     } catch (error) {
+      clearTimeout(timeoutHandle);
       lastError = error;
       // ... existing retry logic
     }
   }
   // ...
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/report-eval-results.ts` around lines 270 - 326, The uploadBlobToConvex function lacks request timeout/abort handling; update uploadBlobToConvex to use an AbortController (like requestWithRetry) and abort the fetch after the configured timeout (e.g., config.requestTimeoutMs) so large uploads or hung servers don't block indefinitely: create an AbortController before fetch, pass its signal to fetch, set a timer to call controller.abort() after the timeout (and clear it on success), and ensure abort errors are treated as retryable in the existing catch logic so behavior matches requestWithRetry.
88-119: Chunking logic handles count and byte limits well.
One edge case: a single EvalResultInput exceeding maxBytes will still be pushed as its own chunk. If the backend enforces a hard payload limit, this could fail. Consider logging a warning when a single-item chunk exceeds the target.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/report-eval-results.ts` around lines 88 - 119, In chunkResultsForUpload detect when a single EvalResultInput alone exceeds maxBytes and emit a warning; specifically, where candidateBytes is computed for candidate = [...currentChunk, result], if candidateBytes > maxBytes and currentChunk.length === 0 (meaning this single result already exceeds maxBytes) call a logger (e.g., console.warn or the module logger) with identifying info from the result (such as an id or prompt hash) and the computed candidateBytes and maxBytes so the oversized item is visible for debugging; leave the rest of the splitting logic intact so the oversized item is still pushed as its own chunk.
sdk/src/eval-reporting-types.ts (1)
31-36: Record<string, never> as a presence flag is unconventional.
This pattern forces each permission to be {} when present. While valid, it's less intuitive than a simple boolean flag. If backward compatibility permits, consider:
```typescript
camera?: boolean;
microphone?: boolean;
// ...
```
Otherwise, a brief comment clarifying the intent would aid maintainability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/eval-reporting-types.ts` around lines 31 - 36, The EvalWidgetPermissions type uses Record<string, never> for camera, microphone, geolocation, and clipboardWrite which is unconventional; update the type so each permission is an optional boolean (camera?: boolean; microphone?: boolean; geolocation?: boolean; clipboardWrite?: boolean) to act as a presence flag, or if you must keep the current shape for compatibility, add a brief comment above EvalWidgetPermissions explaining that Record<string, never> is intentionally used to represent an empty-object presence flag and why it cannot be changed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx:
- Around line 130-141: The fallbacks use rawOutput instead of the normalized
resolvedToolOutput, so when callers only pass toolOutput the widget metadata and
serverId can be missed; update effectiveToolMeta to call
readToolResultMeta(resolvedToolOutput) instead of readToolResultMeta(rawOutput),
update uiType to call detectUIType(effectiveToolMeta, resolvedToolOutput), and
update the serverId fallback to call readToolResultServerId(resolvedToolOutput)
so all metadata extraction uses the resolvedToolOutput variable.
In
@mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer-adapter.test.ts:
- Around line 228-229: The test currently works around a mutation by
deep-cloning traces in makeTrace but the real fix is to make
adaptTraceToUiMessages pure and update the test to assert no mutation: change
adaptTraceToUiMessages so it never writes back into caller-owned objects (create
new part/message objects when you need to add toolCallId or other synthetic
fields instead of mutating the input), and then remove the deep-clone in the
test (makeTrace) and add an assertion that the original TraceEnvelope and its
parts remain unchanged after calling adaptTraceToUiMessages.
In
@sdk/src/artifact-parsers/jest-json.ts:
- Around line 45-46: The parser currently collapses any non-"passed"
assertion.status into "failed" (see variables status and passed in
sdk/src/artifact-parsers/jest-json.ts), which misclassifies
skipped/pending/todo/disabled tests; update the logic to preserve explicit Jest
statuses instead of defaulting to "failed": keep status = assertion.status (no
fallback to "failed"), compute passed as status === "passed", and map other
statuses (e.g., "skipped","pending","todo","disabled") to a
non-failed/non-passed outcome (or a dedicated "skipped"/"pending" state) so
suite pass-rate calculations consume the correct status values. Ensure any
downstream code that reads passed uses the new status values consistently
(adjust any use of the passed variable or status mapping in the same module).
In
@sdk/src/artifact-parsers/junit-xml.ts:
- Around line 3-11: The attribute parser parseAttributes only matches
double-quoted attributes and the failure/error/skipped handling ignores
self-closing elements and single-quoted attributes; update parseAttributes to
accept either double or single quotes (e.g., allow quote capture and use it to
extract value) and ensure callers that parse `<failure>`, `<error>`, and `<skipped>`
elements also check for a message attribute (message='...' or message="...") on
self-closing tags as well as for inner text in paired tags (use the same
attribute parser to read message and type), so tests get proper titles and
concrete messages instead of defaults; update any parsing logic that reads body
text (the code around the existing failure handling) to prefer message attribute
if present, otherwise fall back to inner text.
In
@sdk/src/artifact-parsers/vitest-json.ts:
- Around line 42-43: The parser currently treats any non-"pass" state as a
failure; change the logic in vitest-json.ts around where you compute
state/passed (using task.result?.state) to explicitly detect "skipped" and
"todo" (and also check task.mode === "skip" or "todo") and mark those tests as
skipped/todo instead of failed; only set passed = true when state === "pass",
set a new/skipped flag when state === "skipped" || state === "todo" || task.mode
=== "skip" || task.mode === "todo", and ensure downstream scoring/aggregation
(the code that consumes passed/failure counts around the same parser functions)
excludes skipped/todo entries from failure and pass-rate calculations rather
than counting them as failures.
In
@sdk/src/eval-run-reporter.ts:
- Around line 129-141: The reporter mutates shared state (buffered, addedCount,
etc.) from record(), flush(), finalize(), and startEvalRun() without
serialization, allowing concurrent calls to race; fix by serializing all
mutations with a single in-flight guard (e.g. a private mutex/promise queue)
around add(), record(), flush(), finalize(), and startEvalRun() so callers await
the lock before mutating or uploading, ensure flush() clears the buffer only
under the lock and that record() awaits any ongoing flush/upload to avoid
double-appending, and update methods (record, flush, finalize, add,
startEvalRun) to acquire/release the guard to enforce sequential mutation.- Around line 202-299: The incremental path lets network errors escape once
runId exists, so for non-strict mode you must catch failures from startEvalRun,
appendEvalRunIterations, and finalizeEvalRun and fall back to
buildLocalFallbackResult() instead of letting errors propagate. Modify flush()
to wrap the startEvalRun/appendEvalRunIterations loop in try/catch: on error, if
this.input.strict is true rethrow, otherwise set this.completedResult =
this.buildLocalFallbackResult(), this.finalized = true, this.buffered = [], and
return. Do the same in finalize(): wrap await this.flush() and
finalizeEvalRun(...) in try/catch and apply the same strict vs non-strict
handling to produce the local fallback result and mark
completedResult/finalized. Ensure you reference these symbols (flush, finalize,
startEvalRun, appendEvalRunIterations, finalizeEvalRun,
buildLocalFallbackResult, input.strict, completedResult, finalized, buffered)
when making the changes.
In
@sdk/tests/asana-sdk-classes.test.ts:
- Around line 26-48: Remove the hardcoded credential fallbacks and require
environment variables only: delete literal default strings for MCPJAM_API_KEY,
MCPJAM_BASE_URL, ASANA_REFRESH_TOKEN, ASANA_CLIENT_ID, OPENROUTER_API_KEY (and
any other secret defaults) so those constants read directly from process.env;
keep the existing gating logic (RUN_INGESTION_TESTS / RUN_LLM_INGESTION) so
tests skip when env vars are absent and update any test setup that relied on the
defaults to handle missing envs gracefully.
Minor comments:
In@docs/sdk/reference/eval-reporting.mdx:
- Around line 49-50: Update the inline comment after the console.log sample so
it doesn't claimabc123is the createdoutput.runId; clarify thatabc123
in the example is the commit SHA shown earlier (not the generated runId) unless
anexternalRunIdwas explicitly provided. Adjust the comment next to the
console.log(Run ${output.runId}: ${output.result}) line to state that the
displayed identifier will be the commit SHA from the example above unless
externalRunIdis set, and leave references tooutput.runIdand
externalRunIdas-is.- Around line 11-15: The table entry incorrectly marks MCPJAM_API_KEY as
required; update the table so theMCPJAM_API_KEYvariable is listed as
optional (Required = No or Optional) and adjust its Description to state that
authentication can alternatively be provided via the SDK inputapiKey
(documented later), ensuring consistency with theapiKeyoverride mentioned in
theapiKeydocumentation; also scan for and reconcile any other places that
claim the env var is mandatory to avoid conflicting guidance.In
@mcpjam-inspector/client/src/components/CiEvalsTab.tsx:
- Around line 259-274: The empty-state branch renders when sdkSuites is empty
even while data is loading; in CiEvalsTab change the conditional so you check
queries.isOverviewLoading first and render a loading indicator instead of the
"No CI runs yet" block. Locate the JSX that tests sdkSuites.length === 0 and
move or augment that branch to short-circuit on queries.isOverviewLoading (use
the existing queries.isOverviewLoading flag), otherwise fall through to the
existing sdkSuites/route.type/selectedSuite logic.
In
@mcpjam-inspector/client/src/components/evals/ci-metadata-display.tsx:
- Around line 47-57: The branchUrl and commitUrl currently fall back to runUrl
when getGitHubRepoBaseUrl(runUrl) returns falsy, causing branch/commit badges to
erroneously link to the pipeline page; change branchUrl and commitUrl so they
are undefined unless repoBaseUrl is truthy (i.e., remove the runUrl fallback) —
update the logic in the block that computes repoBaseUrl, branchUrl, and
commitUrl (referencing getGitHubRepoBaseUrl, repoBaseUrl, branch, shortSha,
fullSha, runUrl) so only the Pipeline link uses runUrl while branch and commit
badges remain plain when repoBaseUrl cannot be derived.
In
@mcpjam-inspector/client/src/components/evals/ci-suite-detail.tsx:
- Around line 200-205: When the "test-detail" branch finds no matching
selectedCase (selectedCase is null) you must handle stale/deleted deep links
instead of returning null: detect the missing case and either navigate back to
the suite overview route (e.g. call navigate('/suite-overview') or invoke the
existing setViewMode('suite-overview')) or render a small empty-state component
with a message and a button to go back; apply the same fix for the analogous
"run-detail" branch that checks selectedRun/runId (the code around lines
284-298) so missing runs redirect or show the empty state as well. Ensure you
reference the existing viewMode, selectedTestId/selectedRunId checks and replace
the early `return null` with the redirect/state update or empty-state render.
In
@mcpjam-inspector/client/src/components/evals/test-cases-overview.tsx:
- Line 149: The JSX branch currently only renders the chart when
modelStats.length > 1, which hides valid data for single-model suites; update
the condition in the test-cases-overview component so the chart renders for any
non-empty modelStats (e.g., modelStats.length > 0 or modelStats.length >= 1)
instead of > 1, ensuring the chart-rendering JSX (the branch using modelStats)
displays when there is exactly one model.
In
@mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts:
- Around line 567-569: The code mutates the input object by assigning to
part.toolCallId which can leak state; instead avoid mutating the original part
by creating a shallow clone before setting the property (e.g., make a newPart =
{ ...part } or use Object.assign({}, part)), set newPart.toolCallId = toolCallId
only when missing, and use the cloned object in the returned data/array so
callers retain immutable input; update the logic around part and toolCallId to
operate on the clone rather than mutating part directly.
In
@mcpjam-inspector/client/src/lib/ci-evals-router.ts:
- Around line 122-124: The replaceState call is building a path with an extra
leading slash, causing URLs like `//#/ci-evals`; change the history.replaceState
invocation so it uses the provided hash string without adding a leading "/"
(i.e., pass the raw hash value rather than `"/" + hash`) and still dispatch the
HashChangeEvent as before (refer to history.replaceState and the subsequent new
HashChangeEvent("hashchange") in the options?.replace block).
In
@sdk/src/eval-result-mapping.ts:
- Around line 145-154: The inner const testResults declared inside the for-loop
shadows the function parameter testResults; rename the inner variable (e.g., to
caseResults or evalResults) returned from runToEvalResults to avoid shadowing
and confusion, update its usages in the loop where results.push(...testResults)
is called, and ensure references to runToEvalResults, the loop binding
([testName, testRun]), options, and the results array are updated accordingly.
In
@sdk/src/mcp-client-manager/MCPClientManager.ts:
- Around line 431-440: getToolMetadata currently returns a shallow copy via
spread which leaves nested objects/arrays aliased to toolsMetadataCache; change
getToolMetadata to return a deep-cloned or readonly structure to prevent
accidental mutation of the cache: update the implementation in getToolMetadata
to deep clone the metadata (e.g., use structuredClone(metadata) with a safe
fallback like JSON.parse(JSON.stringify(metadata)) or lodash/cloneDeep) and
optionally change the signature to return Readonly<Record<string, unknown>> so
callers cannot mutate the result; ensure metadataMap?.get(toolName) is passed
through the deep-clone/readonly conversion before returning.
In
@sdk/src/McpAppsOpenAICompatibleRuntime.bundled.ts:
- Around line 1-4: The SDK copy McpAppsOpenAICompatibleRuntime.bundled.ts is
manually synchronized from
mcpjam-inspector/server/routes/apps/mcp-apps/McpAppsOpenAICompatibleRuntime.ts
which causes drift; add an automated sync and verification: implement a
build/prepublish script (e.g., npm script or Makefile target) that copies or
generates McpAppsOpenAICompatibleRuntime.bundled.ts from the source file and a
CI check that compares checksums (or runs the same copy) to fail the build if
out-of-sync, and wire this into the repository CI so changes to the source
automatically update or flag the SDK copy.
In
@sdk/src/report-eval-results.ts:
- Around line 395-415: In shouldUseOneShotUpload the
condition `(config.baseUrl.length >= 0)` is tautological; remove it or change it to
a meaningful check (e.g., config.baseUrl.length > 0) so the function actually
verifies baseUrl is non-empty; update the return expression in
shouldUseOneShotUpload to only depend on bytes <= CHUNK_TARGET_BYTES (and
optionally config.baseUrl.length > 0) and remove the dead >= 0 check referencing
config.baseUrl.
- Around line 161-174: The current logic in report-eval-results.ts silently
falls back to returning (responseBody ?? {}) as T when response.json() fails,
which can break callers expecting required fields; update the handling around
response.json() so that if parsing throws and response.ok is true you throw an
explicit Error (including status and parsing failure context) instead of
returning an empty object; adjust the code that checks responseBody/response.ok
and the final return (responseBody ?? {}) as T to rely on a guaranteed parsed
body or raise, referencing the response.json() call, the responseBody variable,
and the response.ok branch to locate where to change the behavior.
Nitpick comments:
In
@mcpjam-inspector/client/src/components/evals/__tests__/iteration-details-ui.test.tsx:
- Around line 27-41: The test file defines and resets mockGetBlob but never uses
it; either remove the unused mock and its reset or exercise it in a test: delete
the vi.mock("convex/react", ...) block (and remove mockGetBlob and its reset in
beforeEach) if blob-fetching isn’t relevant, or keep the mock and add a test
that calls the component/code path that invokes useAction (returning
mockGetBlob) and assert mockGetBlob was called; reference mockGetBlob, the
useAction mock in the "convex/react" vi.mock call, and the beforeEach reset to
locate and update the code.
In
@mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer.test.tsx:
- Around line 9-50: Tests currently define inline mocks for usePreferencesStore,
getProviderLogo, JsonEditor, and MessageView (which calls mockMessageView);
replace these ad-hoc mocks by importing and applying the shared mock presets
(e.g., storePresets and mcpApiPresets) used across the test suite so the same
mocked preferences/provider/logo/UI behavior is reused; update the test setup to
use the presets before rendering (ensuring usePreferencesStore, getProviderLogo,
JsonEditor, and MessageView behaviors are provided by the presets) and remove
the corresponding inline vi.mock blocks.
In
@mcpjam-inspector/client/src/components/evals/ci-suite-list-sidebar.tsx:
- Around line 98-136: Add an accessibility marker to the selected suite button:
when rendering the button in the CI suite list (the `<button>` element that uses
onSelectSuite and key={entry.suite._id}), include aria-current="true" when
selectedSuiteId === entry.suite._id (and omit or set to false otherwise) so
screen readers can identify the current/active suite; update the conditional
props on that button element accordingly.
In
@mcpjam-inspector/client/src/lib/ci-evals-router.ts:
- Line 151: The React default import is located after its usages
(React.useState, React.useEffect) which reduces readability; move the line
importing React (import React from "react";) to the top of the file alongside
other imports so React is declared before any references, and remove the
trailing duplicate import at the end of the file (the current import at line
with "import React from "react";") to avoid redundancy.
In
@sdk/skills/create-mcp-eval/SKILL.md:
- Around line 139-144: Update the fenced code block under the "### .gitignore
additions" section so the opening triple backticks include the language
specifier `gitignore` (i.e., change the block opener from "```" to "```gitignore")
to enable proper syntax highlighting and accessibility tooling for the listed
entries.
- Around line 614-620: The emphasis style is inconsistent: the word "unique" in
item 4 uses asterisks instead of the preferred underscores; update the markdown
to use underscores for emphasis (e.g., change *unique* to _unique_) in the
section discussing tool descriptions (item 4, the sentence that currently says
"prompts must reference the tool's *unique* action") so emphasis style matches
the rest of the document.
In
@sdk/src/eval-reporting-types.ts:
- Around line 31-36: The EvalWidgetPermissions type uses Record<string, never>
for camera, microphone, geolocation, and clipboardWrite which is unconventional;
update the type so each permission is an optional boolean (camera?: boolean;
microphone?: boolean; geolocation?: boolean; clipboardWrite?: boolean) to act as
a presence flag, or if you must keep the current shape for compatibility, add a
brief comment above EvalWidgetPermissions explaining that Record<string, never>
is intentionally used to represent an empty-object presence flag and why it
cannot be changed.
In
@sdk/src/eval-result-mapping.ts:
- Around line 163-214: The iterationsToEvalResultInputs function duplicates the
prompt-aggregation logic found in iterationToEvalResult; refactor by extracting
a helper (e.g., collectPromptAggregates or aggregatePrompts) that accepts an
IterationResult or prompts array and returns actualToolCalls, traceMessages (or
trace object), and widgetSnapshots; update iterationsToEvalResultInputs and
iterationToEvalResult to call that helper so the flatMap chains for
getToolCalls/getMessages/getWidgetSnapshots are centralized and reused.
In
@sdk/src/report-eval-results.ts:
- Around line 192-203: The retry heuristic in the shouldRetry logic is too
broad; update it to prefer explicit error-type and status checks and tighten the
regex: keep the existing isAbortError check, then check for known error shapes
(e.g., error.name === "FetchError" or (error as any).type === "system" for node
fetch), and check numeric HTTP statuses explicitly via (error as any).status ===
429 || ((error as any).status >= 500 && (error as any).status < 600); only if
none of those shape checks exist fall back to a much stricter regex (use word
boundaries like /\b(timeout|timeouted|fetch failed)\b/i or /\b5\d{2}\b/ to avoid
accidental matches). Update the shouldRetry variable construction (the block
that sets shouldRetry) and keep the existing retry loop using
config.retryDelaysMs, sleep and jitter unchanged.
- Around line 471-484: The current return block trusts server values by casting
start.status and start.result to tight literal unions; add a runtime guard
instead: implement small validators (e.g., isValidStatus(status) and
isValidResult(result)) and use them before returning from the block that reads
start (the local variable named start in report-eval-results.ts) — if the
validators pass, return the object with the validated values, otherwise handle
the unexpected case (throw a descriptive error, log and fall back to a safe
default, or mark as "failed") so invalid/unknown server values are not silently
asserted into the narrowed types.
- Around line 270-326: The uploadBlobToConvex function lacks request
timeout/abort handling; update uploadBlobToConvex to use an AbortController
(like requestWithRetry) and abort the fetch after the configured timeout (e.g.,
config.requestTimeoutMs) so large uploads or hung servers don't block
indefinitely: create an AbortController before fetch, pass its signal to fetch,
set a timer to call controller.abort() after the timeout (and clear it on
success), and ensure abort errors are treated as retryable in the existing catch
logic so behavior matches requestWithRetry.
- Around line 88-119: In chunkResultsForUpload detect when a single
EvalResultInput alone exceeds maxBytes and emit a warning; specifically, where
candidateBytes is computed for candidate = [...currentChunk, result], if
candidateBytes > maxBytes and currentChunk.length === 0 (meaning this single
result already exceeds maxBytes) call a logger (e.g., console.warn or the module
logger) with identifying info from the result (such as an id or prompt hash) and
the computed candidateBytes and maxBytes so the oversized item is visible for
debugging; leave the rest of the splitting logic intact so the oversized item is
still pushed as its own chunk.
In
@sdk/tests/eval-run-reporter.test.ts:
- Around line 162-208: The two helper builders createMockIterationResult and
createMockRunResult are duplicated in multiple describe blocks; move both
functions out of the inner describe into file scope (top-level within the test
file) so all tests share the single implementations, then update/remove the
inner copies so tests call the top-level createMockIterationResult and
createMockRunResult; ensure the signatures and returned types (IterationResult,
EvalRunResult) remain unchanged so all usages compile.
In
@sdk/tests/TestAgent.test.ts:
- Around line 397-446: The test uses jest.spyOn(console, "warn") to mock
warnings but doesn't guarantee restoration on failure; wrap the spy usage in a
try/finally (or install a suite-level cleanup like
afterEach(jest.restoreAllMocks)) so console.warn is always restored: create the
warnSpy via jest.spyOn(console, "warn"), run the test logic inside try { ... }
and call warnSpy.mockRestore() in finally {}, or add afterEach to call
jest.restoreAllMocks(); reference the local warnSpy variable and the test case
"warns and skips widget snapshots when resource reads fail" / TestAgent
instantiation to locate where to add the cleanup.</details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
```tsx
const effectiveToolMeta =
  renderOverride?.toolMetadata ??
  toolMetadata ??
  readToolResultMeta(rawOutput);
const resolvedToolOutput = toolOutput ?? rawOutput;
const uiType = detectUIType(effectiveToolMeta, rawOutput ?? toolOutput);
const uiResourceUri =
  renderOverride?.resourceUri ?? getUIResourceUri(uiType, effectiveToolMeta);
const serverId =
  renderOverride?.serverId ??
  getToolServerId(toolName, toolServerMap) ??
  readToolResultServerId(rawOutput);
```
Fall back to toolOutput when extracting widget metadata.
The prop contract allows callers to supply only toolOutput. In that case these fallbacks never see _meta or _serverId, and the component can hit the “failed to load” path even though the result already contains everything it needs.
🐛 Suggested fix
```diff
- const effectiveToolMeta =
-   renderOverride?.toolMetadata ??
-   toolMetadata ??
-   readToolResultMeta(rawOutput);
  const resolvedToolOutput = toolOutput ?? rawOutput;
- const uiType = detectUIType(effectiveToolMeta, rawOutput ?? toolOutput);
+ const effectiveToolMeta =
+   renderOverride?.toolMetadata ??
+   toolMetadata ??
+   readToolResultMeta(resolvedToolOutput);
+ const uiType = detectUIType(effectiveToolMeta, resolvedToolOutput);
  const uiResourceUri =
    renderOverride?.resourceUri ?? getUIResourceUri(uiType, effectiveToolMeta);
  const serverId =
    renderOverride?.serverId ??
    getToolServerId(toolName, toolServerMap) ??
-   readToolResultServerId(rawOutput);
+   readToolResultServerId(resolvedToolOutput);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`
around lines 130 - 141, The fallbacks use rawOutput instead of the normalized
resolvedToolOutput, so when callers only pass toolOutput the widget metadata and
serverId can be missed; update effectiveToolMeta to call
readToolResultMeta(resolvedToolOutput) instead of readToolResultMeta(rawOutput),
update uiType to call detectUIType(effectiveToolMeta, resolvedToolOutput), and
update the serverId fallback to call readToolResultServerId(resolvedToolOutput)
so all metadata extraction uses the resolvedToolOutput variable.
```ts
const status = assertion.status ?? "failed";
const passed = status === "passed";
```
🧩 Analysis chain
🌐 Web query:
In Jest JSON output, what values can assertionResults[].status take, and how are pending, todo, or disabled tests represented?
💡 Result:
In Jest’s --json output, each assertionResults[].status is one of: "passed" | "failed" | "skipped" | "pending" | "todo" | "disabled" | "focused". [1]
Representation of non-running tests in assertionResults:
- Skipped tests (e.g. `test.skip`, `describe.skip`) → `status: "skipped"` [1]
- Todo tests (`test.todo(...)`) → `status: "todo"` [1]
- Pending tests (framework-reported "pending"/not executed) → `status: "pending"` [1]
- Disabled tests (reported as disabled) → `status: "disabled"` [1]
Sources:
[1] @jest/types TypeScript definitions (type Status = 'passed' | 'failed' | 'skipped' | 'pending' | 'todo' | 'disabled' | 'focused') https://app.unpkg.com/@jest/types@30.2.0/files/build/index.d.ts
Handle Jest assertion statuses correctly to avoid failing non-executed tests
The current implementation treats all non-"passed" statuses (skipped, pending, todo, disabled) as failures, which incorrectly drags down suite pass rate. Jest JSON output includes explicit status values for these cases, and they should be handled appropriately rather than collapsed into a single "failed" result.
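A minimal sketch of the suggested mapping, assuming the parser tracks a three-way outcome; the names `TestOutcome` and `classifyStatus` are illustrative, not part of the SDK:

```typescript
// Jest's documented assertion statuses (from @jest/types).
type JestStatus =
  | "passed" | "failed" | "skipped"
  | "pending" | "todo" | "disabled" | "focused";

type TestOutcome = "passed" | "failed" | "skipped";

// Map each Jest status to an outcome instead of collapsing
// everything that is not "passed" into "failed".
function classifyStatus(status: JestStatus): TestOutcome {
  switch (status) {
    case "passed":
    case "focused": // assumption: a focused test that reached the report ran and passed
      return "passed";
    case "failed":
      return "failed";
    default:
      // skipped / pending / todo / disabled were never executed
      return "skipped";
  }
}

// Pass rate should only consider tests that actually ran.
function passRate(statuses: JestStatus[]): number {
  const outcomes = statuses.map(classifyStatus);
  const executed = outcomes.filter((o) => o !== "skipped");
  if (executed.length === 0) return 1;
  return executed.filter((o) => o === "passed").length / executed.length;
}
```

With this shape, a suite containing one passed, one failed, and two non-executed tests reports a 50% pass rate rather than 25%.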
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/src/artifact-parsers/jest-json.ts` around lines 45 - 46, The parser
currently collapses any non-"passed" assertion.status into "failed" (see
variables status and passed in sdk/src/artifact-parsers/jest-json.ts), which
misclassifies skipped/pending/todo/disabled tests; update the logic to preserve
explicit Jest statuses instead of defaulting to "failed": keep status =
assertion.status (no fallback to "failed"), compute passed as status ===
"passed", and map other statuses (e.g., "skipped","pending","todo","disabled")
to a non-failed/non-passed outcome (or a dedicated "skipped"/"pending" state) so
suite pass-rate calculations consume the correct status values. Ensure any
downstream code that reads passed uses the new status values consistently
(adjust any use of the passed variable or status mapping in the same module).
```ts
add(result: EvalResultInput): void {
  this.ensureNotFinalized();
  this.buffered.push(result);
  this.addedCount += 1;
}

async record(result: EvalResultInput): Promise<void> {
  this.add(result);
  const preview = chunkResultsForUpload(this.buffered, 200, 1024 * 1024);
  if (preview.length > 1 || this.buffered.length >= 200) {
    await this.flush();
  }
}
```
Serialize reporter mutations before calling the API.
record(), flush(), and finalize() all mutate shared state without any in-flight guard. Concurrent callers can race into startEvalRun(), append the same buffer twice, or finalize the run while another flush is still uploading.
Also applies to: 202-299
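One way to get that serialization is a promise-chain mutex; the `Serialized` class and the reporter-like usage below are an illustrative sketch, not existing SDK code:

```typescript
// Minimal promise-queue mutex: each caller waits for the previous
// critical section to finish before touching shared state.
class Serialized {
  private tail: Promise<void> = Promise.resolve();

  withLock<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.tail.then(fn, fn); // run even if a prior section rejected
    // Advance the tail, swallowing errors so the chain never sticks.
    this.tail = run.then(
      () => undefined,
      () => undefined,
    );
    return run;
  }
}

// Usage sketch: record/flush-style mutations that never interleave.
const lock = new Serialized();
const buffered: number[] = [];
let flushes = 0;

async function record(n: number): Promise<void> {
  await lock.withLock(async () => {
    buffered.push(n);
    if (buffered.length >= 2) {
      flushes += 1;
      buffered.length = 0; // buffer cleared only while holding the lock
    }
  });
}

// Three concurrent callers are processed in call order: one flush of
// [1, 2], then [3] remains buffered.
const done = Promise.all([record(1), record(2), record(3)]);
```

Wrapping `add`, `record`, `flush`, `finalize`, and the run-start call in the same `withLock` would make double-appends and premature finalization impossible without restructuring the public API.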
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/src/eval-run-reporter.ts` around lines 129 - 141, The reporter mutates
shared state (buffered, addedCount, etc.) from record(), flush(), finalize(),
and startEvalRun() without serialization, allowing concurrent calls to race; fix
by serializing all mutations with a single in-flight guard (e.g. a private
mutex/promise queue) around add(), record(), flush(), finalize(), and
startEvalRun() so callers await the lock before mutating or uploading, ensure
flush() clears the buffer only under the lock and that record() awaits any
ongoing flush/upload to avoid double-appending, and update methods (record,
flush, finalize, add, startEvalRun) to acquire/release the guard to enforce
sequential mutation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```ts
  toolState !== "input-available"
) {
  return null;
}
```
MCP_APPS state filtering added where none existed before
Medium Severity
The new WidgetReplay component adds a state guard for MCP_APPS widgets that returns null if toolState is not in an explicit allowlist. Previously, part-switch.tsx rendered MCPAppsRenderer for MCP_APPS regardless of toolState, meaning states like "call" would pass through to the renderer. Now, any state not in the allowlist (including undefined or "call") causes the widget to silently not render, potentially breaking MCP apps rendering during early tool-call states.
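The described guard behaves like the following sketch; `ALLOWED_STATES` is a hypothetical allowlist (only "input-available" is confirmed by the snippet above, the rest is assumed):

```typescript
// Hypothetical allowlist mirroring the described behavior: any state not
// listed (including undefined or "call") causes the widget not to render.
const ALLOWED_STATES = new Set([
  "input-available",
  "output-available", // assumption, not taken from the diff
]);

function shouldRenderMcpApp(toolState: string | undefined): boolean {
  return toolState !== undefined && ALLOWED_STATES.has(toolState);
}
```

Under the previous behavior, `shouldRenderMcpApp` would have returned true unconditionally for MCP_APPS, which is why early states like "call" now silently drop the widget.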
```tsx
<div className="text-xs text-muted-foreground italic">
  No expected tool calls
</div>
) : (
```
Snapshot empty array not treated as absent fallback
Medium Severity
The expectedToolCalls resolution uses a nullish-coalescing chain that falls through to testCase?.expectedToolCalls only when snapshotExpected is null/undefined or an empty array: snapshotExpected?.length > 0 short-circuits to null for empty arrays, which then correctly falls through, and the trailing ?? [] means an undefined testCase?.expectedToolCalls defaults to []. The prior code used ||, which treated both undefined and empty arrays as falsy. At first glance the new logic looks like it changes behavior when snapshotExpected is a non-null but empty array, but the ternary produces null in that case and falls through, matching the old || behavior. This is actually correct; disregard the finding, the logic is equivalent.
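The equivalence reasoned through above can be checked directly; `pickOld` and `pickNew` are stand-ins for the two resolution styles, not the component's actual code:

```typescript
// Old style: || treats undefined, null, and [] (via the length check) the same way.
function pickOld(snapshot: string[] | null | undefined, fromCase?: string[]): string[] {
  return (snapshot && snapshot.length > 0 ? snapshot : fromCase) || [];
}

// New style: a ternary produces null for empty arrays, then ?? falls through.
function pickNew(snapshot: string[] | null | undefined, fromCase?: string[]): string[] {
  const snapshotExpected = snapshot && snapshot.length > 0 ? snapshot : null;
  return snapshotExpected ?? fromCase ?? [];
}

// The two agree on every interesting input shape.
const cases: Array<[string[] | null | undefined, string[] | undefined]> = [
  [undefined, undefined],
  [null, ["a"]],
  [[], ["a"]],    // non-null but empty snapshot: both fall through
  [["s"], ["a"]], // non-empty snapshot wins in both
];
for (const [snap, tc] of cases) {
  const a = JSON.stringify(pickOld(snap, tc));
  const b = JSON.stringify(pickNew(snap, tc));
  if (a !== b) throw new Error(`diverged: ${a} vs ${b}`);
}
```

The only case where `??` and `||` could diverge is a falsy-but-defined value such as an empty array, and the explicit length check removes exactly that case.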
Actionable comments posted: 1
🧹 Nitpick comments (2)
mcpjam-inspector/client/src/components/connection/ServerConnectionCard.tsx (1)
274-281: Consider reusing getServerCommandDisplay for consistency. This logic duplicates the already-imported
getServerCommandDisplay (line 117). Since generateAgentBrief treats empty strings as falsy (per its implementation), the utility works here and keeps derivation logic centralized.
♻️ Proposed simplification
-  const serverUrl =
-    "url" in server.config && server.config.url
-      ? server.config.url.toString()
-      : "command" in server.config
-        ? [server.config.command, ...(server.config.args ?? [])]
-            .filter(Boolean)
-            .join(" ")
-        : undefined;
+  const serverUrl = getServerCommandDisplay(server.config) || undefined;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/connection/ServerConnectionCard.tsx` around lines 274 - 281, The serverUrl derivation duplicates logic already encapsulated by getServerCommandDisplay; replace the inline ternary that computes serverUrl with a call to getServerCommandDisplay(server.config) so the component reuses the centralized utility and preserves behavior expected by generateAgentBrief (which treats empty strings as falsy). Update references to serverUrl accordingly and remove the duplicated construction logic to keep derivation consistent with the imported getServerCommandDisplay helper.mcpjam-inspector/client/src/App.tsx (1)
87-88: Prefer typed Convex query references over as any casts.
Using as any for both the query identifier and args removes compile-time safety and makes query-key drift a runtime surprise.
Suggested direction
-  useQuery(
-    "testSuites:getTestSuitesOverview" as any,
-    shouldSubscribe ? ({ workspaceId } as any) : "skip",
-  );
+  useQuery(
+    api.testSuites.getTestSuitesOverview,
+    shouldSubscribe && workspaceId ? { workspaceId } : "skip",
+  );
// import path may vary by project setup
import { api } from "../convex/_generated/api";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/App.tsx` around lines 87 - 88, Replace the untyped query key and args cast with the generated Convex API references: import { api } from "../convex/_generated/api" and use api.testSuites.getTestSuitesOverview.query as the query identifier and pass the actual args object (e.g., { workspaceId }) directly when shouldSubscribe is true (otherwise keep "skip"); this removes the `as any` casts and restores compile-time safety for the query and its parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mcpjam-inspector/client/src/App.tsx`:
- Around line 73-90: The file is calling useQuery inside the hook
useKeepCiEvalsOverviewWarm but useQuery is not imported; update the module
import from "convex/react" to include useQuery (add useQuery to the existing
import list alongside any other Convex hooks) so that useQuery is available to
the useKeepCiEvalsOverviewWarm function.
---
Nitpick comments:
In `@mcpjam-inspector/client/src/App.tsx`:
- Around line 87-88: Replace the untyped query key and args cast with the
generated Convex API references: import { api } from "../convex/_generated/api"
and use api.testSuites.getTestSuitesOverview.query as the query identifier and
pass the actual args object (e.g., { workspaceId }) directly when
shouldSubscribe is true (otherwise keep "skip"); this removes the `as any` casts
and restores compile-time safety for the query and its parameters.
In `@mcpjam-inspector/client/src/components/connection/ServerConnectionCard.tsx`:
- Around line 274-281: The serverUrl derivation duplicates logic already
encapsulated by getServerCommandDisplay; replace the inline ternary that
computes serverUrl with a call to getServerCommandDisplay(server.config) so the
component reuses the centralized utility and preserves behavior expected by
generateAgentBrief (which treats empty strings as falsy). Update references to
serverUrl accordingly and remove the duplicated construction logic to keep
derivation consistent with the imported getServerCommandDisplay helper.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 70afe4ab-2515-453b-8dec-d01acea7c2c0
📒 Files selected for processing (3)
mcpjam-inspector/client/src/App.tsx
mcpjam-inspector/client/src/components/connection/ServerConnectionCard.tsx
sdk/src/mcp-client-manager/MCPClientManager.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- sdk/src/mcp-client-manager/MCPClientManager.ts
| function useKeepCiEvalsOverviewWarm({ | ||
| isAuthenticated, | ||
| user, | ||
| workspaceId, | ||
| }: { | ||
| isAuthenticated: boolean; | ||
| user: unknown; | ||
| workspaceId: string | null; | ||
| }) { | ||
| const shouldSubscribe = | ||
| isAuthenticated && Boolean(user) && Boolean(workspaceId); | ||
|
|
||
| // Keep the CI suite list subscribed outside the tab so it stays current. | ||
| useQuery( | ||
| "testSuites:getTestSuitesOverview" as any, | ||
| shouldSubscribe ? ({ workspaceId } as any) : "skip", | ||
| ); | ||
| } |
🧩 Analysis chain
🏁 Script executed:
# Check the imports and the specific lines in question
head -20 mcpjam-inspector/client/src/App.tsx | cat -n
Repository: MCPJam/inspector
Length of output: 1265
🏁 Script executed:
# Check lines around 73-90 to verify the useQuery call
sed -n '73,90p' mcpjam-inspector/client/src/App.tsx | cat -n
Repository: MCPJam/inspector
Length of output: 638
🏁 Script executed:
# Search for all useQuery imports/usage in the file
rg "useQuery" mcpjam-inspector/client/src/App.tsx -n
Repository: MCPJam/inspector
Length of output: 74
🏁 Script executed:
# Double-check if useQuery might be imported from anywhere else in the file
rg "import.*useQuery" mcpjam-inspector/client/src/App.tsx
Repository: MCPJam/inspector
Length of output: 42
Add useQuery to the import from convex/react.
Line 86 invokes useQuery, but it is not imported on line 1, causing a compile-time error.
Fix
-import { useConvexAuth } from "convex/react";
+import { useConvexAuth, useQuery } from "convex/react";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mcpjam-inspector/client/src/App.tsx` around lines 73 - 90, The file is
calling useQuery inside the hook useKeepCiEvalsOverviewWarm but useQuery is not
imported; update the module import from "convex/react" to include useQuery (add
useQuery to the existing import list alongside any other Convex hooks) so that
useQuery is available to the useKeepCiEvalsOverviewWarm function.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| expandJsonStrings | ||
| className="max-h-72" | ||
| /> | ||
| </div> |
Identical JSX duplicated across two conditional branches
Low Severity
The structured and renderAsBlock branches in renderArguments render byte-for-byte identical JSX (same JsonEditor with identical props and wrapping div). These two branches can be collapsed into a single condition like formattedValue.kind === "structured" || formattedValue.renderAsBlock, eliminating the duplication and reducing the risk of them drifting apart during future edits.
| } | ||
| } | ||
| return map; | ||
| }, [members]); |
Duplicated userMap computation across two tab components
Low Severity
The userMap useMemo computation that transforms workspace members into a Map<string, { name, imageUrl }> is duplicated verbatim between CiEvalsTab and EvalsTab. This identical logic (iterating members, checking m.userId && m.user, building the map) could be extracted into a shared hook or utility alongside the useWorkspaceMembers call to avoid maintaining two copies.
Additional Locations (1)
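A minimal sketch of the shared helper this suggests; the Member shape is assumed from the comment's description:

```typescript
// Hypothetical shape of a workspace member, inferred from the review text.
type Member = {
  userId?: string;
  user?: { name?: string; imageUrl?: string };
};

// Pure helper that both tabs could call from their useMemo (or that a shared
// hook could wrap), replacing the two duplicated map-building bodies.
function buildUserMap(
  members: Member[],
): Map<string, { name?: string; imageUrl?: string }> {
  const map = new Map<string, { name?: string; imageUrl?: string }>();
  for (const m of members) {
    if (m.userId && m.user) {
      map.set(m.userId, { name: m.user.name, imageUrl: m.user.imageUrl });
    }
  }
  return map;
}
```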
Actionable comments posted: 9
♻️ Duplicate comments (3)
sdk/src/eval-run-reporter.ts (1)
129-141: ⚠️ Potential issue | 🔴 Critical
Serialize reporter state transitions.
record(), flush(), and finalize() still mutate the same state across await points with no in-flight guard. Concurrent callers can start the run twice, append the same buffer twice, or finalize while a flush is still draining it. Please funnel add/record/flush/finalize through a single promise queue or mutex.
Also applies to: 202-259, 261-321
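A minimal sketch of such a promise-queue guard, using illustrative names rather than the SDK's actual API:

```typescript
// Serialize async mutations through one promise chain: each operation runs
// only after the previous one has settled, giving mutex-like semantics.
class OpQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(op: () => Promise<T>): Promise<T> {
    // Chain op after whatever is in flight, resolved or rejected.
    const result = this.tail.then(op, op);
    // Swallow rejections on the tail so one failed op doesn't poison the queue.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

// Hypothetical reporter wrapping each state-mutating method in the queue.
class SerializedReporter {
  private queue = new OpQueue();
  private buffered: string[] = [];

  record(result: string): Promise<void> {
    return this.queue.run(async () => {
      this.buffered.push(result);
    });
  }

  flush(): Promise<string[]> {
    return this.queue.run(async () => {
      const batch = this.buffered;
      this.buffered = []; // cleared only while holding the queue
      return batch;
    });
  }
}
```

Chaining every mutating method through one tail promise avoids explicit lock bookkeeping; the catch on the tail matters because a rejected operation must not block all later ones.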
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/eval-run-reporter.ts` around lines 129 - 141, The methods add, record, flush, and finalize on EvalRunReporter mutate shared state (buffered, addedCount, finalized) across await points and must be serialized to prevent concurrent modifications; implement a single async mutex or a promise queue inside the class (e.g., a private field like _opLock or _queuePromise) and make record, flush, finalize (and add if it can be called concurrently) acquire the lock before touching state and release it after the operation completes so only one of these operations runs at a time; ensure record awaits the lock, calls add, computes preview via chunkResultsForUpload, conditionally awaits flush while still holding the serialization lock (or sequences flush through the same queue), and have finalize wait for any in-flight flushes by using the same mutex/queue to prevent double-appends or finalizing during a flush.sdk/src/artifact-parsers/jest-json.ts (1)
39-46: ⚠️ Potential issue | 🟠 Major
Make Jest status handling exhaustive.
Jest's assertionResults[].status can be "passed", "failed", "skipped", "pending", "todo", "disabled", or "focused". Your code skips only the first four of these non-pass/fail statuses; the "focused" status is valid and can appear in serialized results, but silently gets treated as a failure. In an eval-reporting path, this corrupts pass/fail data. Handle all supported statuses explicitly and throw on unknowns.
Proposed fix
-      const status = assertion.status ?? "failed";
-      const isSkipped =
-        status === "skipped" ||
-        status === "pending" ||
-        status === "todo" ||
-        status === "disabled";
-      if (isSkipped) {
-        continue;
+      const status = assertion.status;
+      switch (status) {
+        case "passed":
+        case "failed":
+          break;
+        case "skipped":
+        case "pending":
+        case "todo":
+        case "disabled":
+        case "focused":
+          continue;
+        default:
+          throw new Error(
+            `Unsupported Jest assertion status: ${String(status)}`
+          );
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/src/artifact-parsers/jest-json.ts` around lines 39 - 46, The status handling for Jest assertions is not exhaustive: update the logic that reads assertion.status (the local variable status and the isSkipped check) to explicitly handle all known Jest statuses ("passed", "failed", "skipped", "pending", "todo", "disabled", "focused") instead of only treating some as skipped; treat "passed" and "failed" as their respective outcomes, treat the group ("skipped","pending","todo","disabled","focused") as non-counting/skipped (or however your domain expects focused to behave), and throw an error for any unknown status string so unexpected values fail fast; implement this by replacing the current isSkipped boolean with an explicit switch/case or exhaustive if/else over status (using assertion.status or status) and ensure unknown branches throw.mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx (1)
84-95: ⚠️ Potential issue | 🟠 Major
Fall back to resolvedToolOutput for widget metadata.
rawOutput is optional in this prop contract. If a caller only provides toolOutput, these fallbacks miss _meta and _serverId and can hit the failed-to-load path even though Line 88 already computed the normalized output.
🐛 Proposed fix
-  const effectiveToolMeta =
-    renderOverride?.toolMetadata ??
-    toolMetadata ??
-    readToolResultMeta(rawOutput);
   const resolvedToolOutput = toolOutput ?? rawOutput;
-  const uiType = detectUIType(effectiveToolMeta, rawOutput ?? toolOutput);
+  const effectiveToolMeta =
+    renderOverride?.toolMetadata ??
+    toolMetadata ??
+    readToolResultMeta(resolvedToolOutput);
+  const uiType = detectUIType(effectiveToolMeta, resolvedToolOutput);
   const uiResourceUri =
     renderOverride?.resourceUri ??
     getUIResourceUri(uiType, effectiveToolMeta);
   const serverId =
     renderOverride?.serverId ??
     getToolServerId(toolName, toolServerMap) ??
-    readToolResultServerId(rawOutput);
+    readToolResultServerId(resolvedToolOutput);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx` around lines 84 - 95, The fallback logic uses rawOutput directly which is optional; replace uses of rawOutput with the already-normalized resolvedToolOutput so metadata and server id resolve when only toolOutput is provided: compute effectiveToolMeta by calling readToolResultMeta(resolvedToolOutput) instead of readToolResultMeta(rawOutput), pass resolvedToolOutput into detectUIType(...) instead of rawOutput ?? toolOutput, and use readToolResultServerId(resolvedToolOutput) when computing serverId; keep existing fallbacks (renderOverride/toolMetadata/toolServerMap) otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`:
- Around line 98-113: The waiting banner wrongly shows for failed OpenAI widgets
because the toolState check in the UIType.OPENAI_SDK branch doesn't treat
"output-error" as a terminal state; update the condition that currently checks
toolState !== "output-available" && toolState !== "approval-requested" &&
toolState !== "output-denied" to also consider "output-error" as terminal (e.g.,
include toolState !== "output-error" in the combined check or refactor to check
inclusion in a terminalStates array) so the banner is not rendered when
toolState is "output-error".
In `@mcpjam-inspector/client/src/components/evals/iteration-details.tsx`:
- Around line 221-262: The merges for toolsWithSchema, toolServerMap, and
toolsMetadata use only tool.name so identical tool names across servers race;
update the code inside the listTools promise (serverNames.forEach ->
listTools(...).then(...)) to namespace keys with the serverId (e.g., use a
composite key like `${serverId}:${tool.name}`) when calling setToolsWithSchema,
setToolServerMap, and setToolsMetadata (and ensure you set the name field or a
separate lookup for display if you still want to show the plain tool.name). Make
the keying consistent across all three setters (setToolsWithSchema,
setToolServerMap, setToolsMetadata) so lookups are unambiguous and update any
consumers that read these maps to use the composite key or a stable
server-scoped lookup.
- Around line 118-140: resolveTraceModel builds a fallback ModelDefinition but
loses the original custom provider name by collapsing unknown providers to
provider: "custom"; update the returned fallback object (in resolveTraceModel)
to set customProviderName: provider when normalizeModelProvider(provider) ===
"custom" (or when provider is not in KNOWN_MODEL_PROVIDERS) so the original
provider string is preserved, while keeping provider:
normalizeModelProvider(provider), id: providerModelId, and name as currently
computed; use the existing symbols getModelById, normalizeModelProvider, and
ModelDefinition to implement this small conditional field addition.
In `@mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts`:
- Around line 304-337: The current shouldAttemptWidgetReplay incorrectly blocks
replay when toolsMetadata[toolName] is missing; update shouldAttemptWidgetReplay
(and its use of liveToolMetadata/getUIResourceUri) so that replay is allowed if
the trace itself is self-describing:
detectUIType(params.toolsMetadata[params.toolName], params.rawToolOutput)
indicates a widget UI or the snapshot/rawToolOutput contains _meta.ui and/or
_serverId (use existing readToolResultServerId and widgetSnapshotMap), and then
use historicalServerId when present to check connectedServerIds even if
liveToolMetadata is absent; only require liveServerId (from getToolServerId)
when comparing equality to historicalServerId, otherwise allow replay based on
historicalServerId being in params.connectedServerIds.
In `@sdk/src/artifact-parsers/junit-xml.ts`:
- Around line 3-11: The parser currently stores raw XML-escaped values; update
parseAttributes to decode XML entities before returning values (so attributes
like "a &amp; b" become "a & b") and ensure places that assign caseTitle, query,
externalIterationId, and error use the decoded strings; implement an
XML-unescape helper (referenced by parseAttributes) that converts &amp;, &lt;,
&gt;, &quot;, &apos; and numeric entities, call it inside parseAttributes when
setting attributes[key], and also apply it to any direct text/CDATA extraction
- Around line 37-64: The parser currently treats <skipped> as passed: false
(using isSkipped and passed) which turns skipped tests into failures; update the
logic in the junit-xml parser around the isSkipped/passed calculation (and the
subsequent errorMessage handling using failureMatch/skippedMatch) to exclude
skipped cases from being reported as failures—either return/skip creating an
EvalResultInput when isSkipped is true, or explicitly model a skipped result
(e.g., set a separate status or passed=true and a skipped flag) before mapping
to EvalResultInput; ensure the change touches the isSkipped/passed branch and
the errorMessage assignment (where failureMatch/skippedMatch are used) so
skipped tests are not converted into failed evals.
In `@sdk/src/eval-run-reporter.ts`:
- Around line 242-249: The current withUniqueExternalIterationIds logic
generates externalIterationId values only for the temporary uploadReady array so
retries or mixed batches can produce duplicates; modify
withUniqueExternalIterationIds (used before chunkResultsForUpload and
appendEvalRunIterations) to assign and persist IDs back into this.buffered so
the generatedIterationCount advances in sync with stored items, and change the
generation algorithm to derive missing IDs from the full batch (or from a set of
already-present externalIterationId values in this.buffered) rather than
counting only missing entries; ensure generatedIterationCount is updated only
once per newly assigned ID and that uploadReady reuses the persisted IDs from
this.buffered so subsequent retries do not regenerate conflicting IDs.
- Around line 251-258: The fallback path currently uses
buildLocalFallbackResult(), which summarizes this.buffered (the live upload
buffer) that may have been cleared after successful flushes; change it to base
the non-strict fallback on a cumulative, immutable summary of all added results
instead of this.buffered. Introduce or use an aggregate store (e.g.,
this.cumulativeResults or this.allBufferedEntries) that is appended to whenever
results are added/flush succeeds, update buildLocalFallbackResult() (and any
fallback uses at the other locations mentioned) to read from that cumulative
store, and ensure the catch blocks that set this.completedResult and
this.finalized use the cumulative summary so previously uploaded passing
iterations aren’t lost when this.input.strict is false.
- Around line 267-280: The finalize() path builds reportInput without preserving
this.expectedIterations, so small/one-shot runs lose expectedIterations set via
setExpectedIterations(); update finalize() to include this.expectedIterations in
the ReportEvalResultsInput (same field name used by startEvalRun()/flush())
before calling reportEvalResults(), ensuring reportInput contains
expectedIterations alongside suiteName, serverNames, results, etc., so metadata
matches the flush/startEvalRun path.
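The XML-unescape helper called for in the JUnit parser item above could look like this minimal sketch (helper name and entity coverage are assumptions; a real parser handles more cases):

```typescript
// Decode the five predefined XML entities plus numeric character references.
// `&amp;` is decoded last so that "&amp;lt;" yields the literal "&lt;"
// rather than "<".
function unescapeXml(value: string): string {
  return value
    .replace(/&#x([0-9a-fA-F]+);/g, (_, hex) =>
      String.fromCodePoint(parseInt(hex, 16)),
    )
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)))
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}
```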
---
Duplicate comments:
In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`:
- Around line 84-95: The fallback logic uses rawOutput directly which is
optional; replace uses of rawOutput with the already-normalized
resolvedToolOutput so metadata and server id resolve when only toolOutput is
provided: compute effectiveToolMeta by calling
readToolResultMeta(resolvedToolOutput) instead of readToolResultMeta(rawOutput),
pass resolvedToolOutput into detectUIType(...) instead of rawOutput ??
toolOutput, and use readToolResultServerId(resolvedToolOutput) when computing
serverId; keep existing fallbacks (renderOverride/toolMetadata/toolServerMap)
otherwise.
In `@sdk/src/artifact-parsers/jest-json.ts`:
- Around line 39-46: The status handling for Jest assertions is not exhaustive:
update the logic that reads assertion.status (the local variable status and the
isSkipped check) to explicitly handle all known Jest statuses ("passed",
"failed", "skipped", "pending", "todo", "disabled", "focused") instead of only
treating some as skipped; treat "passed" and "failed" as their respective
outcomes, treat the group ("skipped","pending","todo","disabled","focused") as
non-counting/skipped (or however your domain expects focused to behave), and
throw an error for any unknown status string so unexpected values fail fast;
implement this by replacing the current isSkipped boolean with an explicit
switch/case or exhaustive if/else over status (using assertion.status or status)
and ensure unknown branches throw.
In `@sdk/src/eval-run-reporter.ts`:
- Around line 129-141: The methods add, record, flush, and finalize on
EvalRunReporter mutate shared state (buffered, addedCount, finalized) across
await points and must be serialized to prevent concurrent modifications;
implement a single async mutex or a promise queue inside the class (e.g., a
private field like _opLock or _queuePromise) and make record, flush, finalize
(and add if it can be called concurrently) acquire the lock before touching
state and release it after the operation completes so only one of these
operations runs at a time; ensure record awaits the lock, calls add, computes
preview via chunkResultsForUpload, conditionally awaits flush while still
holding the serialization lock (or sequences flush through the same queue), and
have finalize wait for any in-flight flushes by using the same mutex/queue to
prevent double-appends or finalizing during a flush.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 412db82c-1aba-4fa2-a129-3d281de4a5e9
📒 Files selected for processing (11)
mcpjam-inspector/client/src/App.tsx
mcpjam-inspector/client/src/components/chat-v2/thread/part-switch.tsx
mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx
mcpjam-inspector/client/src/components/evals/__tests__/trace-viewer-adapter.test.ts
mcpjam-inspector/client/src/components/evals/iteration-details.tsx
mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts
mcpjam-inspector/client/src/lib/tool-result-utils.ts
sdk/src/artifact-parsers/jest-json.ts
sdk/src/artifact-parsers/junit-xml.ts
sdk/src/artifact-parsers/vitest-json.ts
sdk/src/eval-run-reporter.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- sdk/src/artifact-parsers/vitest-json.ts
- mcpjam-inspector/client/src/App.tsx
| if ( | ||
| uiType === UIType.OPENAI_SDK || | ||
| (uiType === UIType.OPENAI_SDK_AND_MCP_APPS && | ||
| selectedProtocolOverrideIfBothExists === UIType.OPENAI_SDK) | ||
| ) { | ||
| if ( | ||
| toolState !== "output-available" && | ||
| toolState !== "approval-requested" && | ||
| toolState !== "output-denied" | ||
| ) { | ||
| return ( | ||
| <div className="border border-border/40 rounded-md bg-muted/30 text-xs text-muted-foreground px-3 py-2"> | ||
| Waiting for tool to finish executing... | ||
| </div> | ||
| ); | ||
| } |
Treat output-error as terminal here.
This branch currently renders the “Waiting for tool to finish executing...” banner for failed OpenAI widgets, which makes a completed error look hung.
🐛 Proposed fix
if (
+ toolState === "output-error"
+ ) {
+ return null;
+ }
+
+ if (
toolState !== "output-available" &&
toolState !== "approval-requested" &&
toolState !== "output-denied"
) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mcpjam-inspector/client/src/components/chat-v2/thread/widget-replay.tsx`
around lines 98 - 113, The waiting banner wrongly shows for failed OpenAI
widgets because the toolState check in the UIType.OPENAI_SDK branch doesn't
treat "output-error" as a terminal state; update the condition that currently
checks toolState !== "output-available" && toolState !== "approval-requested" &&
toolState !== "output-denied" to also consider "output-error" as terminal (e.g.,
include toolState !== "output-error" in the combined check or refactor to check
inclusion in a terminalStates array) so the banner is not rendered when
toolState is "output-error".
| function resolveTraceModel( | ||
| iteration: EvalIteration, | ||
| testCase: EvalCase | null, | ||
| ): ModelDefinition { | ||
| const snapshotProvider = iteration.testCaseSnapshot?.provider; | ||
| const snapshotModel = iteration.testCaseSnapshot?.model; | ||
| const fallbackProvider = testCase?.models[0]?.provider; | ||
| const fallbackModel = testCase?.models[0]?.model; | ||
|
|
||
| const provider = snapshotProvider || fallbackProvider || "openai"; | ||
| const model = snapshotModel || fallbackModel || "unknown-model"; | ||
| const providerModelId = | ||
| model.startsWith(`${provider}/`) || !provider | ||
| ? model | ||
| : `${provider}/${model}`; | ||
|
|
||
| return ( | ||
| getModelById(providerModelId) ?? | ||
| getModelById(model) ?? { | ||
| id: providerModelId, | ||
| name: model.includes("/") ? model.split("/").slice(1).join("/") : model, | ||
| provider: normalizeModelProvider(provider), | ||
| } |
Preserve custom provider identity in the fallback model.
When provider is not in KNOWN_MODEL_PROVIDERS, this fallback collapses it to provider: "custom" and drops the original name. That loses attribution for custom providers, even though ModelDefinition has customProviderName specifically for that case (mcpjam-inspector/shared/types.ts:148-156).
Possible fix
function resolveTraceModel(
iteration: EvalIteration,
testCase: EvalCase | null,
): ModelDefinition {
@@
- return (
+ const resolvedProvider = normalizeModelProvider(provider);
+
+ return (
getModelById(providerModelId) ??
getModelById(model) ?? {
id: providerModelId,
name: model.includes("/") ? model.split("/").slice(1).join("/") : model,
- provider: normalizeModelProvider(provider),
+ provider: resolvedProvider,
+ ...(resolvedProvider === "custom" &&
+ provider &&
+ provider !== "custom"
+ ? { customProviderName: provider }
+ : {}),
}
);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mcpjam-inspector/client/src/components/evals/iteration-details.tsx` around
lines 118 - 140, resolveTraceModel builds a fallback ModelDefinition but loses
the original custom provider name by collapsing unknown providers to provider:
"custom"; update the returned fallback object (in resolveTraceModel) to set
customProviderName: provider when normalizeModelProvider(provider) === "custom"
(or when provider is not in KNOWN_MODEL_PROVIDERS) so the original provider
string is preserved, while keeping provider: normalizeModelProvider(provider),
id: providerModelId, and name as currently computed; use the existing symbols
getModelById, normalizeModelProvider, and ModelDefinition to implement this
small conditional field addition.
| serverNames.forEach((serverId) => { | ||
| void listTools({ serverId }) | ||
| .then((result) => { | ||
| if (cancelled) return; | ||
|
|
||
| setConnectedServerIds((prev) => | ||
| prev.includes(serverId) ? prev : [...prev, serverId], | ||
| ); | ||
|
|
||
| if (result.tools?.length) { | ||
| setToolsWithSchema((prev) => { | ||
| const next = { ...prev }; | ||
| for (const tool of result.tools ?? []) { | ||
| next[tool.name] = { | ||
| name: tool.name, | ||
| inputSchema: tool.inputSchema, | ||
| }; | ||
| } | ||
| return next; | ||
| }); | ||
|
|
||
| // Extract metadata | ||
| const toolsMetadata = result.toolsMetadata ?? {}; | ||
| for (const [toolName, meta] of Object.entries(toolsMetadata)) { | ||
| metadata[toolName] = meta as Record<string, unknown>; | ||
| setToolServerMap((prev) => { | ||
| const next = { ...prev }; | ||
| for (const tool of result.tools ?? []) { | ||
| next[tool.name] = serverId; | ||
| } | ||
| } catch (error) { | ||
| // Silently fail for disconnected servers | ||
| console.warn( | ||
| `Failed to fetch tools for server ${serverId}:`, | ||
| error, | ||
| ); | ||
| } | ||
| }), | ||
| ); | ||
|
|
||
| setToolsWithSchema(toolsMap); | ||
| setToolsMetadata(metadata); | ||
| setToolServerMap(toolServerMap); | ||
| } catch (error) { | ||
| // Silently fail if servers aren't connected | ||
| // This is expected in evals where servers may not be running | ||
| setToolsMetadata({}); | ||
| setToolServerMap({}); | ||
| setToolsWithSchema({}); | ||
| } | ||
| return next; | ||
| }); | ||
| } | ||
|
|
||
| if (result.toolsMetadata) { | ||
| setToolsMetadata((prev) => ({ | ||
| ...prev, | ||
| ...Object.fromEntries( | ||
| Object.entries(result.toolsMetadata ?? {}).map( | ||
| ([toolName, meta]) => [ | ||
| toolName, | ||
| meta as Record<string, unknown>, | ||
| ], | ||
| ), | ||
| ), | ||
| })); |
Bare tool names make multi-server metadata nondeterministic.
These merges key toolsWithSchema, toolServerMap, and toolsMetadata only by tool.name. If two connected servers expose the same tool name, the last request to settle wins, so schema rendering and server attribution can silently point at the wrong server. Please namespace these maps by server or use a stable composite key.
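A sketch of the composite-key approach (key format and record shapes are assumed for illustration):

```typescript
// Namespace tool entries by server so two servers exposing the same tool
// name can no longer overwrite each other (last-writer-wins).
const toolKey = (serverId: string, toolName: string): string =>
  `${serverId}:${toolName}`;

function mergeTools(
  prev: Record<string, { name: string; serverId: string }>,
  serverId: string,
  tools: Array<{ name: string }>,
): Record<string, { name: string; serverId: string }> {
  const next = { ...prev };
  for (const tool of tools) {
    // The plain name is kept on the record for display; the map key is scoped.
    next[toolKey(serverId, tool.name)] = { name: tool.name, serverId };
  }
  return next;
}
```

Consumers would look tools up by the composite key (or by a server-scoped index), so attribution stays deterministic regardless of which server's listTools call settles last.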
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mcpjam-inspector/client/src/components/evals/iteration-details.tsx` around
lines 221 - 262, The merges for toolsWithSchema, toolServerMap, and
toolsMetadata use only tool.name so identical tool names across servers race;
update the code inside the listTools promise (serverNames.forEach ->
listTools(...).then(...)) to namespace keys with the serverId (e.g., use a
composite key like `${serverId}:${tool.name}`) when calling setToolsWithSchema,
setToolServerMap, and setToolsMetadata (and ensure you set the name field or a
separate lookup for display if you still want to show the plain tool.name). Make
the keying consistent across all three setters (setToolsWithSchema,
setToolServerMap, setToolsMetadata) so lookups are unambiguous and update any
consumers that read these maps to use the composite key or a stable
server-scoped lookup.
```typescript
function shouldAttemptWidgetReplay(params: {
  toolName: string;
  toolCallId: string;
  rawToolOutput: unknown;
  toolsMetadata: Record<string, Record<string, any>>;
  toolServerMap: ToolServerMap;
  connectedServerIds: Set<string>;
  widgetSnapshotMap: Map<string, TraceWidgetSnapshot>;
}) {
  const liveToolMetadata = params.toolsMetadata[params.toolName];
  const liveUiType = detectUIType(liveToolMetadata, params.rawToolOutput);
  const liveServerId = getToolServerId(params.toolName, params.toolServerMap);
  const snapshot = params.widgetSnapshotMap.get(params.toolCallId);
  const historicalServerId =
    snapshot?.serverId ?? readToolResultServerId(params.rawToolOutput);

  const hasCurrentWidgetResolution =
    !!liveToolMetadata &&
    !!liveServerId &&
    isWidgetUiType(liveUiType) &&
    !!getUIResourceUri(liveUiType, liveToolMetadata);

  if (!hasCurrentWidgetResolution || !liveServerId) {
    return false;
  }

  if (historicalServerId) {
    return (
      liveServerId === historicalServerId &&
      params.connectedServerIds.has(historicalServerId)
    );
  }

  return params.connectedServerIds.has(liveServerId);
}
```
Don’t require live tool catalog metadata to replay a self-describing trace.
A trace result can already carry `_meta.ui` and `_serverId`, and the downstream rendering path knows how to use those fields. With the current gate, those traces get scrubbed whenever `toolsMetadata[toolName]` is missing, even if the historical server is still connected.
🐛 Proposed fix
```diff
 function shouldAttemptWidgetReplay(params: {
   toolName: string;
   toolCallId: string;
   rawToolOutput: unknown;
   toolsMetadata: Record<string, Record<string, any>>;
   toolServerMap: ToolServerMap;
   connectedServerIds: Set<string>;
   widgetSnapshotMap: Map<string, TraceWidgetSnapshot>;
 }) {
-  const liveToolMetadata = params.toolsMetadata[params.toolName];
-  const liveUiType = detectUIType(liveToolMetadata, params.rawToolOutput);
-  const liveServerId = getToolServerId(params.toolName, params.toolServerMap);
+  const streamedToolMetadata = readToolResultMeta(params.rawToolOutput);
+  const liveToolMetadata = params.toolsMetadata[params.toolName];
+  const effectiveToolMetadata = liveToolMetadata ?? streamedToolMetadata;
+  const liveUiType = detectUIType(
+    effectiveToolMetadata,
+    params.rawToolOutput,
+  );
+  const liveServerId =
+    getToolServerId(params.toolName, params.toolServerMap) ??
+    readToolResultServerId(params.rawToolOutput);
   const snapshot = params.widgetSnapshotMap.get(params.toolCallId);
   const historicalServerId =
     snapshot?.serverId ?? readToolResultServerId(params.rawToolOutput);
   const hasCurrentWidgetResolution =
-    !!liveToolMetadata &&
+    !!effectiveToolMetadata &&
     !!liveServerId &&
     isWidgetUiType(liveUiType) &&
-    !!getUIResourceUri(liveUiType, liveToolMetadata);
+    !!getUIResourceUri(liveUiType, effectiveToolMetadata);
```
Verify each finding against the current code and only fix it if needed.
In `@mcpjam-inspector/client/src/components/evals/trace-viewer-adapter.ts` around
lines 304 - 337, The current shouldAttemptWidgetReplay incorrectly blocks replay
when toolsMetadata[toolName] is missing; update shouldAttemptWidgetReplay (and
its use of liveToolMetadata/getUIResourceUri) so that replay is allowed if the
trace itself is self-describing:
detectUIType(params.toolsMetadata[params.toolName], params.rawToolOutput)
indicates a widget UI or the snapshot/rawToolOutput contains _meta.ui and/or
_serverId (use existing readToolResultServerId and widgetSnapshotMap), and then
use historicalServerId when present to check connectedServerIds even if
liveToolMetadata is absent; only require liveServerId (from getToolServerId)
when comparing equality to historicalServerId, otherwise allow replay based on
historicalServerId being in params.connectedServerIds.
```typescript
function parseAttributes(raw: string): Record<string, string> {
  const attributes: Record<string, string> = {};
  const attrRegex = /([A-Za-z_:][\w:.-]*)=(["'])([\s\S]*?)\2/g;
  for (const match of raw.matchAll(attrRegex)) {
    const key = match[1];
    const value = match[3] ?? "";
    attributes[key] = value;
  }
  return attributes;
}
```
Decode XML entities before storing titles and errors.
Valid JUnit XML escapes characters like `&`, `<`, `>`, and quotes. Right now those escaped forms flow straight into `caseTitle`, `query`, `externalIterationId`, and `error`, so a name such as `a & b` (stored as `a &amp; b`) and entity-escaped failure text will be surfaced verbatim in escaped form instead of as the real message.
Also applies to: 29-31, 53-59
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/src/artifact-parsers/junit-xml.ts` around lines 3 - 11, The parser
currently stores raw XML-escaped values; update parseAttributes to decode XML
entities before returning values (so attributes like "a &amp; b" become "a & b")
and ensure places that assign caseTitle, query, externalIterationId, and error
use the decoded strings; implement an XML-unescape helper that converts
`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`, and numeric
entities, call it inside parseAttributes when setting attributes[key], and also
apply it to any direct text/CDATA extraction sites where caseTitle, query,
externalIterationId, or error are populated (so all these fields receive
decoded, human-readable text).
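A minimal sketch of the entity-decoding helper this prompt describes; a production parser may want broader coverage, and the decode order matters (`&amp;` must be handled last so doubly-escaped input like `&amp;lt;` decodes to `&lt;`, not `<`):

```typescript
// Hypothetical helper, not the SDK's actual implementation.
function unescapeXml(value: string): string {
  return value
    .replace(/&#x([0-9a-fA-F]+);/g, (_, hex) =>
      String.fromCodePoint(parseInt(hex, 16)),
    )
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(Number(dec)))
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // last, so already-decoded text is not re-decoded
}

console.log(unescapeXml("a &amp; b")); // a & b
console.log(unescapeXml("&lt;failure&gt;")); // <failure>
```

`parseAttributes` would then return `unescapeXml(value)` instead of the raw attribute text.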
```typescript
  const hasFailure = /<failure\b/i.test(body);
  const hasError = /<error\b/i.test(body);
  const isSkipped = /<skipped\b/i.test(body);
  const passed = !hasFailure && !hasError && !isSkipped;

  let errorMessage: string | undefined;
  if (!passed) {
    // Match paired or self-closing failure/error/skipped elements
    const failureMatch = body.match(
      /<(failure|error)\b([^>]*?)(?:>([\s\S]*?)<\/\1>|\/?>)/i
    );
    const skippedMatch = body.match(
      /<skipped\b([^>]*?)(?:>([\s\S]*?)<\/skipped>|\/?>)/i
    );

    if (failureMatch) {
      const attrs = parseAttributes(failureMatch[2] ?? "");
      const innerText = failureMatch[3]?.trim();
      errorMessage = attrs.message || innerText;
    } else if (skippedMatch) {
      const attrs = parseAttributes(skippedMatch[1] ?? "");
      const innerText = skippedMatch[2]?.trim();
      errorMessage = attrs.message || innerText;
    }

    if (!errorMessage) {
      errorMessage = isSkipped ? "Test skipped" : "JUnit testcase failed";
    }
```
Do not turn skipped JUnit cases into failed evals.
Unlike `sdk/src/artifact-parsers/jest-json.ts:38-42`, this parser reports `<skipped>` as `passed: false`. That will make a skipped CI test look like an actual eval failure and skew suite status in MCPJam. Skip these cases entirely, or model them separately before mapping to `EvalResultInput`.
Suggested fix
```diff
   const hasFailure = /<failure\b/i.test(body);
   const hasError = /<error\b/i.test(body);
   const isSkipped = /<skipped\b/i.test(body);
-  const passed = !hasFailure && !hasError && !isSkipped;
+  if (isSkipped) {
+    continue;
+  }
+  const passed = !hasFailure && !hasError;
```
+ const passed = !hasFailure && !hasError;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const hasFailure = /<failure\b/i.test(body); | |
| const hasError = /<error\b/i.test(body); | |
| const isSkipped = /<skipped\b/i.test(body); | |
| const passed = !hasFailure && !hasError && !isSkipped; | |
| let errorMessage: string | undefined; | |
| if (!passed) { | |
| // Match paired or self-closing failure/error/skipped elements | |
| const failureMatch = body.match( | |
| /<(failure|error)\b([^>]*?)(?:>([\s\S]*?)<\/\1>|\/?>)/i | |
| ); | |
| const skippedMatch = body.match( | |
| /<skipped\b([^>]*?)(?:>([\s\S]*?)<\/skipped>|\/?>)/i | |
| ); | |
| if (failureMatch) { | |
| const attrs = parseAttributes(failureMatch[2] ?? ""); | |
| const innerText = failureMatch[3]?.trim(); | |
| errorMessage = attrs.message || innerText; | |
| } else if (skippedMatch) { | |
| const attrs = parseAttributes(skippedMatch[1] ?? ""); | |
| const innerText = skippedMatch[2]?.trim(); | |
| errorMessage = attrs.message || innerText; | |
| } | |
| if (!errorMessage) { | |
| errorMessage = isSkipped ? "Test skipped" : "JUnit testcase failed"; | |
| } | |
| const hasFailure = /<failure\b/i.test(body); | |
| const hasError = /<error\b/i.test(body); | |
| const isSkipped = /<skipped\b/i.test(body); | |
| if (isSkipped) { | |
| continue; | |
| } | |
| const passed = !hasFailure && !hasError; | |
| let errorMessage: string | undefined; | |
| if (!passed) { | |
| // Match paired or self-closing failure/error/skipped elements | |
| const failureMatch = body.match( | |
| /<(failure|error)\b([^>]*?)(?:>([\s\S]*?)<\/\1>|\/?>)/i | |
| ); | |
| const skippedMatch = body.match( | |
| /<skipped\b([^>]*?)(?:>([\s\S]*?)<\/skipped>|\/?>)/i | |
| ); | |
| if (failureMatch) { | |
| const attrs = parseAttributes(failureMatch[2] ?? ""); | |
| const innerText = failureMatch[3]?.trim(); | |
| errorMessage = attrs.message || innerText; | |
| } else if (skippedMatch) { | |
| const attrs = parseAttributes(skippedMatch[1] ?? ""); | |
| const innerText = skippedMatch[2]?.trim(); | |
| errorMessage = attrs.message || innerText; | |
| } | |
| if (!errorMessage) { | |
| errorMessage = isSkipped ? "Test skipped" : "JUnit testcase failed"; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/src/artifact-parsers/junit-xml.ts` around lines 37 - 64, The parser
currently treats <skipped> as passed: false (using isSkipped and passed) which
turns skipped tests into failures; update the logic in the junit-xml parser
around the isSkipped/passed calculation (and the subsequent errorMessage
handling using failureMatch/skippedMatch) to exclude skipped cases from being
reported as failures—either return/skip creating an EvalResultInput when
isSkipped is true, or explicitly model a skipped result (e.g., set a separate
status or passed=true and a skipped flag) before mapping to EvalResultInput;
ensure the change touches the isSkipped/passed branch and the errorMessage
assignment (where failureMatch/skippedMatch are used) so skipped tests are not
converted into failed evals.
```typescript
    const uploadReady = this.withUniqueExternalIterationIds(this.buffered);
    const chunks = chunkResultsForUpload(uploadReady);
    for (const chunk of chunks) {
      await appendEvalRunIterations(this.runtimeConfig, {
        runId: this.runId,
        results: chunk,
      });
    }
```
Keep externalIterationId generation stable and truly unique.
The generated IDs live only in `uploadReady`, not in `this.buffered`. If one chunk succeeds and a later chunk fails, a retry regenerates a fresh ID range because `generatedIterationCount` has already advanced. The helper also counts only missing entries, so a mixed batch like `[{ externalIterationId: "${externalRunId}-1" }, {}]` becomes a duplicate on the first flush. Persist the assigned IDs back into `this.buffered`, and derive missing IDs from the full batch or an existing-ID set rather than the missing-only counter.
Also applies to: 337-349
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/src/eval-run-reporter.ts` around lines 242 - 249, The current
withUniqueExternalIterationIds logic generates externalIterationId values only
for the temporary uploadReady array so retries or mixed batches can produce
duplicates; modify withUniqueExternalIterationIds (used before
chunkResultsForUpload and appendEvalRunIterations) to assign and persist IDs
back into this.buffered so the generatedIterationCount advances in sync with
stored items, and change the generation algorithm to derive missing IDs from the
full batch (or from a set of already-present externalIterationId values in
this.buffered) rather than counting only missing entries; ensure
generatedIterationCount is updated only once per newly assigned ID and that
uploadReady reuses the persisted IDs from this.buffered so subsequent retries do
not regenerate conflicting IDs.
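A sketch of collision-safe assignment along the lines the prompt suggests: derive candidate IDs from the set of IDs already present in the batch, and mutate the buffered entries in place so retries reuse the same IDs. Names here (`assignIterationIds`, `Result`) are illustrative, not the SDK's actual API:

```typescript
type Result = { externalIterationId?: string };

function assignIterationIds(results: Result[], externalRunId: string): void {
  // Seed the used-ID set from the whole batch, not just missing entries.
  const used = new Set(
    results
      .map((r) => r.externalIterationId)
      .filter((id): id is string => !!id),
  );
  let counter = 1;
  for (const result of results) {
    if (result.externalIterationId) continue;
    let candidate = `${externalRunId}-${counter}`;
    while (used.has(candidate)) {
      counter += 1;
      candidate = `${externalRunId}-${counter}`;
    }
    // Persisting onto the buffered entry keeps the ID stable across retries.
    result.externalIterationId = candidate;
    used.add(candidate);
    counter += 1;
  }
}

const batch: Result[] = [{ externalIterationId: "run-1" }, {}];
assignIterationIds(batch, "run");
console.log(batch.map((r) => r.externalIterationId)); // run-1, run-2 — no duplicate
```

Because the assignment lives on the buffered objects themselves, a failed chunk upload can be retried without generating a conflicting ID range.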
```typescript
    } catch (error) {
      if (this.input.strict) {
        throw error;
      }
      this.completedResult = this.buildLocalFallbackResult();
      this.finalized = true;
      this.buffered = [];
    }
```
Base the non-strict fallback on cumulative results, not the live buffer.
`buildLocalFallbackResult()` only summarizes `this.buffered`, which is cleared after every successful flush. Once any incremental upload has succeeded, a later non-strict failure can return `total: 0` and even `result: "failed"` for a run that already uploaded passing iterations. Keep a running summary, or an immutable copy of all added results, separate from the upload buffer.
Also applies to: 312-319, 364-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/src/eval-run-reporter.ts` around lines 251 - 258, The fallback path
currently uses buildLocalFallbackResult(), which summarizes this.buffered (the
live upload buffer) that may have been cleared after successful flushes; change
it to base the non-strict fallback on a cumulative, immutable summary of all
added results instead of this.buffered. Introduce or use an aggregate store
(e.g., this.cumulativeResults or this.allBufferedEntries) that is appended to
whenever results are added/flush succeeds, update buildLocalFallbackResult()
(and any fallback uses at the other locations mentioned) to read from that
cumulative store, and ensure the catch blocks that set this.completedResult and
this.finalized use the cumulative summary so previously uploaded passing
iterations aren’t lost when this.input.strict is false.
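One way to sketch the "running summary separate from the upload buffer" idea; `RunningSummary` and its field names are illustrative, not the reporter's real types:

```typescript
type EvalResultInput = { passed: boolean };

// Accumulates every result ever added, independent of the flushable buffer,
// so a non-strict fallback can summarize the whole run.
class RunningSummary {
  private total = 0;
  private passedCount = 0;

  record(result: EvalResultInput): void {
    this.total += 1;
    if (result.passed) this.passedCount += 1;
  }

  snapshot() {
    return {
      total: this.total,
      passed: this.passedCount,
      // Zero-total handling is up to the caller; this treats an empty run as passed.
      result: this.passedCount === this.total ? "passed" : "failed",
    };
  }
}

const summary = new RunningSummary();
[{ passed: true }, { passed: true }].forEach((r) => summary.record(r));
// Even if the upload buffer were cleared after a flush, the summary still
// reflects both iterations: total 2, passed 2, result "passed".
console.log(summary.snapshot());
```

The reporter would call `summary.record(result)` wherever it appends to the buffer, and base `buildLocalFallbackResult()` on `summary.snapshot()`.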
```typescript
    if (!this.runId) {
      const reportInput: ReportEvalResultsInput = {
        suiteName: this.input.suiteName,
        suiteDescription: this.input.suiteDescription,
        serverNames: this.input.serverNames,
        notes: this.input.notes,
        passCriteria: this.input.passCriteria,
        externalRunId: this.externalRunId,
        ci: this.withoutCiProvider(this.input.ci),
        apiKey: this.input.apiKey,
        baseUrl: this.input.baseUrl,
        strict: this.input.strict,
        results: this.buffered,
      };
```
Preserve expectedIterations in the one-shot path.
`flush()` forwards `this.expectedIterations` to `startEvalRun()`, but `finalize()` drops it when it builds `reportInput` for `reportEvalResults(...)`. Small suites that never cross the auto-flush threshold will therefore report different metadata, and `setExpectedIterations()` becomes ineffective unless a flush happened first.
Suggested patch
```diff
     const reportInput: ReportEvalResultsInput = {
       suiteName: this.input.suiteName,
       suiteDescription: this.input.suiteDescription,
       serverNames: this.input.serverNames,
       notes: this.input.notes,
       passCriteria: this.input.passCriteria,
       externalRunId: this.externalRunId,
       ci: this.withoutCiProvider(this.input.ci),
+      expectedIterations: this.expectedIterations,
       apiKey: this.input.apiKey,
       baseUrl: this.input.baseUrl,
       strict: this.input.strict,
       results: this.buffered,
     };
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```typescript
    if (!this.runId) {
      const reportInput: ReportEvalResultsInput = {
        suiteName: this.input.suiteName,
        suiteDescription: this.input.suiteDescription,
        serverNames: this.input.serverNames,
        notes: this.input.notes,
        passCriteria: this.input.passCriteria,
        externalRunId: this.externalRunId,
        ci: this.withoutCiProvider(this.input.ci),
        expectedIterations: this.expectedIterations,
        apiKey: this.input.apiKey,
        baseUrl: this.input.baseUrl,
        strict: this.input.strict,
        results: this.buffered,
      };
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/src/eval-run-reporter.ts` around lines 267 - 280, The finalize() path
builds reportInput without preserving this.expectedIterations, so small/one-shot
runs lose expectedIterations set via setExpectedIterations(); update finalize()
to include this.expectedIterations in the ReportEvalResultsInput (same field
name used by startEvalRun()/flush()) before calling reportEvalResults(),
ensuring reportInput contains expectedIterations alongside suiteName,
serverNames, results, etc., so metadata matches the flush/startEvalRun path.
- Restore "Evals" sidebar entry (#evals) alongside new "CI Evals" (#ci-evals)
- Add missing selectedIterationId and onSelectIteration props to RunDetailView in SuiteIterationsView (required after #1541 refactored RunDetailView)
- Add iteration auto-selection and URL-based routing for the original Evals tab, matching the behavior already implemented in CiSuiteDetail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Note
Medium Risk
Moderate UI and data-display expansion plus changes to eval result derivation (honoring `resultSource: "reported"`) and widget replay/security defaults; regressions could affect pass/fail reporting or widget rendering in the Inspector.

Overview
Updates SDK docs to cover MCPJam eval auto-reporting + manual reporting APIs, adds a new Eval Reporting reference page, and refreshes `EvalTest`/`EvalSuite` docs to reflect updated configs, return types, run options, and MCPJam upload behavior.

Adds a new Inspector CI Evals tab (hash-routed) for viewing SDK-ingested eval suites/runs in a workspace, including CI metadata display, read-only run/test drilldowns, and better iteration/trace UX (formatted/raw tool calls, model resolution, and widget snapshot replay via the new `WidgetReplay` component and trace adapter). Several chat/widget components now support a non-interactive mode, persisted render overrides (CSP/permissions/permissive/border), and cached-widget replay hardening (force permissive for cached HTML). Also adds "Copy markdown for server evals" to the server connection menu and threads workspace info into Settings.

Written by Cursor Bugbot for commit 2b3e770. This will update automatically on new commits.