Skip to content

feat(mcp): surface run_id in run-backed tool results for citation (DRC-3532)#1418

Open
iamcxa wants to merge 5 commits into
mainfrom
feature/drc-3532-mcp-expose-run-id
Open

feat(mcp): surface run_id in run-backed tool results for citation (DRC-3532)#1418
iamcxa wants to merge 5 commits into
mainfrom
feature/drc-3532-mcp-expose-run-id

Conversation

@iamcxa

@iamcxa iamcxa commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Closes DRC-3532.

Why (cross-repo, DRC-3532)

The recce-cloud summary agent wants to deep-link each factual statement in its PR
summary to the exact run it executed, via deterministic {{run:<run_id>}} inline
markers (server-side marker replacement already exists in recce-cloud-infra's
recce_task_func). But the agent never received the run_id: the cloud-backend
MCP tools that back the agent's analysis (row_count_diff, profile_diff,
value_diff, query, query_diff, top_k_diff, histogram_diff) route through
_tool_run_backed, which returned only run["result"] and dropped run_id.

Without run_id the agent cannot cite runs deterministically, forcing fragile
server-side fuzzy prose matching (low/unstable coverage, occasional wrong links).

What

_tool_run_backed now merges run_id into the result dict (additive — existing
result fields preserved). Only added when the response actually carries a run_id
and the result is a dict, so run-less or non-dict responses are untouched (the
agent must never be handed, or synthesize, a run_id it was not given).

Test

tests/test_mcp_cloud_backend.py:

  • new: run-backed result surfaces run_id (merged, fields preserved)
  • new: run-less response leaves the result untouched (no invented run_id)
  • updated: the existing run-tool routing test now expects the surfaced run_id

40 cloud-backend tests + 123 local mcp_server tests green.

Paired change

recce-cloud-infra: the summary agent prompt emits {{run:<run_id>}} markers
using this run_id; fuzzy linkify is demoted to a legacy fallback. Tracked under
DRC-3532. Durable structured-citation design is DRC-3634.

🤖 Generated with Claude Code


Consumed by (cross-repo)

recce-cloud-infra PR DataRecce/recce-cloud-infra#1427 (Summary v2: trust architecture — value-grounding + run citations) depends on this PR. The instance-launcher image bakes the recce build from this branch; #1427's run-citation deep links ([value](…/runs/<id>)) and profile_diff-grounded row counts require the changes here. Suggested merge order: this PR → recce release/image → #1427.

Also on this branch — profile_diff lowercase-column fix (DRC-3674 B)

Commit 889a01b7 (fix(profile): resolve lowercase columns to physical names on case-folding warehouses): ProfileDiffTask / ProfileTask filtered requested columns with a case-sensitive column.name in selected_columns test. On Snowflake (physical names UPPERCASE) the agent's lowercase column names matched nothing → the filter dropped every column → empty profile → the cloud summary suppressed the whole analysis. Fix lowercases both sides of the membership test (no-op on duckdb; SQL path unchanged). Regression test reproduces the empty-profile case with an uppercase-header duckdb table. This unblocks #1427's CLV/profile grounding (which was intermittently empty in UAT depending on the LLM's column casing).

…C-3532)

The cloud-backend MCP tools (row_count_diff, profile_diff, value_diff, query,
query_diff, top_k_diff, histogram_diff) routed through _tool_run_backed returned
only run["result"], dropping run_id. The recce-cloud summary agent therefore
never saw the run_id and could not emit deterministic {{run:<run_id>}} citation
markers, forcing fragile server-side fuzzy prose matching.

Merge run_id into the result dict (additive; existing result fields preserved).
Only added when the response carries a run_id and the result is a dict, so
run-less or non-dict responses are untouched (never synthesize a run_id).

Cross-repo (DRC-3532): the recce-cloud summary agent prompt is updated to emit
the markers; server-side marker replacement already exists in recce_task_func.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
@codecov

codecov Bot commented Jun 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.26425% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
recce/mcp_server.py 66.66% 13 Missing ⚠️
Files with missing lines Coverage Δ
recce/tasks/profile.py 83.33% <100.00%> (+0.28%) ⬆️
tests/tasks/test_profile.py 99.13% <100.00%> (+0.21%) ⬆️
tests/test_mcp_cloud_backend.py 100.00% <100.00%> (ø)
tests/test_mcp_server.py 99.87% <100.00%> (+<0.01%) ⬆️
recce/mcp_server.py 90.12% <66.66%> (-1.15%) ⬇️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

iamcxa and others added 2 commits June 13, 2026 00:05
…s state

Local-mode ad-hoc diff tools (row_count_diff, profile_diff, query,
query_diff, value_diff) now route through _tool_run_backed_local, which
uses the same submit_run machinery as the recce server's run_router to
persist Runs to default_context().runs.  The tool result carries run_id
alongside the diff output, matching the CloudBackend._tool_run_backed
shape (commit 98670f9).

When default_context() is None (cloud SSE path already handled by
CloudBackend), the method falls back to direct task execution without
run_id — identical to pre-DRC-3634 behaviour, no double execution.

The MCP server remains the sole state exporter; runs created mid-session
via _tool_run_backed_local are included in the exported payload because
they land in the same in-process RecceContext.runs list that
export_state() serialises.

Tests: 12 new tests in TestLocalModeRunBacked covering run_id presence,
run persistence in context.runs, run type mapping (query/query_base), and
the isolation boundary (value_diff_detail stays outside run-backed scope).
175 MCP tests green (135 server + 40 cloud backend); ruff clean.

Signed-off-by: Kent <iamcxa@gmail.com>
…ding warehouses

profile_diff/profile filtered the requested `columns` with a case-sensitive
membership test (`column.name in selected_columns`). get_columns() returns
physical catalog names, which case-folding warehouses (e.g. Snowflake) store
UPPERCASE, while the Recce Cloud summary agent supplies lowercase
manifest-convention names. The exact-case filter dropped every column, so the
per-column profiling loop never ran and the tool returned a completely empty
result ({"base":{"columns":[],"data":[]},"current":{"columns":[],"data":[]}}).
This made the cloud summary non-deterministic: lowercase column names yielded an
empty profile that the downstream trust gate suppressed.

Fix: lowercase both sides of the membership test before filtering, mirroring the
case-insensitive normalisation in valuediff's _build_column_case_lookup. The SQL
path is unchanged — it still profiles via the physical column.name — so lowercase
input now resolves to the same physical (uppercase) columns and returns the same
non-empty profile as uppercase input. No-op on already-lowercase duckdb models and
on quoted/case-sensitive identifiers; applied to both ProfileDiffTask and
ProfileTask.

Adds regression tests using an UPPERCASE duckdb CSV header (duckdb preserves
header casing, reproducing Snowflake physical-name casing end-to-end with no
mocking): asserts lowercase and uppercase column requests return identical
non-empty profiles, plus an adapter-safety no-op test on lowercase physical names.

DRC-3674

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent <iamcxa@gmail.com>
@iamcxa iamcxa requested a review from kentwelcome June 15, 2026 04:05
iamcxa added 2 commits June 15, 2026 12:08
…sertions

test_tool_row_count_diff and test_tool_query_with_base_flag asserted exact
equality with the mock result, but run-backed local mode (DRC-3532, _tool_run_backed_local)
now merges a run_id key into the tool result. Compare on the result fields
ignoring run_id; the dedicated TestLocalModeRunBacked tests cover run_id surfacing.
Fixes the Test DBT Versions CI failure on PR #1418. 135 test_mcp_server tests green.

Signed-off-by: Kent <iamcxa@gmail.com>

@kentwelcome kentwelcome left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review: 1 blocker (silent error swallowing in local run-backed tools), 2 issues. See review comment.

@kentwelcome

kentwelcome commented Jun 15, 2026

Copy link
Copy Markdown
Member

Code Review: PR #1418

SHA 2e8dd51d · Verdict NO-GO

Blockers

  1. recce/mcp_server.py:613-624 — Failed runs return a success-shaped result, silently swallowing the error.
    Evidence: submit_run.fn (recce/apis/run_func.py:245-258) catches every exception except DuckDBExternalAccessBlocked, stores it in run.error, and return None — so await asyncio.wrap_future(future) (line 611) never raises. _tool_run_backed_local then reads only run.result (None → {}) and never inspects run.status/run.error, returning {"run_id": ...}. Before this PR, task.execute() raised and call_tool re-raised (mcp_server.py:1473) so the MCP SDK set isError=True. Now a failing query/query_diff/profile_diff/value_diff/row_count_diff in local mode (bad SQL, missing model, permission denied) returns an empty success — the agent loses both the error and the _classify_db_error classification (TABLE_NOT_FOUND / PERMISSION_DENIED / SYNTAX_ERROR), and a citation agent can report "empty/no diff" for a query that actually errored.
    Pass D.

    Concrete example — agent calls query_diff with a non-existent column:

    query_diff({"sql_template": "SELECT bad_col FROM {{ ref('orders') }}", "primary_keys": ["id"]})
    
    • Before: QueryDiffTask.execute() raises duckdb BinderException('Referenced column "bad_col" not found'); call_tool re-raises → MCP response isError=True carrying the message. Agent knows it failed.
    • After: submit_run.fn catches the BinderException (it is not DuckDBExternalAccessBlocked), sets run.status=FAILED / run.error="Referenced column ... not found" / run.result=None, and returns None. Future resolves cleanly; _tool_run_backed_local maps run.result is None → {} and returns a success result:
      { "run_id": "a1b2c3d4-..." }
      No data, no error. The summary agent reads this as "ran fine, empty diff" and can emit {{run:a1b2c3d4}} citing a run that actually errored.

    Exact culprit lines (the failure is the absence of a status check across three lines):

    609    run, future = _submit_run_fn(run_type, params, triggered_by="recce_ai")
    610    # Await the executor future so run.result is populated before we return.
    611    await asyncio.wrap_future(future)        # <-- discards return value, never raises:
                                                    #     submit_run.fn returns None on error,
                                                    #     so the future resolves "successfully"
    612    # Normalise run.result to a plain dict (tasks may return Pydantic models).
    613    raw_result = run.result
    614    if raw_result is None:                    # <-- FAILED run (result=None, msg in run.error)
    615        result: Dict[str, Any] = {}           #     collapses into empty {}, status/error ignored
    ...
    624    return {**result, "run_id": str(run.run_id)}   # <-- success-shaped dict returned unconditionally
    • Line 611 awaits but never raises — the error signal is lost here (the line 610 comment assumes success).
    • Lines 614-615 treat a FAILED run identically to a genuinely-empty result; nothing reads run.status/run.error.
    • Line 624 returns {"run_id": ...} regardless of failure.

    Lines 614-615 are not wrong in isolation — a successful task can legitimately return empty. The bug is that nothing distinguishes "succeeded empty" from "failed," because line 611 already swallowed that signal.

    Suggested fix — add a guard between 611 and 613:

    await asyncio.wrap_future(future)
    if run.status == RunStatus.FAILED:
        raise RecceException(run.error or f"{run_type} run failed")
    raw_result = run.result

    This re-raises so call_tool (mcp_server.py:1450-1474) restores isError=True, while still leaving the persisted Run in the context for citation.

Issues

  1. recce/mcp_server.py (tool descriptions) — The 5 run-backed tools now return a top-level run_id, but none of their Tool(...description=...) blocks document it.
    Evidence: grep run_id over the descriptions finds only run_check (line 1207, unrelated). CLAUDE.md → "MCP tool description = LLM agent contract. Description MUST match actual response format" and "Prefer additive (_meta) over modifying existing field types." A new top-level field is a response-shape change the contract doesn't declare.
    Pass C.

  2. PR description omits the DRC-3634 local-mode rewrite — the largest change in the diff.
    Evidence: the "What" section describes only CloudBackend._tool_run_backed merging run_id (additive). The actual diff reroutes row_count_diff/query/query_diff/profile_diff/value_diff through a new _tool_run_backed_local that executes via submit_run, persists Runs to the in-process context, and changes return shape + error semantics. That behavior change is documented only in inline # DRC-3634 comments — exactly where the Blocker hides.
    Pass F.

Notes

  • tests/test_mcp_server.py::TestLocalModeRunBacked covers only happy paths (every task.execute is mocked to return a model). No test exercises a raising task, so the error-swallowing regression above is uncaught by the suite. Adding a "task raises → tool surfaces error (isError), not {run_id}" case would have flagged it. Pass E.
  • _tool_run_backed_local fallback branch (default_context() is None, lines 591-606) preserves the old raise-on-error behavior, but in live local mode default_context() is set, so the swallowing path (lines 609-624) is the one that runs.

Limits

  • Local environment is missing the mcp and dbt modules (pytest.importorskip("mcp") skipped the MCP suites; tests/tasks conftest fails on No module named 'dbt'). Could not execute the test suite here — author reports 40 cloud-backend + 123 mcp_server green; CI gates. Findings above are from static tracing, not a failed test run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants