feat: implement per-cell source provenance #130

Open
pritpatel2412 wants to merge 2 commits into tinyfish-io:main from pritpatel2412:feat/per-cell-provenance
@pritpatel2412

This pull request implements per-cell source provenance across the database schema, backend workflow tools, agent prompts, and frontend UI. It allows the web search subagents to record specific URLs, queries, and snippets for individual cell extractions, and renders indicators in the dataset table and detailed origin information in the SideSheet.

Key changes:

  • Extended the Convex datasetRows schema and mutations to store cell-level provenance data.
  • Updated Mastra tools (insert_row, update_row) to accept and pass the provenance mapping.
  • Guided subagents via system prompts to research and supply precise cell-level provenance context.
  • Added a visual indicator (emerald dot) next to cells containing provenance info.
  • Enhanced the SideSheet to display the URL, search query, and snippet blockquote under a "Verified Source Origin" card.
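
To make the data flow above concrete, here is a minimal sketch of what a per-cell provenance mapping could look like. The field names `url`, `query`, and `snippet` come from the description; everything else (the type names, which fields are optional, and the helpers `getCellProvenance` and `mergeProvenance`) is hypothetical and not taken from the actual schema or tools in this PR.

```typescript
// Hypothetical shapes: url/query/snippet are named in the PR description,
// but the real datasetRows schema is not shown here.
interface CellProvenance {
  url: string; // page the value was extracted from
  query?: string; // search query that surfaced the page (assumed optional)
  snippet?: string; // excerpt supporting the value (assumed optional)
}

// Maps a column name to the provenance recorded for that column's cell.
type RowProvenance = Record<string, CellProvenance>;

interface DatasetRow {
  data: Record<string, unknown>;
  provenance?: RowProvenance; // optional so rows without provenance stay valid
}

// Returns the provenance for one cell, or null when none was recorded.
function getCellProvenance(row: DatasetRow, column: string): CellProvenance | null {
  return row.provenance?.[column] ?? null;
}

// Merges provenance from an update into the existing mapping without
// dropping entries for columns the update did not touch.
function mergeProvenance(
  existing: RowProvenance | undefined,
  incoming: RowProvenance | undefined
): RowProvenance {
  return { ...(existing ?? {}), ...(incoming ?? {}) };
}
```

A table cell could then show its indicator whenever `getCellProvenance(row, column)` is non-null. Whether `update_row` merges or replaces the mapping is a design decision this sketch only assumes.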

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

📝 Walkthrough

This PR introduces cell-level provenance tracking across the data extraction and display pipeline. Backend agent instructions now require each extracted field to include source provenance (URL, query, snippet). Backend tools accept this provenance and forward it through Convex mutations. The frontend data model persists provenance per cell, and UI components render source origin details in cell detail panels with verified URL links, search queries, and snippet excerpts. Table cells display a small indicator when provenance is available. Supporting changes include cursor-based pagination for runStats queries and a timing fix for row change flash detection.

Possibly related PRs

  • tinyfish-io/bigset#26: Both PRs modify the populate/CRUD tool layer—specifically backend/src/mastra/tools/dataset-tools.ts insert_row/update_row behavior (retrieved adds the tools; main PR extends them to accept and persist per-column provenance).
  • tinyfish-io/bigset#104: Both PRs modify the investigate agent/tool-call contract and dataset row write helpers (insert_row/update_row) to require/provide extraction provenance metadata (main PR adds per-cell provenance with url/query/snippet; retrieved PR adds insert/update sources + row_summary + how_found), overlapping at the same instruction and row-mutation integration points.
  • tinyfish-io/bigset#115: Both PRs modify the cell-expand details UI on the dataset page—SideSheet/CellDetail and DataRow—with the main PR extending the existing sources-based side sheet to support per-cell provenance.

Suggested reviewers

  • simantak-dabhade
  • MMeteorL
  • hwennnn
🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: implement per-cell source provenance' directly and clearly summarizes the main change across all modified files in the changeset. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing schema extensions, tool updates, UI enhancements, and specific implementation details across all modified components. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
frontend/components/SideSheet.tsx (1)

130-137: 💤 Low value

Consider extracting URL validation to a shared utility.

The isValidHttpUrl helper here is similar to toSafeHttpUrl in CellValue.tsx. While the return types differ (boolean vs string|null), consider extracting the common validation logic to a shared utility module to reduce duplication and improve maintainability.

♻️ Potential shared utility approach

Create a lib/url-validation.ts file:

export function isValidHttpUrl(url: string): boolean {
  try {
    const { protocol } = new URL(url);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

export function toSafeHttpUrl(url: string): string | null {
  return isValidHttpUrl(url) ? url : null;
}

Then import in both files.
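
As a quick sanity check of the proposed helpers, the sketch below repeats them so it runs on its own and adds a hypothetical caller, `hrefFor`, which is not part of the PR. It only illustrates how non-HTTP schemes and unparseable strings would be rejected.

```typescript
// Same helpers as in the proposed lib/url-validation.ts, repeated so this
// snippet is self-contained.
function isValidHttpUrl(url: string): boolean {
  try {
    const { protocol } = new URL(url);
    return protocol === "http:" || protocol === "https:";
  } catch {
    // new URL() throws on strings it cannot parse as an absolute URL.
    return false;
  }
}

function toSafeHttpUrl(url: string): string | null {
  return isValidHttpUrl(url) ? url : null;
}

// Hypothetical caller: only produce an href when the URL is safe to link.
function hrefFor(candidate: string): string | undefined {
  return toSafeHttpUrl(candidate) ?? undefined;
}
```

With these definitions, `javascript:` and `ftp:` URLs parse successfully but fail the protocol check, while a bare string such as `not a url` fails in the `catch` branch.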

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/components/SideSheet.tsx` around lines 130 - 137, Extract the shared
HTTP URL validation into a new utility (e.g., lib/url-validation.ts) and replace
the inline helper in SideSheet.tsx and the logic in CellValue.tsx with imports;
specifically, move the current isValidHttpUrl implementation into an exported
isValidHttpUrl(url: string): boolean and add an exported toSafeHttpUrl(url:
string): string | null that calls isValidHttpUrl, then update SideSheet.tsx to
import and use isValidHttpUrl and update CellValue.tsx to import and use
toSafeHttpUrl so both files reuse the same validation logic.
frontend/components/table/use-row-change-detection.ts (1)

55-75: ⚡ Quick win

Consider starting the flash-off timer inside the flash-on callback for clearer sequencing.

The flash-on timer (line 56) and flash-off timer (line 66) are currently scheduled at the same time. The event loop runs the setTimeout(0) callback before the setTimeout(1500) callback, so the order is already correct, but starting the flash-off timer inside the flash-on callback would make the sequencing explicit and measure FLASH_DURATION_MS from when the flash actually appears rather than from when it was scheduled.

♻️ Proposed refactor for clearer sequencing
     if (newFlashes.size > 0) {
       const updateTimer = setTimeout(() => {
         setFlashingCells((prev) => {
           const merged = new Set(prev);
           for (const key of newFlashes) merged.add(key);
           return merged;
         });
         flashTimersRef.current.delete(updateTimer);
+
+        const flashOffTimer = setTimeout(() => {
+          setFlashingCells((prev) => {
+            const next = new Set(prev);
+            for (const key of newFlashes) next.delete(key);
+            return next;
+          });
+          flashTimersRef.current.delete(flashOffTimer);
+        }, FLASH_DURATION_MS);
+        flashTimersRef.current.add(flashOffTimer);
       }, 0);
       flashTimersRef.current.add(updateTimer);
-
-      const timer = setTimeout(() => {
-        setFlashingCells((prev) => {
-          const next = new Set(prev);
-          for (const key of newFlashes) next.delete(key);
-          return next;
-        });
-        flashTimersRef.current.delete(timer);
-      }, FLASH_DURATION_MS);
-      flashTimersRef.current.add(timer);
     }
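
The nested-timer idea in the refactor above can be sketched outside React as follows. This is a simplified illustration, not the hook itself: `flashCells`, the `Schedule` type, and the plain `Set` standing in for the `flashingCells` state are all hypothetical, timer bookkeeping via `flashTimersRef` is omitted, and the 1500 ms value is assumed from the `setTimeout(1500)` mention.

```typescript
// Stand-in for setTimeout so the sequencing can be exercised without real timers.
type Schedule = (fn: () => void, delayMs: number) => void;

const FLASH_DURATION_MS = 1500; // assumed value

function flashCells(cells: Set<string>, keys: Iterable<string>, schedule: Schedule): void {
  const newFlashes = Array.from(keys);
  if (newFlashes.length === 0) return;

  // Flash on: deferred by one tick, as in the original hook.
  schedule(() => {
    for (const key of newFlashes) cells.add(key);

    // Flash off: scheduled only after the flash-on callback has run,
    // so the duration is measured from when the flash appears.
    schedule(() => {
      for (const key of newFlashes) cells.delete(key);
    }, FLASH_DURATION_MS);
  }, 0);
}
```

In real use the scheduler would be something like `(fn, ms) => { setTimeout(fn, ms); }`. Injecting it keeps the ordering testable with a fake queue.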
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/components/table/use-row-change-detection.ts` around lines 55 - 75,
The flash-off timeout should be started inside the flash-on callback to make
sequencing explicit: inside the setTimeout callback that calls setFlashingCells
to add newFlashes (the "flash-on" callback), create the second setTimeout that
removes those keys after FLASH_DURATION_MS, add both timers to
flashTimersRef.current, and ensure each timer is deleted from
flashTimersRef.current when it fires; update the logic around newFlashes,
setFlashingCells, flashTimersRef, and FLASH_DURATION_MS accordingly so the off
timer is scheduled only after the on callback runs.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99f3fe30-0595-4b1f-820e-debae85223d3

📥 Commits

Reviewing files that changed from the base of the PR and between 5a2e7c3 and 0103e93.

📒 Files selected for processing (11)
  • backend/src/mastra/agents/investigate.ts
  • backend/src/mastra/tools/dataset-tools.ts
  • frontend/app/dataset/[id]/page.tsx
  • frontend/components/SideSheet.tsx
  • frontend/components/table/DataRow.tsx
  • frontend/components/table/types.ts
  • frontend/components/table/use-row-change-detection.ts
  • frontend/convex/datasetRows.ts
  • frontend/convex/runStats.ts
  • frontend/convex/schema.ts
  • scripts/verify-authz.sh

},
handler: async (ctx, args) => {
const runs = await ctx.db
const limit = Math.min(args.limit ?? DEFAULT_PAGE_SIZE, MAX_PAGE_SIZE);

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add validation to ensure limit is positive.

The current limit calculation allows negative or zero values to pass through if explicitly provided in args.limit. This could cause unexpected pagination behavior or errors in the underlying paginate() call.

🛡️ Suggested fix to enforce positive limit
-    const limit = Math.min(args.limit ?? DEFAULT_PAGE_SIZE, MAX_PAGE_SIZE);
+    const limit = Math.min(
+      Math.max(args.limit ?? DEFAULT_PAGE_SIZE, 1),
+      MAX_PAGE_SIZE
+    );
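
For reference, the same clamp could be pulled into a small helper shared by both queries. The sketch below is an assumption-laden illustration: `clampLimit` is a hypothetical name, the constant values are placeholders rather than the ones in runStats.ts, and the extra handling of fractional and non-finite input goes beyond the suggested diff.

```typescript
// Placeholder values for illustration; the real constants live in runStats.ts.
const DEFAULT_PAGE_SIZE = 50;
const MAX_PAGE_SIZE = 200;

// Hypothetical helper: always returns an integer in [1, MAX_PAGE_SIZE].
function clampLimit(requested?: number): number {
  // Fall back to the default for undefined, NaN, and +/-Infinity.
  const base =
    requested !== undefined && Number.isFinite(requested) ? requested : DEFAULT_PAGE_SIZE;
  // Drop any fractional part, then enforce the lower and upper bounds.
  return Math.min(Math.max(Math.floor(base), 1), MAX_PAGE_SIZE);
}
```

Both handlers could then compute `const limit = clampLimit(args.limit);`, which keeps the bounds in one place if the page-size constants change.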
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/convex/runStats.ts` at line 84, The limit calculation currently
allows non-positive values from args.limit to pass through; update the code
around the variable limit (the line using args.limit, DEFAULT_PAGE_SIZE,
MAX_PAGE_SIZE) to validate and clamp args.limit to a positive integer before
applying Math.min—e.g., coerce args.limit to a number, ensure it's at least 1
(or fall back to DEFAULT_PAGE_SIZE) and then cap with MAX_PAGE_SIZE so
paginate() always receives a positive limit; modify the expression that computes
limit to perform this validation.

},
handler: async (ctx, args) => {
const runs = await ctx.db
const limit = Math.min(args.limit ?? DEFAULT_PAGE_SIZE, MAX_PAGE_SIZE);

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add validation to ensure limit is positive.

Same issue as in listByDataset: the limit calculation allows negative or zero values to pass through, which could cause unexpected behavior.

🛡️ Suggested fix to enforce positive limit
-    const limit = Math.min(args.limit ?? DEFAULT_PAGE_SIZE, MAX_PAGE_SIZE);
+    const limit = Math.min(
+      Math.max(args.limit ?? DEFAULT_PAGE_SIZE, 1),
+      MAX_PAGE_SIZE
+    );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/convex/runStats.ts` at line 110, The current limit computation in
runStats.ts (the const limit = Math.min(args.limit ?? DEFAULT_PAGE_SIZE,
MAX_PAGE_SIZE);) doesn't prevent zero or negative values; update the logic that
computes limit (using args.limit, DEFAULT_PAGE_SIZE and MAX_PAGE_SIZE) to
enforce a positive integer (e.g., coerce to a number and clamp with a lower
bound of 1) before using it, and validate or sanitize args.limit so limit is
always >= 1 and <= MAX_PAGE_SIZE.
