8 changes: 5 additions & 3 deletions backend/src/mastra/agents/investigate.ts
@@ -1,5 +1,6 @@
import { Agent } from "@mastra/core/agent";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
+import { wrapModelWithTokenLimit } from "../model-wrapper.js";
import { buildPopulateTools } from "../tools/dataset-tools.js";
import { searchWebTool, fetchPageTool } from "../tools/web-tools.js";
import type { AuthContext } from "../workflows/populate.js";
@@ -28,18 +29,19 @@ RULES:
 - You have at most 6 tool calls total. Budget them: 1 fetch + 1 search + 1 fetch + 1 insert = done.
 - ALWAYS insert a row, even if some fields are incomplete. Use "" for unknown fields. Partial real data is better than no row.
 - Never fabricate values. Use "" for anything you cannot verify.
+- For every field value you extract and fill in "data", you MUST record the cell-level provenance (the source URL, the search query used to find it, and the exact text snippet context showing the value) in the "provenance" parameter of insert_row/update_row.
 - insert_row rejects duplicates based on primary key columns. If you get a "Duplicate" error, do NOT retry — report INSERTED: false and move on.

 TOOL CALL FORMAT — every tool call argument must be a JSON object wrapped in curly braces:
 search_web: {"query": "your search terms"}
 fetch_page: {"url": "https://example.com"}
-insert_row: {"data": {${columnNames.map((n) => `"${n}": "value"`).join(", ")}}, "sources": ["https://url-you-fetched.com"], "row_summary": "one line about this entity", "how_found": "step by step guide on how to extract the data so an agent in the future can do it too"}
+insert_row: {"data": {${columnNames.map((n) => `"${n}": "value"`).join(", ")}}, "sources": ["https://url-you-fetched.com"], "provenance": {${columnNames.map((n) => `"${n}": {"url": "https://url-you-fetched.com", "query": "search query used", "snippet": "exact context snippet from page"}`).join(", ")}}, "row_summary": "one line about this entity", "how_found": "step by step guide on how to extract the data so an agent in the future can do it too"}

 WORKFLOW:
 1. Fetch 1-2 of the provided URLs to get real data (if URLs were given).
 2. If you need more, run ONE search and fetch the best result.
 3. Call insert_row with whatever real data you have. Use "" for missing fields.
-Include "sources" (URLs you fetched), "row_summary" (one line about this entity), and "how_found" (a step by step guide on how you found this data. eg, 1. fetch the contents of this url "<insert url>", 2. Look for the pricing field, and title name field, 3. etc...)
+Include "sources" (URLs you fetched), "provenance" (mapping of column names to their detailed source details), "row_summary" (one line about this entity), and "how_found" (a step by step guide on how you found this data. eg, 1. fetch the contents of this url "<insert url>", 2. Look for the pricing field, and title name field, 3. etc...)
 4. Write your final response:
 INSERTED: true/false
 SUMMARY: one line
@@ -70,7 +72,7 @@ export function buildInvestigateAgent(
id: "investigate-agent",
name: "Dataset Investigate Agent",
instructions: buildInvestigateInstructions(columns),
-model: openrouter(modelSlug),
+model: wrapModelWithTokenLimit(openrouter(modelSlug)),

tools: {
insert_row,
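
Note on the prompt template in this file: the `insert_row` example is assembled by interpolating each column name into a hand-written JSON string, so a column name containing a double quote or backslash would give the model an unparseable example. A minimal sketch of an alternative, assuming the same field names as the prompt; `buildInsertRowExample` and the `ProvenanceEntry` type are hypothetical names, not part of this PR:

```typescript
// Illustrative sketch only; not code from this PR.
type ProvenanceEntry = { url: string; query?: string; snippet?: string };

// Hypothetical helper: builds the insert_row example shown in the prompt.
function buildInsertRowExample(columnNames: string[]): string {
  const data: Record<string, string> = {};
  const provenance: Record<string, ProvenanceEntry> = {};
  for (const name of columnNames) {
    data[name] = "value";
    provenance[name] = {
      url: "https://url-you-fetched.com",
      query: "search query used",
      snippet: "exact context snippet from page",
    };
  }
  // JSON.stringify escapes quotes and backslashes in column names,
  // so the example handed to the model is always parseable JSON.
  return JSON.stringify({
    data,
    sources: ["https://url-you-fetched.com"],
    provenance,
    row_summary: "one line about this entity",
    how_found: "step by step guide on how to extract the data",
  });
}

// Usage: a column name containing a double quote still round-trips.
const insertRowExample = buildInsertRowExample(["name", 'size (12" model)']);
const parsedExample = JSON.parse(insertRowExample) as {
  provenance: Record<string, ProvenanceEntry>;
};
console.log(Object.keys(parsedExample.provenance));
```

The output has no spaces after colons or commas, which differs cosmetically from the hand-written template but parses to the same object.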
3 changes: 2 additions & 1 deletion backend/src/mastra/agents/populate.ts
@@ -1,5 +1,6 @@
import { Agent } from "@mastra/core/agent";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
+import { wrapModelWithTokenLimit } from "../model-wrapper.js";
import { buildSubagentTool } from "../tools/investigate-tool.js";
import { searchWebTool, fetchPageTool } from "../tools/web-tools.js";
import type { AuthContext } from "../workflows/populate.js";
@@ -53,7 +54,7 @@ export function buildPopulateAgent(
id: "populate-agent",
name: "Dataset Populate Orchestrator",
instructions: buildInstructions(maxRowCount),
-model: openrouter(modelSlug),
+model: wrapModelWithTokenLimit(openrouter(modelSlug)),
tools: {
search_web: searchWebTool,
fetch_page: fetchPageTool,
3 changes: 2 additions & 1 deletion backend/src/mastra/agents/refresh.ts
@@ -1,5 +1,6 @@
import { Agent } from "@mastra/core/agent";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
+import { wrapModelWithTokenLimit } from "../model-wrapper.js";
import { buildPopulateTools } from "../tools/dataset-tools.js";
import { searchWebTool, fetchPageTool } from "../tools/web-tools.js";
import type { AuthContext } from "../workflows/populate.js";
@@ -64,7 +65,7 @@ export function buildRefreshAgent(
id: "refresh-agent",
name: "Dataset Refresh Agent",
instructions: buildRefreshInstructions(columns),
-model: openrouter("qwen/qwen3.7-max"),
+model: wrapModelWithTokenLimit(openrouter("qwen/qwen3.7-max")),
tools: {
update_row,
search_web: searchWebTool,
62 changes: 62 additions & 0 deletions backend/src/mastra/model-wrapper.test.ts
@@ -0,0 +1,62 @@
import test from "node:test";
import assert from "node:assert";
import { wrapModelWithTokenLimit } from "./model-wrapper.js";

test("wrapModelWithTokenLimit - doGenerate intercepts and caps maxTokens", async () => {
let receivedOptions: any = null;

const mockModel: any = {
provider: "test-provider",
modelId: "test-model",
doGenerate: async (options: any) => {
receivedOptions = options;
return { text: "mock response" };
},
doStream: async (options: any) => {
receivedOptions = options;
return { stream: "mock stream" };
},
};

const wrapped = wrapModelWithTokenLimit(mockModel, 4096);

// 1. Default maxTokens when not provided
await wrapped.doGenerate({ prompt: "hello" });
assert.strictEqual(receivedOptions.maxTokens, 4096);

// 2. Cap maxTokens when it exceeds the limit
await wrapped.doGenerate({ prompt: "hello", maxTokens: 99999 });
assert.strictEqual(receivedOptions.maxTokens, 4096);

// 3. Keep maxTokens when it is below the limit
await wrapped.doGenerate({ prompt: "hello", maxTokens: 1000 });
assert.strictEqual(receivedOptions.maxTokens, 1000);

// 4. Test doStream default
await wrapped.doStream({ prompt: "hello" });
assert.strictEqual(receivedOptions.maxTokens, 4096);

// 5. Test doStream cap
await wrapped.doStream({ prompt: "hello", maxTokens: 99999 });
assert.strictEqual(receivedOptions.maxTokens, 4096);

// 6. Test doStream keep below limit
await wrapped.doStream({ prompt: "hello", maxTokens: 1000 });
assert.strictEqual(receivedOptions.maxTokens, 1000);
});

test("wrapModelWithTokenLimit - forwards properties and binds functions", () => {
const mockModel: any = {
provider: "test-provider",
modelId: "test-model",
someFunc() {
return this.provider;
},
};

const wrapped = wrapModelWithTokenLimit(mockModel, 4096);

assert.strictEqual(wrapped.provider, "test-provider");
assert.strictEqual(wrapped.modelId, "test-model");
assert.strictEqual(wrapped.someFunc(), "test-provider");
});
40 changes: 40 additions & 0 deletions backend/src/mastra/model-wrapper.ts
@@ -0,0 +1,40 @@
/**
* Wraps a LanguageModel with a Proxy to cap or default the maxTokens parameter.
* This prevents OpenRouter 402 errors due to requesting the default 65535 maxTokens.
*/
export function wrapModelWithTokenLimit(
model: any,
maxTokensLimit: number = 8192,
): any {
return new Proxy(model, {
get(target, prop, receiver) {
if (prop === "doGenerate") {
return async function (options: any) {
const modifiedOptions = { ...options };
if (typeof modifiedOptions.maxTokens === "number") {
modifiedOptions.maxTokens = Math.min(modifiedOptions.maxTokens, maxTokensLimit);
} else {
modifiedOptions.maxTokens = maxTokensLimit;
}
return target.doGenerate(modifiedOptions);
};
}
if (prop === "doStream") {
return async function (options: any) {
const modifiedOptions = { ...options };
if (typeof modifiedOptions.maxTokens === "number") {
modifiedOptions.maxTokens = Math.min(modifiedOptions.maxTokens, maxTokensLimit);
} else {
modifiedOptions.maxTokens = maxTokensLimit;
}
return target.doStream(modifiedOptions);
};
}
const val = Reflect.get(target, prop, receiver);
if (typeof val === "function") {
return val.bind(target);
}
return val;
},
});
}
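
Note on the wrapper above: the `doGenerate` and `doStream` branches repeat the same cap-or-default logic, and both assume the option is named `maxTokens`. The `generateText` call in `workflows/populate.ts` passes `maxOutputTokens`, so the key the wrapped model actually receives may depend on the AI SDK version and is worth verifying. A minimal sketch of the shared logic with the key as a parameter; `capTokenOption` and `TokenKey` are hypothetical names, not part of this PR:

```typescript
// Illustrative sketch only; not code from this PR.
// Which of these keys the provider reads is an assumption to verify.
type TokenKey = "maxTokens" | "maxOutputTokens";

// Returns a copy of `options` whose token limit is capped at `limit`,
// or set to `limit` when the caller did not supply a numeric value.
function capTokenOption(
  options: Record<string, unknown>,
  limit: number,
  key: TokenKey = "maxTokens",
): Record<string, unknown> {
  const requested = options[key];
  const capped =
    typeof requested === "number" ? Math.min(requested, limit) : limit;
  return { ...options, [key]: capped };
}

// Usage: the same three cases the tests in this PR exercise.
const defaulted = capTokenOption({ prompt: "hello" }, 4096);
const clamped = capTokenOption({ prompt: "hello", maxTokens: 99999 }, 4096);
const untouched = capTokenOption({ prompt: "hello", maxTokens: 1000 }, 4096);
console.log(defaulted["maxTokens"], clamped["maxTokens"], untouched["maxTokens"]);
```

Both proxy branches could then call one helper and forward the result to `target.doGenerate` or `target.doStream`.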
28 changes: 26 additions & 2 deletions backend/src/mastra/tools/dataset-tools.ts
@@ -131,6 +131,17 @@ export function buildPopulateTools(
.array(z.string())
.optional()
.describe("URLs you visited or used to gather data for this row"),
+provenance: z
+.record(
+z.string(),
+z.object({
+url: z.string(),
+query: z.string().optional(),
+snippet: z.string().optional(),
+})
+)
+.optional()
+.describe("Mapping of column names to their detailed source provenance (url, query, snippet)"),
row_summary: z
.string()
.optional()
@@ -141,7 +152,7 @@ export function buildPopulateTools(
.describe("Brief description of how you found and verified this data"),
}),
outputSchema: writeResultSchema,
-execute: async ({ data, sources, row_summary, how_found }) => {
+execute: async ({ data, sources, provenance, row_summary, how_found }) => {
if (!data || Object.keys(data).length === 0)
return {
success: false,
@@ -158,6 +169,7 @@ export function buildPopulateTools(
datasetId: authorizedDatasetId,
data: cleanedData,
...(sources !== undefined ? { sources } : {}),
+...(provenance !== undefined ? { provenance } : {}),
...(row_summary !== undefined ? { rowSummary: row_summary } : {}),
...(how_found !== undefined ? { howFound: how_found } : {}),
});
@@ -265,6 +277,17 @@ export function buildPopulateTools(
.array(z.string())
.optional()
.describe("Updated source URLs where this data was verified"),
+provenance: z
+.record(
+z.string(),
+z.object({
+url: z.string(),
+query: z.string().optional(),
+snippet: z.string().optional(),
+})
+)
+.optional()
+.describe("Updated mapping of column names to their detailed source provenance (url, query, snippet)"),
row_summary: z
.string()
.optional()
@@ -275,7 +298,7 @@ export function buildPopulateTools(
.describe("Brief description of how the updated data was found"),
}),
outputSchema: writeResultSchema,
-execute: async ({ rowId, data, sources, row_summary, how_found }) => {
+execute: async ({ rowId, data, sources, provenance, row_summary, how_found }) => {
if (!rowId) return { success: false, error: "rowId is required." };
if (!data || Object.keys(data).length === 0)
return {
@@ -293,6 +316,7 @@ export function buildPopulateTools(
expectedDatasetId: authorizedDatasetId,
data: cleanedData,
...(sources !== undefined ? { sources } : {}),
+...(provenance !== undefined ? { provenance } : {}),
...(row_summary !== undefined ? { rowSummary: row_summary } : {}),
...(how_found !== undefined ? { howFound: how_found } : {}),
});
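
Note on the conditional spreads in both `execute` handlers: each optional argument is added to the payload only when it is defined, so an omitted `provenance` leaves the key out entirely rather than sending it as `undefined`. A minimal sketch of that pattern in isolation; `buildMetadataPayload`, `RowMetadata`, and `RowWritePayload` are hypothetical names, and the camelCase output keys mirror the ones in the diff:

```typescript
// Illustrative sketch only; not code from this PR.
type ProvenanceEntry = { url: string; query?: string; snippet?: string };

interface RowMetadata {
  sources?: string[];
  provenance?: Record<string, ProvenanceEntry>;
  row_summary?: string;
  how_found?: string;
}

interface RowWritePayload {
  sources?: string[];
  provenance?: Record<string, ProvenanceEntry>;
  rowSummary?: string;
  howFound?: string;
}

// Copies only the fields the caller actually supplied, renaming
// snake_case tool arguments to the camelCase keys used in the payload.
function buildMetadataPayload(meta: RowMetadata): RowWritePayload {
  return {
    ...(meta.sources !== undefined ? { sources: meta.sources } : {}),
    ...(meta.provenance !== undefined ? { provenance: meta.provenance } : {}),
    ...(meta.row_summary !== undefined ? { rowSummary: meta.row_summary } : {}),
    ...(meta.how_found !== undefined ? { howFound: meta.how_found } : {}),
  };
}

// Usage: a call without provenance produces a payload without that key.
const legacyPayload = buildMetadataPayload({ sources: ["https://example.com"] });
console.log(Object.keys(legacyPayload));
```

Because `provenance` is optional at every layer, rows written before this change, or by a model that ignores the new rule, keep working unchanged.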
3 changes: 2 additions & 1 deletion backend/src/mastra/workflows/populate.ts
@@ -2,6 +2,7 @@ import { createStep, createWorkflow } from "@mastra/core/workflows";
import { z } from "zod";
import { generateText } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
+import { wrapModelWithTokenLimit } from "../model-wrapper.js";
import { datasetContextSchema, populateColumnSchema } from "../../pipeline/populate.js";
import { convex, internal } from "../../convex.js";
import { DEFAULT_MODEL_IDS } from "../../config/models.js";
@@ -114,7 +115,7 @@ Respond with EXACTLY one word: scraper or search`;
const modelSlug =
inputData.authContext?.modelConfig?.schemaInference ?? DEFAULT_MODEL_IDS.SCHEMA_INFERENCE;
const result = await generateText({
-model: openrouter(modelSlug),
+model: wrapModelWithTokenLimit(openrouter(modelSlug)),
prompt: classificationPrompt,
maxOutputTokens: 10,
abortSignal: getSignal(inputData.datasetId),
3 changes: 2 additions & 1 deletion backend/src/pipeline/schema-inference.ts
@@ -1,5 +1,6 @@
import { generateText, Output, NoObjectGeneratedError } from "ai";
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
+import { wrapModelWithTokenLimit } from "../mastra/model-wrapper.js";

import { DEFAULT_MODEL_IDS } from "../config/models.js";
import { datasetSchemaSchema, type DatasetSchema } from "./types.js";
@@ -33,7 +34,7 @@ function getModel(modelSlug?: string) {
}
const openrouter = createOpenRouter({ apiKey });
const resolvedSlug = modelSlug ?? DEFAULT_MODEL_IDS.SCHEMA_INFERENCE;
-return openrouter(resolvedSlug);
+return wrapModelWithTokenLimit(openrouter(resolvedSlug));
}

export async function inferSchema(prompt: string, modelSlug?: string): Promise<DatasetSchema> {
13 changes: 12 additions & 1 deletion frontend/app/dataset/[id]/page.tsx
@@ -43,6 +43,11 @@ export default function DatasetPage() {
column: DatasetColumn;
value: unknown;
sources?: string[];
+provenance?: {
+url: string;
+query?: string;
+snippet?: string;
+};
} | null>(null);

const datasetId = params.id as Id<"datasets">;
@@ -103,7 +108,12 @@ export default function DatasetPage() {
const col = dataset.columns.find((c) => c.name === columnName);
if (!col) return;
const row = rows.find((r) => r._id === rowId);
-setCellDetail({ column: col, value, sources: row?.sources });
+setCellDetail({
+column: col,
+value,
+sources: row?.sources,
+provenance: row?.provenance?.[columnName],
+});
}, [dataset, rows]);

const openedFired = useRef<string | null>(null);
@@ -462,6 +472,7 @@ export default function DatasetPage() {
column={cellDetail.column}
value={cellDetail.value}
sources={cellDetail.sources}
+provenance={cellDetail.provenance}
/>
)}
</SideSheet>
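
Note on the per-cell lookup in this file: `row?.provenance?.[columnName]` resolves to `undefined` both for rows stored before this change and for columns the agent did not annotate, which is what lets the side sheet hide the provenance panel. A minimal sketch of that lookup as a standalone function; the `Row` shape and `cellProvenance` are hypothetical and only model the fields the diff touches:

```typescript
// Illustrative sketch only; not code from this PR.
type ProvenanceEntry = { url: string; query?: string; snippet?: string };

// Assumed row shape: only the fields this diff reads.
interface Row {
  _id: string;
  sources?: string[];
  provenance?: Record<string, ProvenanceEntry>;
}

// Returns the provenance for one cell, or undefined when the row is
// missing, has no provenance map, or has no entry for that column.
function cellProvenance(
  rows: Row[],
  rowId: string,
  columnName: string,
): ProvenanceEntry | undefined {
  const row = rows.find((r) => r._id === rowId);
  return row?.provenance?.[columnName];
}

// Usage: one annotated row and one legacy row without provenance.
const sampleRows: Row[] = [
  {
    _id: "row1",
    provenance: { price: { url: "https://example.com/pricing", snippet: "$10 per month" } },
  },
  { _id: "row2", sources: ["https://example.com"] },
];
console.log(cellProvenance(sampleRows, "row1", "price")?.url);
console.log(cellProvenance(sampleRows, "row2", "price"));
```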
67 changes: 66 additions & 1 deletion frontend/components/SideSheet.tsx
@@ -119,6 +119,12 @@ interface CellDetailProps {
value: unknown;
/** Row-level sources stored by the populate agent. */
sources?: string[];
+/** Cell-level provenance metadata. */
+provenance?: {
+url: string;
+query?: string;
+snippet?: string;
+};
}

function isValidHttpUrl(src: string): boolean {
@@ -130,7 +136,7 @@ function isValidHttpUrl(src: string): boolean {
}
}

-export function CellDetail({ column, value, sources }: CellDetailProps) {
+export function CellDetail({ column, value, sources, provenance }: CellDetailProps) {
const [copied, setCopied] = useState(false);
const copyTimerRef = useRef<ReturnType<typeof setTimeout> | null>(null);
const displayValue = value == null || value === "" ? "—" : String(value);
@@ -192,6 +198,65 @@ export function CellDetail({ column, value, sources }: CellDetailProps) {
</div>
</div>

+{/* Cell Provenance */}
+{provenance && (
+<div className="rounded-xl border border-emerald-500/15 bg-emerald-500/[0.02] p-4 space-y-3.5">
+<div className="flex items-center gap-2 text-emerald-700 dark:text-emerald-400 font-medium text-xs">
+<svg width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" strokeWidth="2" strokeLinecap="round" strokeLinejoin="round">
+<path d="M12 22s8-4 8-10V5l-8-3-8 3v7c0 6 8 10 8 10z"/>
+</svg>
+<span>Verified Source Origin</span>
+</div>
+
+<div className="space-y-3">
+{/* Source URL */}
+<div>
+<p className="text-[10px] font-semibold text-muted uppercase tracking-wider">Source URL</p>
+{isValidHttpUrl(provenance.url) ? (
+<a
+href={provenance.url}
+target="_blank"
+rel="noopener noreferrer"
+className="inline-flex items-center gap-1 text-xs text-link hover:underline break-all mt-0.5"
+data-ph-mask-text="true"
+>
+<IconExternalLink />
+{provenance.url}
+</a>
+) : (
+<p className="text-xs text-foreground break-all mt-0.5" data-ph-mask-text="true">
+{provenance.url}
+</p>
+)}
+</div>
+
+{/* Search Query */}
+{provenance.query && (
+<div>
+<p className="text-[10px] font-semibold text-muted uppercase tracking-wider">Search Query Used</p>
+<div className="inline-flex items-center gap-1 px-1.5 py-0.5 rounded bg-foreground/[0.04] border border-border/60 text-xs text-foreground/80 mt-1" data-ph-mask-text="true">
+<svg width="10" height="10" viewBox="0 0 24 24" fill="none" stroke="currentColor" strokeWidth="2.5" strokeLinecap="round" strokeLinejoin="round" className="opacity-60">
+<circle cx="11" cy="11" r="8"/><path d="m21 21-4.3-4.3"/>
+</svg>
+<span>{provenance.query}</span>
+</div>
+</div>
+)}
+
+{/* Text Snippet */}
+{provenance.snippet && (
+<div>
+<p className="text-[10px] font-semibold text-muted uppercase tracking-wider mb-1">Snippet Context</p>
+<div className="relative rounded-lg border border-border bg-background px-3 py-2 text-xs italic text-foreground/80 leading-relaxed" data-ph-mask-text="true">
+<span className="absolute left-2.5 top-1.5 text-foreground/10 text-2xl font-serif leading-none">&ldquo;</span>
+<p className="pl-4 pr-1">{provenance.snippet}</p>
+</div>
+</div>
+)}
+</div>
+</div>
+)}

{/* Sources */}
{sources && sources.length > 0 && (
<div>