34 commits
66767ee
Add AI-powered dataset populate with web search and CRUD tools
May 22, 2026
1429f0d
Address CodeRabbit review: authz, logging, timeouts, type alignment
May 22, 2026
04435f8
Move dataset ownership check inside try/catch for error handling
May 22, 2026
4de7ad7
Stabilize populate agent branch
giaphutran12 May 22, 2026
095c1b1
Add Mastra populate benchmark runtime
giaphutran12 May 22, 2026
9b6df1f
Ignore benchmark result artifacts
giaphutran12 May 22, 2026
fe55fc2
Add structured row recovery for Mastra populate
giaphutran12 May 22, 2026
72bb0ba
Add Mastra populate self-healing runtime
giaphutran12 May 22, 2026
823fa38
Wire Mastra populate through self-healing
giaphutran12 May 22, 2026
0efaf9d
Add self-healing populate cron runner
giaphutran12 May 22, 2026
17c4b97
Load self-healing cron context by dataset id
giaphutran12 May 22, 2026
21ca069
Add self-healing stack verifier
giaphutran12 May 22, 2026
f0d89b7
Document data collection agent migration plan
giaphutran12 May 22, 2026
1b2af8b
Add collection populate runtime adapter
giaphutran12 May 22, 2026
eeebdc4
Wire populate runtime selection
giaphutran12 May 22, 2026
6cacc56
Add collection self-healing benchmark lane
giaphutran12 May 22, 2026
aa4bb53
Refresh collection migration handoff plan
giaphutran12 May 22, 2026
346a20e
Address migration plan review gaps
giaphutran12 May 22, 2026
41767eb
Carry benchmark metadata through collection contract
giaphutran12 May 22, 2026
c2383b1
Load collection runner modules from runtime env
giaphutran12 May 22, 2026
ca90366
Port collection pipeline runner into self-healing path
giaphutran12 May 22, 2026
d476174
Harden collection runner wiring
giaphutran12 May 22, 2026
5d6a5f3
Bound collection agent runtime defaults
giaphutran12 May 22, 2026
4aaa209
Pass collection Agent timeout per run
giaphutran12 May 22, 2026
0f7c48e
Improve collection source targeting
giaphutran12 May 22, 2026
514591d
Surface collection capability diagnostics
giaphutran12 May 22, 2026
3cb4146
Document collection agent canary result
giaphutran12 May 22, 2026
cef8d39
Improve collection source coherence
giaphutran12 May 22, 2026
4265d23
Improve collection evidence support
giaphutran12 May 22, 2026
3348ae3
Fix collection URL-field source evidence
giaphutran12 May 22, 2026
14db4e2
Merge branch 'main' into codex/collection-official-website-sources
MMeteorL May 23, 2026
f00b26c
Document branch lineage after merging main
MMeteorL May 23, 2026
acc180b
Changes made to improve the mastra agent performance, add sourceUrl p…
MMeteorL May 23, 2026
f4111e8
fixed some issues introduced by memory
MMeteorL May 23, 2026
8 changes: 8 additions & 0 deletions .env.example
@@ -9,11 +9,19 @@ CLERK_SECRET_KEY=sk_test_...
# Generate at https://openrouter.ai/settings/keys
OPENROUTER_API_KEY=sk-or-...

# TinyFish — required by populate agent web search/fetch.
# Generate at https://agent.tinyfish.ai/api-keys
TINYFISH_API_KEY=

# Generate once after the first `make dev` with:
# docker compose exec convex ./generate_admin_key.sh
# Used by the backend container to call internal Convex functions.
CONVEX_SELF_HOSTED_ADMIN_KEY=

# Durable store for self-healing populate recipe manifests.
# Docker dev overrides this to /app/.bigset/populate-recipes on a named volume.
POPULATE_RECIPE_STORE_DIR=.bigset/populate-recipes

# PostHog (optional — leave blank to disable analytics entirely in local dev).
# Get from https://us.posthog.com/project/settings/general.
NEXT_PUBLIC_POSTHOG_KEY=
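
Reviewer note: since `TINYFISH_API_KEY` ships blank in the example file, a startup check could fail fast before a populate run spends any OpenRouter credits. The following is a minimal sketch, not code from this PR. The two key names come from the env examples in this diff; the function name, the `Env` type, and the blank-means-missing rule are assumptions.

```typescript
// Hypothetical startup check. Only the two key names come from the env
// examples in this PR; the helper itself is illustrative.
type Env = Record<string, string | undefined>;

const REQUIRED_POPULATE_KEYS = ["OPENROUTER_API_KEY", "TINYFISH_API_KEY"] as const;

// Returns the names of required keys that are unset or blank.
function missingPopulateKeys(env: Env): string[] {
  return REQUIRED_POPULATE_KEYS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Example: the TinyFish key was left blank, as in the example file.
const missing = missingPopulateKeys({
  OPENROUTER_API_KEY: "sk-or-...",
  TINYFISH_API_KEY: "",
});
console.log(missing.length === 0 ? "populate keys present" : `missing: ${missing.join(", ")}`);
```

In the backend this would presumably be called with `process.env` before the populate route is registered.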
9 changes: 8 additions & 1 deletion .gitignore
@@ -1,5 +1,6 @@
.DS_Store
node_modules/
backend/node_modules
.env
.env.local
Project_BigSet_brief.md
@@ -14,16 +15,22 @@ Project_BigSet_brief.md
*.log
npm-debug.log*
yarn-debug.log*
/benchmark-results/

# Local-only files
*.bak
tmp/
temp/

.mastra
.bigset/

# Accidental root-level npm install (use backend/ and frontend/ package managers).
/package.json
/package-lock.json

# Local tarballs
*.tgz

# Internal docs
-BigSet Technical Specs & Goals.md
+BigSet Technical Specs & Goals.md
8 changes: 7 additions & 1 deletion CLAUDE.md
@@ -29,7 +29,7 @@ Backend is Fastify + Mastra. Fastify serves the HTTP API (Clerk JWT auth on prot

The schema inference pipeline: frontend calls `POST /infer-schema` → Fastify verifies the Clerk JWT → calls `inferSchema()` in `backend/src/pipeline/schema-inference.ts` → Claude Sonnet 4.6 via OpenRouter → returns a Zod-validated `DatasetSchema` → frontend maps it to editable columns in the wizard.

-The populate pipeline: frontend calls `POST /populate` with `{ datasetId, datasetName, description, columns }` → Fastify verifies the Clerk JWT → triggers `populateWorkflow` which: (1) clears existing rows, (2) builds a prompt from the schema, (3) runs the populate agent (Claude Sonnet 4.6) which searches the web via TinyFish APIs, then inserts rows into Convex one by one. Rows appear in realtime on the frontend via Convex reactive queries.
+The populate pipeline: frontend calls `POST /populate` with `{ datasetId, datasetName, description, columns }` → Fastify verifies the Clerk JWT → runs the self-healing populate service. The service builds or reuses a recipe, runs the Mastra populate runtime against TinyFish search/fetch, validates source-backed rows, repairs bad recipes, promotes the passing recipe, then atomically replaces the dataset rows in Convex. Rows appear in realtime on the frontend via Convex reactive queries.

Convex functions use `ctx.auth.getUserIdentity()` to get the authenticated user. The `ownerId` field on datasets stores `identity.subject` (Clerk user ID). Do not pass `ownerId` from the client.

@@ -49,4 +49,10 @@ Convex is self-hosted — it does NOT hot-reload when you edit files in `fronten

In CI/prod, run `npx convex deploy` with `CONVEX_SELF_HOSTED_URL` and `CONVEX_SELF_HOSTED_ADMIN_KEY` set as env vars.

## Self-Healing Verification

Run `make verify-self-healing` before handing the stack to another agent. It runs backend tests, backend build, adapter syntax checks, and a no-key benchmark smoke that should block cleanly without spending API credits.

Use `bash scripts/verify-self-healing-stack.sh --real-benchmark` for the 2-prompt real Mastra benchmark, and `bash scripts/verify-self-healing-stack.sh --convex-push --dataset-id <dataset-id>` for a live app dataset dry-run. Export the required env vars before live modes; the verifier does not parse secret files itself. Add `--commit` only when you intentionally want to replace rows.

This is an open-source (AGPL) project. Do not commit secrets, API keys, or internal docs.
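
Reviewer note: the revised CLAUDE.md paragraph describes the self-healing flow only in prose, so here is a rough sketch of the control flow as I read it. Every type and function name below is hypothetical, and only the order of the steps (build or reuse, run, validate, repair, promote, replace) is taken from the description. The real service is presumably async; this version is synchronous to keep the loop easy to follow. The default of three repair loops mirrors the commented `POPULATE_MAX_REPAIR_LOOPS=3` example in `backend/.env.example` and is an assumption about the actual default.

```typescript
// Hypothetical shapes: none of these names are taken from the PR's source.
interface Recipe { id: string; version: number }
interface Row { values: Record<string, string>; sourceUrl?: string }

interface PopulateDeps {
  loadRecipe(): Recipe | undefined;                       // reuse a stored recipe if present
  buildRecipe(): Recipe;                                  // otherwise build one from the schema
  runRecipe(recipe: Recipe): Row[];                       // populate runtime + search/fetch
  repairRecipe(recipe: Recipe, rejected: Row[]): Recipe;  // produce a corrected recipe
  promoteRecipe(recipe: Recipe): void;                    // persist the passing recipe
  replaceRows(rows: Row[]): void;                         // single atomic swap of dataset rows
}

// Assumed validation rule: a row counts as source-backed if it carries a URL.
function isSourceBacked(row: Row): boolean {
  return typeof row.sourceUrl === "string" && row.sourceUrl.length > 0;
}

function selfHealingPopulate(deps: PopulateDeps, maxRepairLoops = 3): Row[] {
  let recipe = deps.loadRecipe() ?? deps.buildRecipe();
  for (let attempt = 0; attempt <= maxRepairLoops; attempt++) {
    const rows = deps.runRecipe(recipe);
    const rejected = rows.filter((row) => !isSourceBacked(row));
    if (rows.length > 0 && rejected.length === 0) {
      // Promote first, then swap rows once, so a failed run never
      // leaves the dataset half-written.
      deps.promoteRecipe(recipe);
      deps.replaceRows(rows);
      return rows;
    }
    if (attempt < maxRepairLoops) {
      recipe = deps.repairRecipe(recipe, rejected);
    }
  }
  throw new Error("populate failed validation; existing rows were left untouched");
}
```

The property worth preserving in the real implementation is the last one: unlike the old clear-then-insert workflow, `replaceRows` is only reached after validation passes, so a failed run leaves the existing dataset intact.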
36 changes: 35 additions & 1 deletion backend/.env.example
@@ -1,6 +1,7 @@
CLIENT_ORIGIN=http://localhost:3500
CONVEX_URL=http://localhost:3210
PORT=3501
POPULATE_RECIPE_STORE_DIR=.bigset/populate-recipes

# Required once the backend starts writing rows via internal Convex mutations.
# Generate with: docker compose exec convex ./generate_admin_key.sh
@@ -11,10 +12,43 @@ CONVEX_SELF_HOSTED_ADMIN_KEY=
CLERK_SECRET_KEY=
CLERK_PUBLISHABLE_KEY=

-# OpenRouter API key — required by schema inference.
+# OpenRouter API key — required by schema inference and populate.
# Generate at https://openrouter.ai/settings/keys
OPENROUTER_API_KEY=sk-or-...

# TinyFish API key — used by the populate agent for web search and fetch.
# Generate at https://agent.tinyfish.ai/api-keys
TINYFISH_API_KEY=

# Optional model overrides (see backend/src/openrouter-models.ts).
# Schema inference defaults to anthropic/claude-sonnet-4-6.
# Populate and other non-inference tasks default to google/gemini-3.1-flash-lite.
# OPENROUTER_MODEL=google/gemini-3.1-flash-lite
# OPENROUTER_POPULATE_MODEL=google/gemini-3.1-flash-lite

# Populate runtime limits (see src/pipeline/populate-runtime-limits.ts).
# POPULATE_MAX_FETCH_CALLS caps prioritized source URLs for the populate agent (default 50).
# POPULATE_MAX_ROWS=100
# POPULATE_MAX_SEARCH_CALLS=25
# POPULATE_MAX_FETCH_CALLS=50

# Parallel populate workers (triage + extract per URL shard).
# POPULATE_URLS_PER_WORKER=5
# POPULATE_MAX_TINYFISH_AGENT_RUNS=5
# POPULATE_ENABLE_TINYFISH_AGENT=true
# POPULATE_TINYFISH_AGENT_POLL_TIMEOUT_MS=480000

# Central collection memory (repair_loop placeholder + agent_visited_urls).
# Defaults to a sibling of POPULATE_RECIPE_STORE_DIR (e.g. .bigset/collection-memory).
# POPULATE_ENABLE_COLLECTION_MEMORY=true
# POPULATE_COLLECTION_MEMORY_DIR=.bigset/collection-memory
# POPULATE_MAX_REPAIR_LOOPS=3

# Playwright agent dock (default off; replays URLs with saved Tinyfish emitted_process).
# POPULATE_ENABLE_PLAYWRIGHT_AGENT=false
# POPULATE_MAX_PLAYWRIGHT_AGENT_RUNS=5
# POPULATE_PLAYWRIGHT_AGENT_POLL_TIMEOUT_MS=480000

# When true, Mastra populate benchmark runs write intermediate artifacts under
# BIGSET_BENCHMARK_ARTIFACT_DIR/debug/ (JSON + CSV). No effect when unset/false.
# POPULATE_BENCHMARK_DEBUG=false
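
Reviewer note: the limits above are all commented out, so the backend has to fall back to defaults when they are unset or malformed. Below is a minimal sketch of how such a loader might look; it is not the contents of `src/pipeline/populate-runtime-limits.ts`, which this diff does not show. The env names come from this file. The fetch default of 50 and the Playwright "default off" are stated in the comments above, while treating 100 and 25 as defaults for rows and search calls is an assumption based on the commented example values.

```typescript
// Hypothetical loader: helper and field names are illustrative only.
type Env = Record<string, string | undefined>;

// Parses a positive integer, falling back when unset, blank, or invalid.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Accepts only the literal strings "true" / "false" (case-insensitive).
function readFlag(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name]?.trim().toLowerCase();
  if (raw === "true") return true;
  if (raw === "false") return false;
  return fallback;
}

interface PopulateLimits {
  maxRows: number;
  maxSearchCalls: number;
  maxFetchCalls: number;
  enablePlaywrightAgent: boolean;
}

function loadPopulateLimits(env: Env): PopulateLimits {
  return {
    maxRows: readPositiveInt(env, "POPULATE_MAX_ROWS", 100),              // assumed default
    maxSearchCalls: readPositiveInt(env, "POPULATE_MAX_SEARCH_CALLS", 25), // assumed default
    maxFetchCalls: readPositiveInt(env, "POPULATE_MAX_FETCH_CALLS", 50),   // documented default
    enablePlaywrightAgent: readFlag(env, "POPULATE_ENABLE_PLAYWRIGHT_AGENT", false),
  };
}

// A malformed value falls back to its default instead of crashing the run.
console.log(loadPopulateLimits({ POPULATE_MAX_ROWS: "250", POPULATE_MAX_FETCH_CALLS: "abc" }));
```

Falling back silently is one design choice; rejecting malformed values at startup would surface typos in `.env` sooner, at the cost of a harder failure mode.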
11 changes: 11 additions & 0 deletions backend/.gitignore
@@ -1,5 +1,16 @@
node_modules/
dist/
.env
.env.local
*.tsbuildinfo
drizzle/

# Local populate / collection runtime (see POPULATE_RECIPE_STORE_DIR, POPULATE_COLLECTION_MEMORY_DIR).
# Root .gitignore also ignores .bigset/ when those dirs live there.

# BigSet_Data_Collection_Agent CLI output (default: runs/{run_id}/ under cwd).
runs/
memory/

# Mastra local dev output
.mastra/