5 changes: 5 additions & 0 deletions .gitignore
@@ -1,5 +1,9 @@
.DS_Store
node_modules/

# No root package.json — ignore accidental `npm install` at repo root.
# backend/package-lock.json is committed (npm ci in Docker). frontend uses bun.lock.
/package-lock.json
.npm-cache/
.env
.env.local
@@ -20,6 +24,7 @@ yarn-debug.log*
*.bak
tmp/
temp/
benchmark-results/

.mastra

18 changes: 14 additions & 4 deletions README.md
@@ -33,7 +33,7 @@ Any dataset. Any source. Always fresh. That's the idea.

## 🚀 Quick Start

-**Prerequisites:** [Docker](https://docs.docker.com/get-docker/), [Make](https://www.gnu.org/software/make/), and a free [Clerk](https://dashboard.clerk.com) account
+**Prerequisites:** [Docker](https://docs.docker.com/get-docker/), [Make](https://www.gnu.org/software/make/), [Node.js](https://nodejs.org) (for the Convex CLI on your machine), and a free [Clerk](https://dashboard.clerk.com) account

### 1. Clone and set up Clerk

@@ -56,7 +56,17 @@ cp .env.example .env

> **Optional:** to enable [PostHog](https://posthog.com) product analytics + session replay + error tracking, set `NEXT_PUBLIC_POSTHOG_KEY` and `NEXT_PUBLIC_POSTHOG_HOST`. Leave blank to disable cleanly (the app no-ops every event).

-### 3. Start everything
+### 3. Install frontend dependencies (host)

`make dev` deploys Convex functions from your machine (not inside Docker), so the `convex` package must be installed locally:

```bash
cd frontend
bun install # or: npm install
cd ..
```

### 4. Start everything

```bash
make dev
@@ -70,15 +80,15 @@ Once it's up:
- Convex dashboard: http://localhost:6791
- [Mastra Studio](https://mastra.ai) (workflow inspector): http://localhost:4111

-### 4. Generate Convex admin key (first time only)
+### 5. Generate Convex admin key (first time only)

```bash
docker compose exec convex ./generate_admin_key.sh
```

Paste the output into `.env` as `CONVEX_SELF_HOSTED_ADMIN_KEY`, then re-run `make dev`.

-### 5. Load curated public datasets
+### 6. Load curated public datasets

The landing page and the dashboard's "Curated" section read from a set of 9 system-owned datasets. Load them with:

104 changes: 104 additions & 0 deletions benchmarks/dataset-agent/README.md
@@ -0,0 +1,104 @@
# Dataset Agent Benchmark

Shared harness for scoring the Mastra populate stack (orchestrator + `investigate_row` subagents) against a fixed prompt pack.

## Run Mastra Populate

```bash
cd backend && npm ci

node benchmarks/dataset-agent/run-benchmark.mjs \
--system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'
```
Comment on lines +8 to +12

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix working-directory mismatch in run instructions.

Line 8 changes the working directory to `backend`, but lines 10-12 use paths relative to the repo root. Running the snippet as written therefore fails path resolution.

💡 Suggested doc fix
-cd backend && npm ci
+cd backend && npm ci && cd ..
 
 node benchmarks/dataset-agent/run-benchmark.mjs \
   --system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/dataset-agent/README.md` around lines 8 - 12, The README currently
switches to the backend working directory with "cd backend && npm ci" but the
subsequent run command uses root-relative paths ("node
benchmarks/dataset-agent/run-benchmark.mjs" and the adapter path
"benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs"), causing path
resolution to fail; fix by making the working directory usage consistent: either
keep the "cd backend" step and update the run command to reference the backend
directory (prefix paths with "../" or "./backend/" as appropriate) or remove the
"cd backend" and run "npm ci" using "npm ci --prefix backend" so the run command
can stay root-relative—ensure the updated instructions consistently reference
the same cwd for the run-benchmark.mjs command and the
mastra-populate-adapter.mjs adapter path.


Requires `OPENROUTER_API_KEY` and `TINYFISH_API_KEY` in `.env` / `backend/.env.local`.

Open-ended prompts are slow (many subagent calls). Use a longer timeout when needed:

```bash
node benchmarks/dataset-agent/run-benchmark.mjs \
--timeout-ms 1800000 \
--prompt-ids yc-recent-batch-companies \
--system mastra='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/mastra-populate-adapter.mjs'
```

## Why stdout used to look empty

Production `search_web` / `fetch_page` log with `console.log`, which used to fill **stdout** and break JSON parsing. The adapter now:

1. Redirects all `console.log` to **stderr** during the run
2. Writes **only** the benchmark JSON to stdout via `process.stdout.write`
3. Snapshots `benchmark-payload.json` under the artifact dir after each subagent session and row insert (survives timeouts)

If stdout still cannot be parsed, `run-benchmark.mjs` falls back to `benchmark-payload.json` in the prompt artifact folder.
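The stdout handling described above can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: `withLogsOnStderr`, `parsePayload`, and the `readSnapshot` callback are hypothetical names, and the snapshot is assumed to be the text of `benchmark-payload.json`.

```javascript
// Hypothetical sketch; function names are illustrative, not taken from the adapter.

// Route console.log to stderr while `fn` runs, then restore the original.
async function withLogsOnStderr(fn) {
  const originalLog = console.log;
  console.log = (...args) => console.error(...args);
  try {
    return await fn();
  } finally {
    console.log = originalLog;
  }
}

// Try to parse the adapter's stdout; on failure, fall back to the snapshot.
// `readSnapshot` is an assumed callback returning the text of
// benchmark-payload.json (for example via fs.readFileSync).
function parsePayload(stdoutText, readSnapshot) {
  try {
    return { source: "stdout", payload: JSON.parse(stdoutText) };
  } catch {
    return { source: "snapshot", payload: JSON.parse(readSnapshot()) };
  }
}
```

Passing the snapshot reader as a callback keeps the file system out of the parsing logic, so the fallback path can be exercised without touching disk.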

## Token usage (requirement 1)

Each orchestrator and investigate `agent.generate` call records:

- Per-session `usage` in `sessions/<nnn>-<kind>-<entity>.json`
- Rollups in `usage.json` and `benchmarkTrace.usage` / `usageByKind` inside the stdout payload
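A rollup of this kind might look like the sketch below. The session shape (`{ kind, usage }`) and the function name `rollUpUsage` are assumptions for illustration; the token field names are taken from the `usage` object in the output contract.

```javascript
// Hypothetical sketch: the session shape ({ kind, usage }) is assumed.
// Token field names follow the `usage` object in the output contract.
function rollUpUsage(sessions) {
  const empty = () => ({ promptTokens: 0, completionTokens: 0, totalTokens: 0 });
  const total = empty();
  const byKind = {};
  for (const { kind, usage = {} } of sessions) {
    // Create the per-kind bucket the first time a kind is seen.
    const bucket = (byKind[kind] ??= empty());
    for (const key of Object.keys(total)) {
      const count = usage[key] ?? 0; // treat missing counters as zero
      total[key] += count;
      bucket[key] += count;
    }
  }
  return { usage: total, usageByKind: byKind };
}
```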

## Rows for scoring (requirement 2)

Rows are collected in an **in-memory store** inside the adapter (same shape as production inserts, without Convex). Scoring uses:

- `rows` in stdout / `benchmark-payload.json`
- `rows.json`, `rows.csv` in the artifact directory
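As a rough illustration of an in-memory store that can feed both `rows.json` and `rows.csv`, consider the sketch below. It assumes each row is a flat object keyed by column name; `createRowStore` and its methods are hypothetical and not the adapter's real interface.

```javascript
// Hypothetical in-memory row store; assumes each row is a flat object of cell values.
function createRowStore(columns) {
  const rows = [];
  // Quote a cell only when it contains a comma, quote, or line break.
  const escapeCell = (value) => {
    const text = value == null ? "" : String(value);
    return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  return {
    insert(row) {
      rows.push({ ...row }); // copy so later caller mutations do not leak in
    },
    all() {
      return rows.map((row) => ({ ...row }));
    },
    toCsv() {
      const lines = [columns, ...rows.map((row) => columns.map((c) => row[c]))];
      return lines.map((cells) => cells.map(escapeCell).join(",")).join("\n");
    },
  };
}
```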

## Stage artifacts (requirement 3)

Each prompt run writes under `benchmark-results/<run>/mastra/<nn>-<prompt-id>/`:

| File | Contents |
|------|----------|
| `user-prompt.txt` | Benchmark prompt text |
| `orchestrator-prompt.txt` | Full prompt passed to populate agent |
| `run-meta.json` | ids, columns, step limits |
| `sessions/001-orchestrator.json` | Orchestrator prompt, steps summary, usage, response |
| `sessions/002-investigate-<entity>.json` | Per-lead subagent prompt, parsed INSERTED/SUMMARY/CLUES/REASON, steps, usage |
| `inserts.json` | Each `insert_row` with session + cell data |
| `rows.json` / `rows.csv` | Final rows for review |
| `usage.json` | Total + per-kind + per-session token totals |
| `tool-logs.txt` | Redirected web-tool log lines |
| `run-report.json` | High-level run summary |
| `benchmark-payload.json` | Same object as stdout (updated incrementally) |

Set `BIGSET_MASTRA_BENCHMARK_DEBUG=true` to log the artifact path on stderr.
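The directory and session file naming shown above could be produced along these lines. The helper names are hypothetical, and the zero-padding widths (two digits for the prompt index, three for the session sequence) are inferred from the `<nn>` and `<nnn>` placeholders rather than confirmed against the adapter.

```javascript
// Hypothetical helpers; padding widths are inferred from the <nn> / <nnn> placeholders.
function promptArtifactDir(run, system, index, promptId) {
  const prefix = String(index).padStart(2, "0");
  return `benchmark-results/${run}/${system}/${prefix}-${promptId}`;
}

// The entity suffix is optional: the orchestrator session has none.
function sessionFileName(sequence, kind, entity) {
  const prefix = `${String(sequence).padStart(3, "0")}-${kind}`;
  return entity ? `sessions/${prefix}-${entity}.json` : `sessions/${prefix}.json`;
}
```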

## Optional env

| Variable | Default | Purpose |
|----------|---------|---------|
| `BIGSET_MASTRA_BENCHMARK_MAX_STEPS` | `80` | Orchestrator step budget |
| `BIGSET_MASTRA_BENCHMARK_TARGET_ROWS` | `20` | Target rows mentioned in prompt |

## Smoke + unit tests

```bash
node benchmarks/dataset-agent/run-benchmark.mjs \
--prompt-ids latest-ai-blog-posts \
--system smoke='node benchmarks/dataset-agent/adapters/smoke-adapter.mjs'

node --test benchmarks/dataset-agent/run-benchmark.test.mjs
```

## Output contract (stdout)

```json
{
"rows": [],
"validationIssues": [],
"usage": { "promptTokens": 0, "completionTokens": 0, "totalTokens": 0 },
"metrics": { "searchCalls": 0, "fetchCalls": 0, "browserCalls": 0, "agentRuns": 0, "agentSteps": 0 },
"benchmarkTrace": {
"sessionCount": 0,
"insertCount": 0,
"usage": {},
"usageByKind": { "orchestrator": {}, "investigate": {} },
"sessions": []
}
}
```
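A consumer of this contract may want a quick shape check before scoring. The sketch below is one possible validator; `validatePayload` is a hypothetical name, and treating every field shown above as required is an assumption, not something the harness is known to enforce.

```javascript
// Hypothetical validator; assumes every field in the contract above is required.
function validatePayload(payload) {
  if (payload === null || typeof payload !== "object") {
    return ["payload is not an object"];
  }
  const problems = [];
  for (const key of ["rows", "validationIssues"]) {
    if (!Array.isArray(payload[key])) problems.push(`${key} must be an array`);
  }
  for (const key of ["promptTokens", "completionTokens", "totalTokens"]) {
    if (typeof payload.usage?.[key] !== "number") {
      problems.push(`usage.${key} must be a number`);
    }
  }
  for (const key of ["searchCalls", "fetchCalls", "browserCalls", "agentRuns", "agentSteps"]) {
    if (typeof payload.metrics?.[key] !== "number") {
      problems.push(`metrics.${key} must be a number`);
    }
  }
  if (!Array.isArray(payload.benchmarkTrace?.sessions)) {
    problems.push("benchmarkTrace.sessions must be an array");
  }
  return problems; // an empty array means the shape looks right
}
```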

Delete the `benchmarks/` folder to remove all benchmark tooling from the repo — no `backend/src` benchmark code is required.
1 change: 1 addition & 0 deletions benchmarks/dataset-agent/adapters/.gitignore
@@ -0,0 +1 @@
local-*.mjs