Merged
9 changes: 9 additions & 0 deletions .env.example
@@ -14,6 +14,15 @@ OPENROUTER_API_KEY=sk-or-...
# Used by the backend container to call internal Convex functions.
CONVEX_SELF_HOSTED_ADMIN_KEY=

# TinyFish — used by the backend's populate agent for web search and fetch.
# Generate at https://agent.tinyfish.ai/api-keys
TINYFISH_API_KEY=

# Resend (optional — transactional emails when a populate workflow finishes).
# Unset → email module logs and no-ops. Generate at https://resend.com/api-keys
RESEND_API_KEY=
EMAIL_FROM="BigSet <simantak@tinyfish.ai>"

# PostHog (optional — leave blank to disable analytics entirely in local dev).
# Get from https://us.posthog.com/project/settings/general.
NEXT_PUBLIC_POSTHOG_KEY=
10 changes: 6 additions & 4 deletions README.md
@@ -112,7 +112,7 @@ Open [localhost:3500](http://localhost:3500) and click **Get started** to sign i
| Auth | [Clerk](https://clerk.com) |
| Database | [Convex](https://convex.dev) (self-hosted) |
| Data Collection | [TinyFish](https://tinyfish.ai) APIs (Search, Fetch, Browser) |
| Schema inference | [Mastra](https://mastra.ai) workflows + [Vercel AI SDK](https://sdk.vercel.ai) + [OpenRouter](https://openrouter.ai) → Claude Sonnet |
| AI orchestration | [Mastra](https://mastra.ai) workflows + [Vercel AI SDK](https://sdk.vercel.ai) + [OpenRouter](https://openrouter.ai) → Claude Sonnet (schema inference + populate agent) |
| Table view | [TanStack Table](https://tanstack.com/table) + [react-window](https://github.com/bvaughn/react-window) virtualization |
| Exports | CSV (built-in) + XLSX ([SheetJS](https://sheetjs.com), dynamic-imported) |
| Analytics | [PostHog](https://posthog.com) — events, session replay, error tracking (optional) |
@@ -124,9 +124,11 @@ bigset/
├── frontend/ Next.js 16 — UI + Convex schema & functions
│ ├── convex/ Convex functions, schema, authz + quota helpers
│ └── .env.local Clerk + Convex keys (not committed)
├── backend/ Fastify + Mastra — schema inference + (future) agents
│ ├── src/pipeline/ Pure schema-inference fn (called by Fastify + Mastra)
│ └── src/mastra/ Mastra workflows (Studio at :4111 in dev)
├── backend/ Fastify + Mastra — schema inference + populate agent
│ ├── src/pipeline/ Pure pipelines: schema inference + populate context
│ ├── src/mastra/ Mastra workflows, agents, and tools (Studio at :4111 in dev)
│ ├── src/email/ Transactional email (Resend) — sends "dataset ready" notifications
│ └── src/analytics/ Server-side PostHog wrapper for backend-only events
├── scripts/ One-off scripts (e.g. verify-authz.sh)
├── .env Clerk keys for docker-compose (not committed)
├── docker-compose.dev.yml
11 changes: 11 additions & 0 deletions backend/.env.example
@@ -18,3 +18,14 @@ OPENROUTER_API_KEY=sk-or-...
# TinyFish API key — used by the populate agent for web search and fetch.
# Generate at https://agent.tinyfish.ai/api-keys
TINYFISH_API_KEY=

# Resend (transactional email) — optional. When unset, the email module
# logs and no-ops. Generate at https://resend.com/api-keys
RESEND_API_KEY=
# Sender address. The domain must be verified in the Resend dashboard.
EMAIL_FROM="BigSet <simantak@tinyfish.ai>"

# PostHog server-side analytics — optional. Same project key as the
# frontend (phc_...). Used to track email-lifecycle events server-side.
POSTHOG_KEY=
POSTHOG_HOST=https://us.i.posthog.com
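Review note: the "optional, no-op when unset" behaviour described in these comments could be read along the following lines. This is a minimal sketch, not the PR's `env` module. `readOptionalEnv`, `blankToUndefined`, and the `OptionalEnv` shape are hypothetical names, and the fallback for `POSTHOG_HOST` is an assumption taken from the example value above.

```typescript
// Hypothetical shape for the optional settings added in this file.
interface OptionalEnv {
  RESEND_API_KEY: string | undefined;
  EMAIL_FROM: string | undefined;
  POSTHOG_KEY: string | undefined;
  POSTHOG_HOST: string;
}

// Treat blank values (e.g. `RESEND_API_KEY=`) the same as unset ones.
function blankToUndefined(value: string | undefined): string | undefined {
  const trimmed = value?.trim();
  return trimmed ? trimmed : undefined;
}

// Assumption: the variable source is passed in (for example `process.env`)
// so the function stays pure and easy to test.
function readOptionalEnv(source: Record<string, string | undefined>): OptionalEnv {
  return {
    RESEND_API_KEY: blankToUndefined(source["RESEND_API_KEY"]),
    EMAIL_FROM: blankToUndefined(source["EMAIL_FROM"]),
    POSTHOG_KEY: blankToUndefined(source["POSTHOG_KEY"]),
    // Assumed default, mirroring the example value in this file.
    POSTHOG_HOST: blankToUndefined(source["POSTHOG_HOST"]) ?? "https://us.i.posthog.com",
  };
}
```

Normalising blank strings to `undefined` in one place means downstream modules can use a plain truthiness check to decide whether to no-op.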
8 changes: 4 additions & 4 deletions backend/CLAUDE.md
@@ -23,14 +23,14 @@ The pipeline is a pure function (`inferSchema(prompt) → DatasetSchema`). It is

`src/mastra/` — wraps pipelines into Mastra workflows. Runs as a separate Docker service on :4111 with `mastra dev`, which provides a Studio UI for inspecting and testing workflows.

- `src/mastra/index.ts` — registers agents and workflows with the `Mastra` instance
- `src/mastra/index.ts` — registers workflows with the `Mastra` instance (the populate agent is built per-run, not registered as a singleton)
- `src/mastra/workflows/infer-schema.ts` — `inferSchemaWorkflow`, a single-step workflow wrapping `inferSchema()`
- `src/mastra/workflows/populate.ts` — `populateWorkflow`, 3-step workflow: clear rows → build prompt → run populate agent
- `src/mastra/agents/populate.ts` — `populateAgent`, an AI agent (Claude Sonnet 4.6 via OpenRouter) with 7 tools for database CRUD and web access
- `src/mastra/tools/dataset-tools.ts` — 5 Convex-backed tools: `insert_row`, `list_rows`, `get_row`, `update_row`, `delete_row`
- `src/mastra/agents/populate.ts` — `buildPopulateAgent(authorizedDatasetId, authContext)`, a factory that builds a dataset-scoped Claude Sonnet 4.6 agent with 7 tools for database CRUD and web access
- `src/mastra/tools/dataset-tools.ts` — `buildPopulateTools(authorizedDatasetId, authContext)` factory returning 5 Convex-backed tools: `insert_row`, `list_rows`, `get_row`, `update_row`, `delete_row`. The dataset id is captured by closure so the LLM cannot redirect writes to other datasets; `authContext` (Clerk userId + workflow run id) is captured for caller-attribution in security logs and the `CAPABILITY_VIOLATION` PostHog event. See the security note at the top of the file.
- `src/mastra/tools/web-tools.ts` — 2 TinyFish API tools: `search_web`, `fetch_page`

The populate agent uses `createStep(agent, { maxSteps: 80 })` to allow enough tool-call rounds for web research + row insertion.
The populate workflow builds a fresh agent per run via `buildPopulateAgent(...)` and calls `.generate(prompt, { maxSteps: 80 })` to allow enough tool-call rounds for web research + row insertion. Per-run construction is required by the capability-scoping security model (closure-bound dataset id); do not cache or share agents across runs.

All tools return structured error messages (not thrown exceptions) so the agent can self-correct.
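Review note: the closure-scoping and structured-error conventions described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code in `dataset-tools.ts`: `buildScopedTools`, `ToolResult`, and the `onViolation` callback are hypothetical, an in-memory `Map` stands in for Convex, and the tools are synchronous here only for brevity.

```typescript
// Hypothetical result type: errors are returned as data, never thrown.
type ToolResult<T> = { ok: true; value: T } | { ok: false; error: string };

interface Row {
  id: string;
  datasetId: string;
  data: Record<string, unknown>;
}

interface AuthContext {
  userId: string;
  runId: string;
}

// In-memory stand-in for the Convex tables (an assumption for this sketch).
const rows = new Map<string, Row>();
let nextId = 1;

function buildScopedTools(
  authorizedDatasetId: string,
  authContext: AuthContext,
  onViolation: (detail: { userId: string; runId: string; tool: string }) => void,
) {
  return {
    // No datasetId parameter: the caller cannot choose where rows are written.
    insertRow(data: Record<string, unknown>): ToolResult<string> {
      const id = `row_${nextId++}`;
      rows.set(id, { id, datasetId: authorizedDatasetId, data });
      return { ok: true, value: id };
    },
    getRow(rowId: string): ToolResult<Row> {
      const row = rows.get(rowId);
      // Unknown ids and rows from other datasets receive the same refusal.
      if (!row || row.datasetId !== authorizedDatasetId) {
        onViolation({ ...authContext, tool: "get_row" });
        return { ok: false, error: `Row ${rowId} was not found in this dataset.` };
      }
      return { ok: true, value: row };
    },
  };
}
```

Because the dataset id never appears in a tool's input, a prompt-injected instruction has no parameter through which to redirect a write, and the returned error string gives the agent something to react to on its next step.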

75 changes: 63 additions & 12 deletions backend/package-lock.json

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions backend/package.json
@@ -19,6 +19,8 @@
"dotenv": "^16.4.0",
"fastify": "^5.0.0",
"fastify-plugin": "^5.1.0",
"posthog-node": "^5.35.1",
"resend": "^6.12.3",
"zod": "^4.4.3"
},
"devDependencies": {
26 changes: 26 additions & 0 deletions backend/src/analytics/events.ts
@@ -0,0 +1,26 @@
/**
 * Backend-side event names. Past-tense snake_case, matching the
 * frontend's `EVENTS` constant in `frontend/lib/analytics.ts`.
 *
 * These events fire from server-side code paths the frontend can't
 * observe (e.g. the email actually leaving the building, not just the
 * /populate response returning success).
 */
export const EVENTS = {
  /** Resend accepted the email for delivery. */
  DATASET_READY_EMAIL_SENT: "dataset_ready_email_sent",
  /** Notify attempted but couldn't deliver — see `error_kind` property. */
  DATASET_READY_EMAIL_FAILED: "dataset_ready_email_failed",
  /**
   * A populate-agent tool call was refused because the LLM tried to
   * touch a row outside its authorized dataset (or fabricated an id).
   *
   * Fires per refused operation, never per success. Payload is
   * deliberately small — see backend/src/mastra/tools/dataset-tools.ts.
   * Useful as a leading indicator for prompt-injection attempts and as
   * a regression signal if the closure-scoping discipline ever breaks.
   */
  CAPABILITY_VIOLATION: "capability_violation",
} as const;

export type BackendEventName = (typeof EVENTS)[keyof typeof EVENTS];
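
Review note: the `as const` object plus the indexed-access type yields a string-literal union, so a misspelled event name fails at compile time. The sketch below repeats the constant to stay self-contained; `isBackendEventName` is a hypothetical helper that is not part of this PR.

```typescript
// Repeated from the diff above so the sketch is self-contained.
const EVENTS = {
  DATASET_READY_EMAIL_SENT: "dataset_ready_email_sent",
  DATASET_READY_EMAIL_FAILED: "dataset_ready_email_failed",
  CAPABILITY_VIOLATION: "capability_violation",
} as const;

// Resolves to the union of the three string literals above.
type BackendEventName = (typeof EVENTS)[keyof typeof EVENTS];

// Hypothetical runtime guard: narrows an arbitrary string to a known event.
function isBackendEventName(value: string): value is BackendEventName {
  return Object.keys(EVENTS).some(
    (key) => EVENTS[key as keyof typeof EVENTS] === value,
  );
}

// Compiles: the value is a member of the union.
const sentEvent: BackendEventName = EVENTS.DATASET_READY_EMAIL_SENT;

// Would not compile: "dataset_ready_email_send" is not in the union.
// const typo: BackendEventName = "dataset_ready_email_send";
```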
72 changes: 72 additions & 0 deletions backend/src/analytics/posthog.ts
@@ -0,0 +1,72 @@
import { PostHog } from "posthog-node";
import { env } from "../env.js";
import type { BackendEventName } from "./events.js";

/**
 * Server-side PostHog wrapper.
 *
 * Why a separate module from the frontend's `lib/analytics.ts`:
 *   - The backend fires events the frontend can't observe (e.g. the
 *     email actually being accepted by Resend, server-only failures).
 *   - Same PostHog project; same `phc_...` key. Events keyed by the
 *     Clerk userId associate to the same person the frontend already
 *     identified via `analytics-provider.tsx`.
 *
 * Behavior:
 *   - No-op when `POSTHOG_KEY` is unset (local dev without an account).
 *   - `flushAt: 1` ships events immediately. Low volume, simpler reasoning;
 *     no buffered events sitting in memory across restarts.
 *   - All `capture` calls are wrapped in try/catch — analytics failures
 *     must NEVER affect the request that triggered them.
 */

let client: PostHog | null = null;

function getClient(): PostHog | null {
  if (client) return client;
  if (!env.POSTHOG_KEY) return null;
  client = new PostHog(env.POSTHOG_KEY, {
    host: env.POSTHOG_HOST,
    flushAt: 1,
  });
  return client;
}

export function isAnalyticsEnabled(): boolean {
  return Boolean(env.POSTHOG_KEY);
}

/**
 * Fire an event keyed to a Clerk user id. Safe to call without checking
 * `isAnalyticsEnabled()` first — no-ops cleanly when disabled.
 */
export function capture(params: {
  distinctId: string;
  event: BackendEventName;
  properties?: Record<string, unknown>;
}): void {
  const c = getClient();
  if (!c) return;
  try {
    c.capture({
      distinctId: params.distinctId,
      event: params.event,
      properties: params.properties,
    });
  } catch (err) {
    console.error("[analytics] capture failed", err);
  }
}

/**
 * Flush pending events. Wire into Fastify's `onClose` so SIGTERM doesn't
 * drop in-flight captures.
 */
export async function shutdown(): Promise<void> {
  if (!client) return;
  try {
    await client.shutdown();
  } catch (err) {
    console.error("[analytics] shutdown failed", err);
  }
}
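
Review note: the three behaviours this module documents (no-op when the key is unset, lazy client construction, swallowed capture errors) can be exercised without a PostHog account by injecting the client. The following is a sketch only. `AnalyticsClient` and `createAnalytics` are hypothetical and stand in for the real `posthog-node` client; returning a boolean is an addition made purely so the behaviour is observable in a test.

```typescript
// Minimal interface standing in for the posthog-node client (an assumption;
// the real module imports `PostHog` from "posthog-node").
interface AnalyticsClient {
  capture(event: { distinctId: string; event: string }): void;
}

// Hypothetical factory: `makeClient` runs at most once, on first capture.
function createAnalytics(
  key: string | undefined,
  makeClient: (key: string) => AnalyticsClient,
) {
  let client: AnalyticsClient | null = null;

  return {
    // Returns true if the event was handed to the client, false otherwise.
    capture(distinctId: string, event: string): boolean {
      if (!key) return false; // analytics disabled: clean no-op
      if (client === null) client = makeClient(key); // lazy construction
      try {
        client.capture({ distinctId, event });
        return true;
      } catch {
        // Swallow the failure so the triggering request is unaffected.
        return false;
      }
    },
  };
}
```

Injecting the constructor keeps the no-op and error paths testable in isolation; the module in this PR achieves the same effect with a module-level singleton and `env.POSTHOG_KEY`.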