Merged
Commits
26 commits
bedc9d0
Subscribe to theme changes
AdamEXu Jun 2, 2026
764a1fa
OR Base URL
AdamEXu Jun 3, 2026
f909a8a
Merge branch 'main' of https://github.com/tinyfish-io/bigset
AdamEXu Jun 3, 2026
228550c
add local mode
AdamEXu Jun 3, 2026
f7ee62b
swap in wordmarks
AdamEXu Jun 3, 2026
8f155cd
Add OS keychain storage for local credentials
AdamEXu Jun 4, 2026
eee5b3c
Remove local OpenRouter OAuth setup
AdamEXu Jun 4, 2026
15bc03a
remove nextjs dev indicator
AdamEXu Jun 4, 2026
376c777
Merge branch 'main' of https://github.com/tinyfish-io/bigset
AdamEXu Jun 4, 2026
cac2e9e
allow more origins (less strict CORS)
AdamEXu Jun 4, 2026
09d612e
fix up settings page
AdamEXu Jun 4, 2026
4b73c10
add build script
AdamEXu Jun 4, 2026
7d993dd
Merge branch 'main' of https://github.com/tinyfish-io/bigset
AdamEXu Jun 5, 2026
c7f77e8
fix a few bugs and things brought up by coderabbit
AdamEXu Jun 5, 2026
875cda1
remove the convex skip thing
AdamEXu Jun 5, 2026
7f074ec
Merge branch 'main' into main
simantak-dabhade Jun 5, 2026
aea1de0
check for bad open router credential
AdamEXu Jun 5, 2026
ae397b3
undo something
AdamEXu Jun 5, 2026
d918c06
Merge branch 'main' of https://github.com/AdamEXu/bigset
AdamEXu Jun 5, 2026
d596115
add automated building scripts
AdamEXu Jun 5, 2026
f3dae21
convex! to the github!
AdamEXu Jun 5, 2026
59a7597
Fix Windows release backend bundling
AdamEXu Jun 5, 2026
16ea01c
Pin release workflow actions to SHAs
AdamEXu Jun 6, 2026
2dedc18
fix some issues that coderabbit didn't like
AdamEXu Jun 6, 2026
dbe6924
fix build thing
AdamEXu Jun 6, 2026
4735533
update README
AdamEXu Jun 8, 2026
50 changes: 0 additions & 50 deletions .env.example

This file was deleted.

101 changes: 101 additions & 0 deletions .github/workflows/build-release-artifacts.yml
@@ -0,0 +1,101 @@
---
name: Build Release Artifacts

on: # yamllint disable-line rule:truthy
  workflow_dispatch:
    inputs:
      release_tag:
        description: Existing release tag to upload assets to. Leave empty to only upload workflow artifacts.
        required: false
        type: string
  release:
    types: [published]

permissions:
  contents: write

jobs:
  build:
    name: Build ${{ matrix.platform }}
    runs-on: ${{ matrix.runner }}
    strategy:
      fail-fast: false
      matrix:
        include:
          - platform: darwin-arm64
            runner: macos-15
          - platform: darwin-x64
            runner: macos-15-intel
          - platform: linux-arm64
            runner: ubuntu-24.04-arm
          - platform: linux-x64
            runner: ubuntu-24.04
          - platform: win32-arm64
            runner: windows-11-arm
          - platform: win32-x64
            runner: windows-2025

    env:
      ARTIFACT_NAME: bigset-build-${{ matrix.platform }}.zip
      RELEASE_TAG: ${{ github.event.release.tag_name || inputs.release_tag }}

    steps:
      - name: Checkout
        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5
        with:
          persist-credentials: false

      - name: Setup Node
        uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020
        with:
          node-version: "24"

      - name: Install frontend dependencies
        working-directory: frontend
        run: npm install --silent

      - name: Install backend dependencies
        working-directory: backend
        run: npm install --silent

      - name: Build release
        run: node scripts/build-release.mjs

      - name: Rename artifact
        run: node -e "const fs = require('fs'); fs.renameSync('dist/bigset-build.zip', 'dist/' + process.env.ARTIFACT_NAME);"

      - name: Upload workflow artifact
        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02
        with:
          name: ${{ matrix.platform }}
          path: dist/${{ env.ARTIFACT_NAME }}
          if-no-files-found: error

      - name: Validate release tag
        if: github.event_name == 'release' || inputs.release_tag != ''
        shell: bash
        run: |
          if [[ -z "$RELEASE_TAG" ]]; then
            echo "Release tag is required when uploading release assets." >&2
            exit 1
          fi
          if [[ ! "$RELEASE_TAG" =~ ^[A-Za-z0-9][A-Za-z0-9._/@+-]*$ ]]; then
            echo "Release tag contains unsupported characters." >&2
            exit 1
          fi

      - name: Upload release asset
        if: github.event_name == 'release' || inputs.release_tag != ''
        shell: bash
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh release upload "$RELEASE_TAG" "dist/$ARTIFACT_NAME" --clobber

      - name: Upload legacy release asset
        if: (github.event_name == 'release' || inputs.release_tag != '') && matrix.platform == 'darwin-arm64'
        shell: bash
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          node -e "const fs = require('fs'); fs.copyFileSync('dist/' + process.env.ARTIFACT_NAME, 'dist/bigset-build.zip');"
          gh release upload "$RELEASE_TAG" "dist/bigset-build.zip" --clobber
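
Reviewer note: the "Validate release tag" step rejects empty tags and tags with characters outside a small allow-list before anything reaches `gh release upload`. The same check can be tried locally with a minimal Node sketch. The helper name `isValidReleaseTag` is hypothetical and not part of this pull request; only the pattern is taken from the workflow.

```javascript
// Same pattern the workflow's "Validate release tag" step applies in bash.
const RELEASE_TAG_PATTERN = /^[A-Za-z0-9][A-Za-z0-9._\/@+-]*$/;

// Hypothetical helper: true when a tag would pass both workflow checks
// (non-empty, and made only of the allowed characters).
function isValidReleaseTag(tag) {
  return typeof tag === "string" && tag.length > 0 && RELEASE_TAG_PATTERN.test(tag);
}

console.log(isValidReleaseTag("v1.2.3"));
console.log(isValidReleaseTag("v1 beta"));
```

The first character must be alphanumeric, so a tag beginning with `-` is rejected and cannot be mistaken for a command-line flag.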
2 changes: 2 additions & 0 deletions .gitignore
@@ -18,6 +18,8 @@ yarn-debug.log*

# Local-only files
*.bak
.local/
dist/
tmp/
temp/

1 change: 0 additions & 1 deletion AGENTS.md
@@ -8,7 +8,6 @@

## What not to do

- Do not add Clerk, Auth0, or any third-party auth service. We use Better Auth (self-hosted).
- Do not add API routes to the frontend. All API logic belongs in the backend.
- Do not hardcode ports. Read from env vars (`PORT`, `CLIENT_ORIGIN`, `BETTER_AUTH_URL`).
- Do not commit `.env` files or secrets.
124 changes: 72 additions & 52 deletions README.md
@@ -67,63 +67,72 @@ Any dataset. Any source. Always fresh. That's the idea.

## 🚀 Quick Start

**Prerequisites:** [Docker](https://docs.docker.com/get-docker/) and [Make](https://www.gnu.org/software/make/)
**Prerequisites:** [Node.js](https://nodejs.org/) 22+ with npm.

You'll also need API keys from three services (all free to set up):
```bash
npm install --global @adamexu/bigset
bigset
```

That's it. The `bigset` command downloads the current local BigSet release,
starts Convex, the backend, the frontend, and the local credential bridge, then
prints the app URL. Open [127.0.0.1:3500](http://127.0.0.1:3500) in your web browser to use it.

The first run caches release files under `~/.bigset`; after that, starting
BigSet is designed to take only a few seconds.
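
Reviewer note: this diff does not show how the `bigset` command picks which release file to download. The release workflow in this pull request names each archive `bigset-build-<platform>.zip`, and its platform values match Node's `process.platform` and `process.arch` joined by a hyphen. A minimal sketch under that assumption follows; `releaseArtifactName` is a hypothetical helper, not the CLI's actual code.

```javascript
// Platforms built by the release workflow matrix in this pull request.
const SUPPORTED_PLATFORMS = new Set([
  "darwin-arm64",
  "darwin-x64",
  "linux-arm64",
  "linux-x64",
  "win32-arm64",
  "win32-x64",
]);

// Hypothetical helper: maps the running Node platform/arch to an archive name.
function releaseArtifactName(platform = process.platform, arch = process.arch) {
  const key = `${platform}-${arch}`;
  if (!SUPPORTED_PLATFORMS.has(key)) {
    throw new Error(`No BigSet release build for ${key}`);
  }
  return `bigset-build-${key}.zip`;
}

console.log(releaseArtifactName("darwin", "arm64"));
```

Failing early on an unsupported platform gives a clearer message than a failed download later.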

On first launch, BigSet sends you to setup. You'll connect two services:

| Service | What it's for | Get your key |
|---------|--------------|-------------|
| **TinyFish** | Web search + page fetching | [tinyfish.ai/api-keys](https://agent.tinyfish.ai/api-keys?utm_source=github&utm_medium=organic&utm_campaign=bigset-developer-2026q2) |
| **OpenRouter** | LLM calls (schema inference + agents) | [openrouter.ai/settings/keys](https://openrouter.ai/settings/keys) |
| **Clerk** | User authentication | [dashboard.clerk.com](https://dashboard.clerk.com) |

### Step 1: Clone the repo
Local API keys are stored in your OS keychain.

For a one-off run without installing globally:

```bash
git clone https://github.com/tinyfish-io/bigset.git
cd bigset
cp .env.example .env
npx @adamexu/bigset
```

### Step 2: Set up TinyFish (web access)
Useful local options:

TinyFish powers all web search and page fetching. Search and Fetch have generous rate limits.

1. Go to [tinyfish.ai](https://www.tinyfish.ai?utm_source=github&utm_medium=organic&utm_campaign=bigset-developer-2026q2) and create an account
2. Go to [API Keys](https://agent.tinyfish.ai/api-keys?utm_source=github&utm_medium=organic&utm_campaign=bigset-developer-2026q2) and create a key
3. Paste it as `TINYFISH_API_KEY` in `.env`
| Command | What it does |
|---------|-------------|
| `bigset --force` | Redownload the latest cached release |
| `bigset --app-port 4500 --backend-port 4501` | Use alternate app/backend ports |
| `bigset --home ~/.bigset-dev` | Use a separate local cache directory |
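
Reviewer note: the flags in the table could be parsed with Node's built-in `util.parseArgs`. This is only a sketch; the CLI's real parser is not part of this diff. The function name `parseBigsetArgs` is hypothetical, and the defaults are assumptions taken from the README text (app on port 3500, cache under `~/.bigset`). No backend default is assumed because the README does not state one.

```javascript
const { parseArgs } = require("node:util");
const os = require("node:os");
const path = require("node:path");

// Hypothetical parser for the flags listed in the README table.
function parseBigsetArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      force: { type: "boolean", default: false },
      // Assumed default: the README points users at 127.0.0.1:3500.
      "app-port": { type: "string", default: "3500" },
      // No default assumed; the README does not name the backend port.
      "backend-port": { type: "string" },
      // Assumed default: the README says release files are cached under ~/.bigset.
      home: { type: "string", default: path.join(os.homedir(), ".bigset") },
    },
  });

  const backendPort = values["backend-port"];
  return {
    force: values.force,
    appPort: Number(values["app-port"]),
    backendPort: backendPort === undefined ? undefined : Number(backendPort),
    home: values.home,
  };
}

console.log(parseBigsetArgs(["--app-port", "4500", "--backend-port", "4501"]));
```

`parseArgs` runs in strict mode by default, so a misspelled flag throws instead of being silently ignored.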

### Step 3: Set up OpenRouter (LLM)
---

OpenRouter routes LLM calls to Claude Sonnet (schema inference) and Qwen (agents). It's pay-as-you-go; a dataset costs a few dollars in LLM usage.
## Developing From Source

1. Go to [openrouter.ai](https://openrouter.ai) and create an account
2. Go to [Settings → Keys](https://openrouter.ai/settings/keys) and create an API key
3. Paste it as `OPENROUTER_API_KEY` in `.env`
4. Add some credits; $5-10 is plenty to start
Use this path when you're changing BigSet itself. The supported development
workflow is still `make dev`.

### Step 4: Set up Clerk (auth)
**Prerequisites:** [Node.js](https://nodejs.org/) 22+ with npm,
[Docker](https://docs.docker.com/get-docker/), and
[Make](https://www.gnu.org/software/make/).

Clerk handles user sign-in. The setup takes ~2 minutes:
### Step 1: Clone the repo

1. Go to [dashboard.clerk.com](https://dashboard.clerk.com) and create a new application
2. Pick a sign-in method (email, Google, GitHub, whatever you prefer)
3. Once created, go to **Configure → API Keys** in the sidebar
- Copy **Publishable Key** → paste as `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY` in `.env`
- Copy **Secret Key** → paste as `CLERK_SECRET_KEY` in `.env`
4. Go to **Configure → JWT Templates** in the sidebar
- Click **New template** → select the **Convex** template → click **Save**
5. Go to **Configure → Settings** (or **Domains**)
- Find your **Issuer URL** (looks like `https://your-app-name.clerk.accounts.dev`)
- Paste it as `CLERK_JWT_ISSUER_DOMAIN` in `.env`
```bash
git clone https://github.com/tinyfish-io/bigset.git
cd bigset
```

### Step 5: Start everything
### Step 2: Start everything

```bash
make dev
```

This installs dependencies, builds and starts all Docker services (Postgres, Convex, frontend, backend, Mastra), and deploys the Convex schema. On first run, it automatically generates the Convex admin key — no manual steps needed. See [How `make dev` Works](#how-make-dev-works) for the full breakdown.
`make dev` creates a local `.env` if needed, installs dependencies, builds and
starts all Docker services (Postgres, Convex, frontend, backend, Mastra), and
deploys the Convex schema. On first run, it automatically generates the Convex
admin key. See [How `make dev` Works](#how-make-dev-works) for the full
breakdown.

Once everything is ready, you'll see:

@@ -133,13 +142,26 @@ Once everything is ready, you'll see:
| **Convex dashboard** | [localhost:6791](http://localhost:6791) |
| **Mastra Studio** (workflow inspector) | [localhost:4111](http://localhost:4111) |

Open [localhost:3500](http://localhost:3500) and click **Get started** to sign in.
Open [localhost:3500](http://localhost:3500). The setup screen will ask for
TinyFish and OpenRouter credentials and save them to your OS keychain for this
workspace.

### Step 3: Connect TinyFish and OpenRouter

TinyFish powers web search and page fetching. OpenRouter routes LLM calls to
the models BigSet uses for schema inference and agents.

1. Create a TinyFish key at [agent.tinyfish.ai/api-keys](https://agent.tinyfish.ai/api-keys?utm_source=github&utm_medium=organic&utm_campaign=bigset-developer-2026q2)
2. Create an OpenRouter key at [openrouter.ai/settings/keys](https://openrouter.ai/settings/keys)
3. Paste both into BigSet's setup screen

OpenRouter is pay-as-you-go; $5-10 is plenty to start.

> **Note:** root `.env` is the only local env file. If you edit Convex functions in `frontend/convex/`, run `make convex-push` to deploy the changes.

> **Free tier:** each signed-in account gets **2,500 row operations per calendar month** (resets on the 1st, UTC). The header shows a live usage badge; system-owned curated datasets bypass the quota.
> **Free tier:** cloud signed-in accounts get **2,500 row operations per calendar month** (resets on the 1st, UTC). Local mode bypasses the cloud quota and uses your TinyFish/OpenRouter accounts directly.
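
Reviewer note: the quota rule (2,500 row operations per calendar month, reset on the 1st in UTC) comes down to a little date arithmetic. The sketch below is illustrative only; the function names are hypothetical and the real quota code is not shown in this diff.

```javascript
// Monthly allowance stated in the README for cloud signed-in accounts.
const MONTHLY_ROW_OPERATIONS = 2500;

// Hypothetical helper: the next reset is midnight UTC on the 1st of the following month.
// Date.UTC rolls month 12 over to January of the next year.
function nextQuotaReset(now = new Date()) {
  return new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1));
}

// Hypothetical helper: operations left this month, never below zero.
function remainingRowOperations(usedThisMonth) {
  return Math.max(0, MONTHLY_ROW_OPERATIONS - usedThisMonth);
}

console.log(nextQuotaReset(new Date("2026-06-15T12:00:00Z")).toISOString());
```

Using the UTC accessors keeps the reset boundary the same regardless of the server's local time zone.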

### Step 6 (optional): Load curated datasets
### Step 4 (optional): Load curated datasets

BigSet includes 9 curated public datasets (AI companies hiring, GPU prices, model pricing, etc.) that show on the landing page:

@@ -155,15 +177,16 @@ This is idempotent; safe to run multiple times.

`make dev` is designed to handle everything — first run, subsequent runs, and recovery from bad state. You should never need to run any other setup command. Here's what it does, in order:

1. **Validates your `.env`** — checks that all required API keys are set (Clerk, OpenRouter, TinyFish). Stops with a clear error if anything is missing.
1. **Validates your `.env`** — creates local keychain bridge settings automatically.
2. **Installs dependencies** — runs `npm install` in both `frontend/` and `backend/`. Silent if already up to date.
3. **Starts the database layer** — brings up Postgres and Convex (self-hosted) first, since other services depend on them.
4. **Waits for Convex** — polls the Convex health endpoint until it's ready (up to 120s).
5. **Ensures the admin key** — if `CONVEX_SELF_HOSTED_ADMIN_KEY` is empty in `.env`, generates one automatically and writes it. If a key exists, validates it against the running Convex instance. If the key is stale (e.g. you ran `make clean` and wiped the database), it detects the mismatch and regenerates.
6. **Pushes Convex config** — sets the Clerk JWT issuer URL in Convex so auth tokens are validated correctly.
7. **Deploys Convex schema** — pushes the table schema and functions from `frontend/convex/` to the running instance.
8. **Starts remaining services** — brings up the frontend, backend, and Mastra. These read the now-populated `.env` including the admin key.
9. **Streams logs** — tails all container logs so you can see what's happening. `Ctrl+C` to stop watching (containers keep running).
3. **Starts the local keychain bridge** — runs a host-side helper so Docker services can read/write this workspace's OS keychain entries.
4. **Starts the database layer** — brings up Postgres and Convex (self-hosted) first, since other services depend on them.
5. **Waits for Convex** — polls the Convex health endpoint until it's ready (up to 120s).
6. **Ensures the admin key** — if `CONVEX_SELF_HOSTED_ADMIN_KEY` is empty in `.env`, generates one automatically and writes it. If a key exists, validates it against the running Convex instance. If the key is stale (e.g. you ran `make clean` and wiped the database), it detects the mismatch and regenerates.
7. **Configures Convex auth** — sets `BIGSET_LOCAL_MODE=1` for the local app.
8. **Deploys Convex schema** — pushes the table schema and functions from `frontend/convex/` to the running instance.
9. **Starts remaining services** — brings up the frontend, backend, and Mastra. These read the now-populated `.env` including the admin key.
10. **Streams logs** — tails all container logs so you can see what's happening. `Ctrl+C` to stop watching (containers keep running).
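
Reviewer note: the "Ensures the admin key" step describes a three-way decision: generate when the key is empty, keep it when Convex accepts it, regenerate when it is stale. The sketch below shows that decision as a pure function. `adminKeyAction` and the `isAcceptedByConvex` callback are hypothetical names; the actual logic lives in the Makefile and scripts, which are not shown here.

```javascript
// Hypothetical helper mirroring the documented admin-key behaviour.
// `isAcceptedByConvex` stands in for a check against the running Convex instance.
function adminKeyAction(currentKey, isAcceptedByConvex) {
  if (!currentKey) {
    return "generate"; // CONVEX_SELF_HOSTED_ADMIN_KEY is empty in .env
  }
  if (isAcceptedByConvex(currentKey)) {
    return "keep"; // existing key validates against the running instance
  }
  return "regenerate"; // stale key, e.g. after `make clean` wiped the database
}

console.log(adminKeyAction("", () => true));
```

Keeping the decision separate from the side effects (writing `.env`, calling Convex) makes each of the three paths easy to test.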

### Commands

@@ -188,8 +211,7 @@ Other commands you might use during development:

| Problem | What happens |
|---------|-------------|
| Missing `.env` | Error: "Run: cp .env.example .env" |
| Missing API key | Error tells you exactly which key to set |
| Missing `.env` | `make dev` creates a local one automatically |
| Stale admin key (after `make clean`) | Detected automatically, regenerated |
| Containers already running | No-op for running services, starts any that are missing |
| Convex won't start | Error after 120s timeout — check Docker is running |
@@ -202,12 +224,10 @@ If you want a completely fresh start: `make clean` then `make dev`.

| Variable | Required | Where to get it |
|----------|----------|----------------|
| `TINYFISH_API_KEY` | ✅ | [tinyfish.ai](https://agent.tinyfish.ai/api-keys?utm_source=github&utm_medium=organic&utm_campaign=bigset-developer-2026q2) → API Keys |
| `OPENROUTER_API_KEY` | ✅ | openrouter.ai → Settings → Keys |
| `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY` | ✅ | Clerk dashboard → API Keys |
| `CLERK_SECRET_KEY` | ✅ | Clerk dashboard → API Keys |
| `CLERK_JWT_ISSUER_DOMAIN` | ✅ | Clerk dashboard → Settings/Domains |
| `TINYFISH_API_KEY` | Optional | Usually entered in setup and stored in your OS keychain |
| `OPENROUTER_API_KEY` | Optional | Usually entered in setup and stored in your OS keychain |
| `CONVEX_SELF_HOSTED_ADMIN_KEY` | Auto | Auto-generated by `make dev` on first run |
| `LOCAL_KEYCHAIN_PORT`, `LOCAL_KEYCHAIN_TOKEN`, `BIGSET_LOCAL_WORKSPACE_ID` | Auto | Auto-generated by `make dev` for local OS keychain access |
| `RESEND_API_KEY` | Optional | For "dataset ready" emails. Leave blank to skip. |
| `NEXT_PUBLIC_POSTHOG_KEY` | Optional | For product analytics. Leave blank to disable. |
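
Reviewer note: the table marks `LOCAL_KEYCHAIN_PORT`, `LOCAL_KEYCHAIN_TOKEN`, and `BIGSET_LOCAL_WORKSPACE_ID` as auto-generated, but the generation code is outside this excerpt. One way to fill in missing values while preserving existing ones is sketched below. Everything here is an assumption: the function name, the placeholder port `4599`, the 32-byte hex token, and the UUID workspace ID are illustrative, not what `make dev` actually writes.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: returns a copy of `env` with any missing local
// keychain settings filled in. Existing values are never overwritten.
function ensureLocalKeychainSettings(env) {
  return {
    ...env,
    // Placeholder port chosen for illustration only.
    LOCAL_KEYCHAIN_PORT: env.LOCAL_KEYCHAIN_PORT || "4599",
    // Assumed format: 32 random bytes, hex-encoded (64 characters).
    LOCAL_KEYCHAIN_TOKEN: env.LOCAL_KEYCHAIN_TOKEN || crypto.randomBytes(32).toString("hex"),
    // Assumed format: a random UUID per workspace.
    BIGSET_LOCAL_WORKSPACE_ID: env.BIGSET_LOCAL_WORKSPACE_ID || crypto.randomUUID(),
  };
}

console.log(Object.keys(ensureLocalKeychainSettings({})));
```

Generating values only when they are absent keeps repeated runs idempotent, which matches how the README describes `make dev`.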

@@ -219,7 +239,7 @@ If you want a completely fresh start: `make clean` then `make dev`.
|-------|------|
| Frontend | Next.js 16, React 19, Tailwind 4 |
| Backend | Fastify, TypeScript (agent runner) |
| Auth | [Clerk](https://clerk.com) |
| Auth | Local auth |
| Database | [Convex](https://convex.dev) (self-hosted) |
| Data Collection | [TinyFish](https://www.tinyfish.ai?utm_source=github&utm_medium=organic&utm_campaign=bigset-developer-2026q2) APIs (Search, Fetch, Browser) |
| AI orchestration | [Mastra](https://mastra.ai) workflows + [Vercel AI SDK](https://sdk.vercel.ai) + [OpenRouter](https://openrouter.ai) → Claude Sonnet (schema inference + populate agent) |