GitHub - 0bserver07/bourbaki: An autonomous agent for mathematical reasoning and proof

An autonomous agent for mathematical reasoning and proof.

How It Works · Quick Start · Tools · Skills · Autonomous Mode · Commands

Claude Code gives an LLM a shell and dev tools so it can write and run code. Bourbaki does the same thing for math: it gives an LLM a computer algebra system (SymPy), a proof assistant (Lean 4), and research APIs (OEIS, arXiv).

You ask a question in the TUI, the agent computes, verifies, looks things up, and streams the answer back. If it writes a proof, it can formalize it. If it makes a claim, it can check it.

How It Works

You ask a question in the TUI
The backend agent reasons about the approach
It calls tools: SymPy for computation, Lean for verification, OEIS/arXiv for lookup
Results feed back into the agent, which iterates if needed
A scratchpad enforces limits and deduplicates repeated calls
The final answer streams back to the TUI as it's generated

The TUI is a pure display client. All reasoning, tool calls, and state live in the Python backend.

Quick Start

# Clone the repo
git clone https://github.com/0bserver07/bourbaki.git
cd bourbaki

# Start the backend
cd backend
pip install -e .
uvicorn bourbaki.main:app --reload --port 8000

# In another terminal — start the TUI
bun install
bun start

The TUI connects to localhost:8000 by default. Override with BOURBAKI_BACKEND_URL.

Prerequisites

Python 3.11+
Bun v1.0+
An LLM API key (set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY)
Lean 4 with Mathlib (optional, for formal verification)

Tools

Tool	What it does
Symbolic Compute	Native SymPy: simplification, integration, solving, 30+ operations
Lean Prover	Lean 4 + Mathlib, machine-checked formal proofs
Sequence Lookup	OEIS: identify and explore integer sequences
Paper Search	arXiv: find relevant papers and results
Web Search	Exa: search the web for mathematical references

Skills

Skills are proof techniques loaded from SKILL.md files. They tell the agent how to approach a specific type of proof step by step, instead of letting it improvise.

21 built-in skills across five categories:

Basic: induction, strong induction, direct proof, contradiction, pigeonhole, counting
Analysis: epsilon-delta, convergence tests, sequence limits, inequality chains
Geometry: coordinate proof, synthetic construction, transformations
Algebra: group homomorphisms, ring ideals, polynomials
Advanced: extremal arguments, probabilistic method, conjecture exploration, formalization, proof explanation

Skills can be added at three levels: built-in (src/skills/), user (~/.bourbaki/skills/), or project (.bourbaki/skills/).

Autonomous Mode

Long-running proof search via a proposer-builder-reviewer loop backed by a warm LeanREPLSession. One proposal per iteration, bounded by max_iterations (default 50, 8 for interactive). Every reported solve is gated by a lean_prover whole-file compile — no REPL-only claims.

Drive the loop from backend/bourbaki/benchmarks/minif2f.py::attempt_proof_loop or from the FastAPI /query endpoint with use_loop=True. (The TUI's /prove <id> command still points at the legacy /autonomous/start route, which now returns HTTP 410 Gone — the legacy pipeline was deleted in commit 2113629. Rewiring the TUI to the new loop is tracked separately.)

Multi-provider routing and fanout

The proposer defaults to a single GLM-5.1 call. It can also route to other providers, or race a parallel fanout of legs across several providers and take the first valid (or first compile-verified) proof.

A model string's leading prefix selects the provider:

Prefix	Provider
`glm:`	z.ai Anthropic-compat endpoint (default)
`glm-oai:`	z.ai OpenAI-compat endpoint
`ollama:`	local Ollama daemon
`ollama-cloud:`	Ollama Cloud (non-reasoning models)
`ollama-native:`	Ollama Cloud `/api/chat` (reasoning models — sends `think: false`)
`ollama-stream:`	Ollama Cloud `/api/chat`, streaming with early-cancel

The benchmark runners backend/scripts/run_minif2f_subset.py and backend/scripts/run_putnam_loop.py expose this through --proposer-models (an explicit fanout spec), --smart-fanout {minimal,balanced,max} (a fanout built from whichever API keys are set), --auto-route (a per-problem fanout chosen from the problem's source prefix), --reviewer-models (a multi-model reviewer fanout), and --race-mode {first_completed,first_verified}. Run either script with --help for the full flag list, or see ARCHITECTURE.md for how the fanout works.

Results

Verified pass rates on miniF2F valid (every solve confirmed by lean_prover standalone compile — see docs/REALITY_CHECK.md for the audit of the earlier REPL-only era):

Date	Approach	Verified	Sample
2026-02-22 (audit)	v0.2.1 code, lean_prover-gated	6.2% (15/244)	full 244
2026-03-08 (v0.2.2)	+ REPL pipe-recovery + tactic blocklist	25.8% (63/244)	full 244
2026-04-01	+ HILBERT decomposer + in-context solving	50.0% (5/10)	10-problem
2026-04-25	proposer-builder-reviewer loop (GLM-5.1)	90.0% (9/10)	10-problem · 0 false positives
2026-05-09	same loop on a wider sample	62.9% (22/35) †	35-problem stratified · 0 false positives

† This number is a lower bound. The reviewer's final-gate lean_prover call was using a 30s default timeout in tools/lean_prover.py, which is too short for standalone lake env lean + import Mathlib (typically 60-180s). On the 2026-05-09 run, an unknown number of correct proofs were rejected by this timeout. Commit 7b07c07 raises the reviewer's timeout to 240s; a re-run with the fix is tracked in issue #19.

The 2026-02-17 v0.2.0 and 2026-02-18 v0.2.1 releases claimed 91.8% / 94.3% on the valid/test splits. Both numbers were inflated ~15× by REPL false positives and were retracted in the v0.2.2 audit (both GitHub releases now read "RETRACTED (inflated numbers)" in their titles). The current proposer-builder-reviewer architecture (commits 49211ce through 7b07c07) replaces the prior HILBERT-style pipeline; the full 244-problem run with the new architecture is pending (tracked in issue #14).

Known limitations

Run benchmarks from a normal terminal, not an ephemeral agent shell. The standalone scripts in backend/scripts/ (and their just wrappers) launch the asyncio loop inside a single Python process, but the launching shell still owns the process group. A bash-tool-managed session that kills its process group on exit will kill a long-running benchmark with it — use nohup/tmux/screen for overnight runs.
z.ai has two separate billing pools. glm: routes via the Anthropic-compat endpoint, glm-oai: via the OpenAI-compat one. Funds in one pool do NOT cover the other — verify your balance is on the side your prefix targets (the loop defaults to glm:).
Reasoning models need the native endpoint. Ollama Cloud's OpenAI-compat endpoint ignores the think parameter, so reasoning models routed through ollama-cloud: emit unbounded reasoning and never return structured output. Use the ollama-native: or ollama-stream: prefix for reasoning models (Kimi K2.6, DeepSeek); ollama-stream: does not support the mathlib_search tool.
first_verified race mode trades latency for correctness. Under --race-mode first_verified, every proposer leg's proof is compiled with lean_prover before the leg can win. The compile adds 60-180s per leg on a cold cache, so a verified-race fanout finishes slower than a first_completed one; budget per-problem timeouts accordingly.
Pass@N shares a single REPL session across attempts. attempt_proof_pass_at_n runs the loop up to N times per problem but does not reset the Lean session between attempts. The lean4-repl has no clean :reset primitive, so state from attempt K can leak into attempt K+1. Documented in the function docstring; A/B tracked in #18.

Example Usage

Prove a theorem:

❯ Prove that the sum of the first n integers equals n(n+1)/2

⏺ Thinking...
⏺ Symbolic Compute (expression=Sum(k, (k, 1, n)))
  ⎿ Computed result
⏺ Lean Prover
  ⎿ ✓ Verified in 2.3s

Proof by induction. Base case: n = 1, sum = 1 = 1·2/2. ✓
Inductive step: assume ∑_{k=1}^{n} k = n(n+1)/2.
Then ∑_{k=1}^{n+1} k = n(n+1)/2 + (n+1) = (n+1)(n+2)/2. ∎

Compute symbolically:

❯ Factor 84 and find its divisors

⏺ Symbolic Compute (operation=factor, expression=84)
  ⎿ Computed result

84 = 2² × 3 × 7
Divisors: {1, 2, 3, 4, 6, 7, 12, 14, 21, 28, 42, 84}

Identify a sequence:

❯ What sequence is 1, 1, 2, 3, 5, 8, 13?

⏺ Sequence Lookup (query="1,1,2,3,5,8,13")
  ⎿ Found 1 results

A000045 — Fibonacci numbers: F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.

Commands

Command	What it does
`/help`	Show all commands
`/model <name>`	Switch LLM model
`/skills`	List available proof technique skills
`/problems`	Browse the problem database
`/prove <id>`	Start proof attempt (legacy TUI handler still POSTs to `/autonomous/start`, which now returns 410; use the `attempt_proof_loop` driver or `/query` with `use_loop=True` for the new loop)
`/pause`	Pause proof search (legacy, 410)
`/progress`	Show proof search progress (legacy, 410)
`/sessions`	List saved sessions
`/new`	Start a new session
`/export [format]`	Export last answer (latex, lean, markdown)
`/debug`	Toggle debug mode
`/clear`	Clear the screen

Architecture

src/                          React + Ink TUI (display client)
├── components/               UI components (Input, AgentEventView, AnswerView)
├── hooks/                    useAgentRunner (SSE bridge), useModelSelection
└── skills/                   21 SKILL.md proof technique files

backend/bourbaki/             Python backend (owns all state)
├── agent/                    Pydantic AI agent, prompts, scratchpad, event mapper
├── tools/                    SymPy, Lean 4, OEIS, arXiv, Web Search, Skills
├── sessions/                 Persistence + context compaction
├── prover/                   Proposer-builder-reviewer-memory loop
├── autonomous/               Phase-3 vestige — only `tactics.py` survives (blocklist)
├── benchmarks/               miniF2F + PutnamBench runners
├── problems/                 13 classic problems database
└── server/routes/            FastAPI endpoints (query, sessions, skills, ...)

Tech Stack

Backend: Python, FastAPI, Pydantic AI, SymPy, httpx
TUI: Bun, React + Ink, TypeScript
Verification: Lean 4 + Mathlib
Sequences: OEIS API
Papers: arXiv API

Credits

Named after Nicolas Bourbaki, the collective pseudonym of a group of mathematicians who tried to rewrite all of mathematics from scratch using set theory.

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 149 Commits
assets		assets
backend		backend
docs		docs
scripts		scripts
src		src
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
HANDOFF.md		HANDOFF.md
LICENSE		LICENSE
README.md		README.md
bun.lock		bun.lock
justfile		justfile
package.json		package.json
pyrightconfig.json		pyrightconfig.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How It Works

Quick Start

Prerequisites

Tools

Skills

Autonomous Mode

Multi-provider routing and fanout

Results

Known limitations

Example Usage

Commands

Architecture

Tech Stack

Credits

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

How It Works

Quick Start

Prerequisites

Tools

Skills

Autonomous Mode

Multi-provider routing and fanout

Results

Known limitations

Example Usage

Commands

Architecture

Tech Stack

Credits

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages