Skip to content

0bserver07/bourbaki

Repository files navigation

bourbaki - An autonomous agent for mathematical reasoning and proof.

An autonomous agent for mathematical reasoning and proof.

How It Works · Quick Start · Tools · Skills · Autonomous Mode · Commands

Python 3.11+ Bun 1.0+ Lean 4 Pydantic AI License


Claude Code gives an LLM a shell and dev tools so it can write and run code. Bourbaki does the same thing for math: it gives an LLM a computer algebra system (SymPy), a proof assistant (Lean 4), and research APIs (OEIS, arXiv).

You ask a question in the TUI, the agent computes, verifies, looks things up, and streams the answer back. If it writes a proof, it can formalize it. If it makes a claim, it can check it.

How It Works

Bourbaki agent loop — TUI, Backend, and Tools

  1. You ask a question in the TUI
  2. The backend agent reasons about the approach
  3. It calls tools: SymPy for computation, Lean for verification, OEIS/arXiv for lookup
  4. Results feed back into the agent, which iterates if needed
  5. A scratchpad enforces limits and deduplicates repeated calls
  6. The final answer streams back to the TUI as it's generated

The TUI is a pure display client. All reasoning, tool calls, and state live in the Python backend.

Quick Start

# Clone the repo
git clone https://github.com/0bserver07/bourbaki.git
cd bourbaki

# Start the backend
cd backend
pip install -e .
uvicorn bourbaki.main:app --reload --port 8000

# In another terminal — start the TUI
bun install
bun start

The TUI connects to localhost:8000 by default. Override with BOURBAKI_BACKEND_URL.

Prerequisites

  • Python 3.11+
  • Bun v1.0+
  • An LLM API key (set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY)
  • Lean 4 with Mathlib (optional, for formal verification)

Tools

Tool What it does
Symbolic Compute Native SymPy: simplification, integration, solving, 30+ operations
Lean Prover Lean 4 + Mathlib, machine-checked formal proofs
Sequence Lookup OEIS: identify and explore integer sequences
Paper Search arXiv: find relevant papers and results
Web Search Exa: search the web for mathematical references

Skills

Skills are proof techniques loaded from SKILL.md files. They tell the agent how to approach a specific type of proof step by step, instead of letting it improvise.

21 built-in skills across five categories:

  • Basic: induction, strong induction, direct proof, contradiction, pigeonhole, counting
  • Analysis: epsilon-delta, convergence tests, sequence limits, inequality chains
  • Geometry: coordinate proof, synthetic construction, transformations
  • Algebra: group homomorphisms, ring ideals, polynomials
  • Advanced: extremal arguments, probabilistic method, conjecture exploration, formalization, proof explanation

Skills can be added at three levels: built-in (src/skills/), user (~/.bourbaki/skills/), or project (.bourbaki/skills/).

Autonomous Mode

Long-running proof search via a proposer-builder-reviewer loop backed by a warm LeanREPLSession. One proposal per iteration, bounded by max_iterations (default 50, 8 for interactive). Every reported solve is gated by a lean_prover whole-file compile — no REPL-only claims.

Proposer-Builder-Reviewer loop

Drive the loop from backend/bourbaki/benchmarks/minif2f.py::attempt_proof_loop or from the FastAPI /query endpoint with use_loop=True. (The TUI's /prove <id> command still points at the legacy /autonomous/start route, which now returns HTTP 410 Gone — the legacy pipeline was deleted in commit 2113629. Rewiring the TUI to the new loop is tracked separately.)

Multi-provider routing and fanout

The proposer defaults to a single GLM-5.1 call. It can also route to other providers, or race a parallel fanout of legs across several providers and take the first valid (or first compile-verified) proof.

A model string's leading prefix selects the provider:

Prefix Provider
glm: z.ai Anthropic-compat endpoint (default)
glm-oai: z.ai OpenAI-compat endpoint
ollama: local Ollama daemon
ollama-cloud: Ollama Cloud (non-reasoning models)
ollama-native: Ollama Cloud /api/chat (reasoning models — sends think: false)
ollama-stream: Ollama Cloud /api/chat, streaming with early-cancel

The benchmark runners backend/scripts/run_minif2f_subset.py and backend/scripts/run_putnam_loop.py expose this through --proposer-models (an explicit fanout spec), --smart-fanout {minimal,balanced,max} (a fanout built from whichever API keys are set), --auto-route (a per-problem fanout chosen from the problem's source prefix), --reviewer-models (a multi-model reviewer fanout), and --race-mode {first_completed,first_verified}. Run either script with --help for the full flag list, or see ARCHITECTURE.md for how the fanout works.

Results

Verified pass rates on miniF2F valid (every solve confirmed by lean_prover standalone compile — see docs/REALITY_CHECK.md for the audit of the earlier REPL-only era):

Date Approach Verified Sample
2026-02-22 (audit) v0.2.1 code, lean_prover-gated 6.2% (15/244) full 244
2026-03-08 (v0.2.2) + REPL pipe-recovery + tactic blocklist 25.8% (63/244) full 244
2026-04-01 + HILBERT decomposer + in-context solving 50.0% (5/10) 10-problem
2026-04-25 proposer-builder-reviewer loop (GLM-5.1) 90.0% (9/10) 10-problem · 0 false positives
2026-05-09 same loop on a wider sample 62.9% (22/35) 35-problem stratified · 0 false positives

† This number is a lower bound. The reviewer's final-gate lean_prover call was using a 30s default timeout in tools/lean_prover.py, which is too short for standalone lake env lean + import Mathlib (typically 60-180s). On the 2026-05-09 run, an unknown number of correct proofs were rejected by this timeout. Commit 7b07c07 raises the reviewer's timeout to 240s; a re-run with the fix is tracked in issue #19.

The 2026-02-17 v0.2.0 and 2026-02-18 v0.2.1 releases claimed 91.8% / 94.3% on the valid/test splits. Both numbers were inflated ~15× by REPL false positives and were retracted in the v0.2.2 audit (both GitHub releases now read "RETRACTED (inflated numbers)" in their titles). The current proposer-builder-reviewer architecture (commits 49211ce through 7b07c07) replaces the prior HILBERT-style pipeline; the full 244-problem run with the new architecture is pending (tracked in issue #14).

miniF2F verified pass-rate history

Known limitations

  • Run benchmarks from a normal terminal, not an ephemeral agent shell. The standalone scripts in backend/scripts/ (and their just wrappers) launch the asyncio loop inside a single Python process, but the launching shell still owns the process group. A bash-tool-managed session that kills its process group on exit will kill a long-running benchmark with it — use nohup/tmux/screen for overnight runs.
  • z.ai has two separate billing pools. glm: routes via the Anthropic-compat endpoint, glm-oai: via the OpenAI-compat one. Funds in one pool do NOT cover the other — verify your balance is on the side your prefix targets (the loop defaults to glm:).
  • Reasoning models need the native endpoint. Ollama Cloud's OpenAI-compat endpoint ignores the think parameter, so reasoning models routed through ollama-cloud: emit unbounded reasoning and never return structured output. Use the ollama-native: or ollama-stream: prefix for reasoning models (Kimi K2.6, DeepSeek); ollama-stream: does not support the mathlib_search tool.
  • first_verified race mode trades latency for correctness. Under --race-mode first_verified, every proposer leg's proof is compiled with lean_prover before the leg can win. The compile adds 60-180s per leg on a cold cache, so a verified-race fanout finishes slower than a first_completed one; budget per-problem timeouts accordingly.
  • Pass@N shares a single REPL session across attempts. attempt_proof_pass_at_n runs the loop up to N times per problem but does not reset the Lean session between attempts. The lean4-repl has no clean :reset primitive, so state from attempt K can leak into attempt K+1. Documented in the function docstring; A/B tracked in #18.

Example Usage

Prove a theorem:

❯ Prove that the sum of the first n integers equals n(n+1)/2

⏺ Thinking...
⏺ Symbolic Compute (expression=Sum(k, (k, 1, n)))
  ⎿ Computed result
⏺ Lean Prover
  ⎿ ✓ Verified in 2.3s

Proof by induction. Base case: n = 1, sum = 1 = 1·2/2. ✓
Inductive step: assume ∑_{k=1}^{n} k = n(n+1)/2.
Then ∑_{k=1}^{n+1} k = n(n+1)/2 + (n+1) = (n+1)(n+2)/2. ∎

Compute symbolically:

❯ Factor 84 and find its divisors

⏺ Symbolic Compute (operation=factor, expression=84)
  ⎿ Computed result

84 = 2² × 3 × 7
Divisors: {1, 2, 3, 4, 6, 7, 12, 14, 21, 28, 42, 84}

Identify a sequence:

❯ What sequence is 1, 1, 2, 3, 5, 8, 13?

⏺ Sequence Lookup (query="1,1,2,3,5,8,13")
  ⎿ Found 1 results

A000045 — Fibonacci numbers: F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.

Commands

Command What it does
/help Show all commands
/model <name> Switch LLM model
/skills List available proof technique skills
/problems Browse the problem database
/prove <id> Start proof attempt (legacy TUI handler still POSTs to /autonomous/start, which now returns 410; use the attempt_proof_loop driver or /query with use_loop=True for the new loop)
/pause Pause proof search (legacy, 410)
/progress Show proof search progress (legacy, 410)
/sessions List saved sessions
/new Start a new session
/export [format] Export last answer (latex, lean, markdown)
/debug Toggle debug mode
/clear Clear the screen

Architecture

src/                          React + Ink TUI (display client)
├── components/               UI components (Input, AgentEventView, AnswerView)
├── hooks/                    useAgentRunner (SSE bridge), useModelSelection
└── skills/                   21 SKILL.md proof technique files

backend/bourbaki/             Python backend (owns all state)
├── agent/                    Pydantic AI agent, prompts, scratchpad, event mapper
├── tools/                    SymPy, Lean 4, OEIS, arXiv, Web Search, Skills
├── sessions/                 Persistence + context compaction
├── prover/                   Proposer-builder-reviewer-memory loop
├── autonomous/               Phase-3 vestige — only `tactics.py` survives (blocklist)
├── benchmarks/               miniF2F + PutnamBench runners
├── problems/                 13 classic problems database
└── server/routes/            FastAPI endpoints (query, sessions, skills, ...)

Tech Stack

  • Backend: Python, FastAPI, Pydantic AI, SymPy, httpx
  • TUI: Bun, React + Ink, TypeScript
  • Verification: Lean 4 + Mathlib
  • Sequences: OEIS API
  • Papers: arXiv API

Credits

Named after Nicolas Bourbaki, the collective pseudonym of a group of mathematicians who tried to rewrite all of mathematics from scratch using set theory.

License

MIT License

Packages

 
 
 

Contributors