Skip to content

gambitinc/baml-experimental

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2,985 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

baml-experimental

A fork of BAML — an open-source typed language for building reliable LLM workflows and agents. This repository tracks the upstream project and layers on four experimental capabilities.

Read about upstream BAML first: docs.boundaryml.com and the main repo at github.com/BoundaryML/baml.

Everything upstream still works exactly as documented — this fork is strictly additive. The experimental features live alongside the standard BAML workspace.


What this fork adds

1. Tower of BAML — nondeterministic verifiable transpilation

./tower_of_baml/ — a CLI that transpiles a Python function into BAML and then uses Z3 to prove the two are equivalent for every possible input. It generates candidate BAML with claude -p and rejects any candidate that isn't formally verified by an SMT solver.

Four verification modes:

  • arithmetic — integer functions, Z3 integer theory
  • loopwhile loops with LLM-generated loop invariants and Hoare-triple discharge
  • string — string-concatenation functions, Z3 string theory
  • http — OpenAI-style chat-completion calls via http.client, verified by rendering both sides with a sentinel and proving byte-identity of the outgoing HTTP request via Z3 string theory

Based on the LLMLift methodology (Bhatia et al., NeurIPS 2024, arXiv:2406.03003). One verified candidate usually lands on the first LLM attempt; the Z3 unsat result is a proof of equivalence over all inputs, not a batch of passing tests.

cd tower_of_baml && cargo build --release
./target/release/tower examples/add.py --mode arithmetic
./target/release/tower examples/sum_to.py --mode loop \
    --pre "n >= 0" --spec "n*(n+1)/2"
./target/release/tower examples/build_url.py --mode string
./target/release/tower examples/classify.py --mode http

See tower_of_baml/README.md and tower_of_baml/docs/ARCHITECTURE.md for the full design.

Install pre-built binaries (macOS arm64/x86_64, Linux x86_64):

curl -sSf https://raw.githubusercontent.com/gambitinc/baml-experimental/canary/install.sh | bash

The installer detects your platform, downloads the matching release tarball, drops tower and city into ~/.local/bin/, and checks for the three runtime deps (z3, baml-cli, claude CLI). Maintainers cutting a release: see docs/RELEASE.md.

./city/ is the codebase-scale companion: walks an entire Python tree, auto-classifies each file into a tower mode (or skips it), runs tower on every candidate, and mirrors the result into a sibling BAML tree — the original source is untouched. Loop-mode files declare their precondition and spec via @tower-pre: / @tower-spec: docstring annotations; everything else is fully automatic.

cd city && cargo build --release
./target/release/city \
    ../tower_of_baml/tests/sample_codebase \
    /tmp/sample_baml -v

A small fixture codebase lives at tower_of_baml/tests/sample_codebase/ covering arithmetic, string, loop, and various non-transpilable patterns — useful for testing both city and tower together.

city in action

A realistic demo lives at examples/content_pipeline/ — a twelve-file research-assistant workflow (ingest/, agents/, utils/, nontranspilable/, config/). It covers every tower mode and every skip reason. The following is a literal play-by-play of running

./city/target/release/city \
    examples/content_pipeline \
    /tmp/pipeline_baml -v

Step 1 — Walk the source tree. city's WalkDir enumerates every file under content_pipeline/ depth-first. Directories are mirrored into the destination as empty stubs; the README.md is copied verbatim (not a .py file).

Step 2 — config/settings.py (no function). city's classifier parses the source, finds no top-level def, emits Mode2::Skip { reason: "no_top_level_function" }. Copied through to /tmp/pipeline_baml/config/settings.py unchanged.

[skip] config/settings.py: no_top_level_function

Step 3 — ingest/fetch_headline.py (http mode). The classifier walks the function body, sees http.client.HTTPSConnection(...) plus .request("POST", ...), and classifies as Mode2::Http. city shells out to tower --mode http, which spawns a Python subprocess with a monkey-patched http.client to capture the outgoing request, calls $build_request on the LLM-generated BAML candidate, normalizes both, partitions at a sentinel, and asks Z3 for a counterexample. Z3 returns unsat; the BAML is written to /tmp/pipeline_baml/ingest/fetch_headline.baml.

[run] ingest/fetch_headline.py → http (fetch_headline)
  ✓ ingest/fetch_headline.py

Step 4 — agents/classify_topic.py (http-static mode, openai SDK). The classifier walks the body, sees <client>.chat.completions.create(...), emits Mode2::HttpStatic { sdk: "openai" }. city dispatches to tower --mode http-static. Tower's openai contract extracts the call's kwargs as a symbolic JSON tree, substitutes a sentinel for the text parameter, canonicalizes to a matching byte pattern, invokes BAML's $build_request with the same sentinel, and Z3 proves byte-identity without ever executing the Python module.

    (sdk detected: openai)
[run] agents/classify_topic.py → http-static (classify_topic)
  ✓ agents/classify_topic.py

Step 5 — agents/compose_brief.py (http-static, openai SDK, multi-param via f-string). The function takes two string parameters topic and text, interleaved inside a single user message via f"Topic: {topic}\nArticle text:\n{text}". The static verifier assigns one sentinel to each parameter (xSNTNL000PLACEHLDRx, xSNTNL001PLACEHLDRx), produces a BodyNode::StringTemplate for the f-string, canonicalizes to "Topic: xSNTNL000PLACEHLDRx\nArticle text:\nxSNTNL001PLACEHLDRx", and partitions the body into a slot sequence: [Fixed, Param("topic"), Fixed, Param("text"), Fixed]. BAML emits the same slot shape via Topic: {{ topic }}\nArticle text:\n{{ text }}. Z3 builds Concat(p0, topic_var, p1, text_var, p2) on each side, asserts inequality, returns unsat. Proven over String × String.

    (sdk detected: openai)
[run] agents/compose_brief.py → http-static (compose_brief)
  ✓ agents/compose_brief.py

Step 6 — agents/summarize.py (http-static, LiteLLM). Classifier sees litellm.completion(...), emits Mode2::HttpStatic { sdk: "litellm" }. tower's litellm contract reads the model="openai/gpt-4o-mini" kwarg, splits on /, and dispatches to the openai endpoint contract with the prefix stripped from the model name. The resulting request routes to api.openai.com/v1/chat/completions — matching what BAML emits for a provider openai client. Proven.

    (sdk detected: litellm)
[run] agents/summarize.py → http-static (summarize)
  ✓ agents/summarize.py

Step 7 — agents/extract_entities.py (http-static, anthropic). Classifier sees <client>.messages.create(...) and emits Mode2::HttpStatic { sdk: "anthropic" }. tower's anthropic contract knows the endpoint is api.anthropic.com/v1/messages, auth is via x-api-key + anthropic-version, and the system kwarg is top-level (not part of messages). BAML emits system as a structured-content array ([{"type":"text","text":"..."}]); normalization unwraps that shape to plain string before comparison. Proven.

    (sdk detected: anthropic)
[run] agents/extract_entities.py → http-static (extract_entities)
  ✓ agents/extract_entities.py

Step 8 — utils/tally_matches.py (loop mode). Classifier strips the leading docstring, sees a while loop followed by a return, and reads @tower-pre: n >= 0 and @tower-spec: n*(n+1)/2 from the docstring. Emits Mode2::Loop. tower's loop verifier discharges three Hoare triples: initialization (invariant holds after let s = 0; let i = 1;), preservation (one body iteration preserves the invariant), and post-condition (invariant + loop-exit implies s == n*(n+1)/2). All three return unsat on their negations. The BAML includes the // @invariant: comment the LLM produced.

[run] utils/tally_matches.py → loop (tally_matches)
  ✓ utils/tally_matches.py

Step 9 — utils/estimate_reading_time.py (arithmetic). Classifier sees a single return with only int literals, int parameters, and +/-/*. Emits Mode2::Arithmetic. tower parses both sides into Z3 integer expressions, asserts py_expr != baml_expr, returns unsat. Commutativity, associativity, and distributivity are all handled by Z3 automatically — tower doesn't care whether the BAML wrote paragraphs * words_per_paragraph or words_per_paragraph * paragraphs.

[run] utils/estimate_reading_time.py → arithmetic (estimate_reading_time)
  ✓ utils/estimate_reading_time.py

Step 10 — utils/build_share_url.py (string). Return expression is a chain of str parameters and string literals joined by +. Classifier emits Mode2::String. tower translates both sides into Z3 Concat trees. Z3's string theory treats Concat as associative, so Python's left-associated Concat(Concat(Concat("https://", host), "/article/"), slug)... is provably equal to BAML's flat Concat("https://", host, "/article/", slug, ...) without any canonicalization. unsat. Proven.

[run] utils/build_share_url.py → string (build_share_url)
  ✓ utils/build_share_url.py

Step 11 — main.py (skip). First top-level function is run_cli(text_path, slug, token), whose body contains file I/O, multiple statement types, and calls into the orchestrator. Doesn't match any mode's expected shape. Skipped with reason unrecognized_function_shape. The file is copied verbatim to /tmp/pipeline_baml/main.py.

[skip] main.py: unrecognized_function_shape

Step 12 — nontranspilable/pipeline_orchestrator.py (skip). Module defines @dataclass and class PipelineOrchestrator, but no top-level def. Classifier fails the "first top-level function" lookup. Verdict: no_top_level_function. Copied through.

[skip] nontranspilable/pipeline_orchestrator.py: no_top_level_function

Step 13 — nontranspilable/http_probe.py (skip). The function DOES use http.client.HTTPSConnection, but its .request(...) call's first argument is the literal "HEAD", not "POST". tower's http mode only handles POST-shaped LLM calls. Classifier's has_http_shape rejects and falls through to unrecognized_function_shape. Copied through.

[skip] nontranspilable/http_probe.py: unrecognized_function_shape

Step 14 — write the report. After the walk, city serializes its RunReport (summary counters + per-file FileReport entries) to /tmp/pipeline_baml/.city_report.json and prints the summary table. The whole run took about 40 seconds (eight LLM calls at ~4-5 seconds each dominate the wall time; Z3 is single-digit milliseconds per proof).

=== city summary ===
python files:   12
  arithmetic:   1 ok / 0 fail
  string:       1 ok / 0 fail
  loop:         1 ok / 0 fail
  http:         1 ok / 0 fail
  http-static:  4 ok / 0 fail
  skipped:      4
non-py copied:  1

Final destination tree:

/tmp/pipeline_baml/
├── .city_report.json
├── README.md                                     (copied)
├── config/
│   └── settings.py                               (copied, not a candidate)
├── ingest/
│   └── fetch_headline.baml                       ✓ verified (http)
├── agents/
│   ├── classify_topic.baml                       ✓ verified (http-static openai)
│   ├── compose_brief.baml                        ✓ verified (http-static openai, 2 params)
│   ├── extract_entities.baml                     ✓ verified (http-static anthropic)
│   └── summarize.baml                            ✓ verified (http-static litellm)
├── utils/
│   ├── build_share_url.baml                      ✓ verified (string)
│   ├── estimate_reading_time.baml                ✓ verified (arithmetic)
│   └── tally_matches.baml                        ✓ verified (loop)
├── nontranspilable/
│   ├── http_probe.py                             (copied, HEAD not POST)
│   └── pipeline_orchestrator.py                  (copied, class-only)
└── main.py                                       (copied, multi-statement CLI)

Everything under /tmp/pipeline_baml/*.baml is formally verified to be equivalent to the corresponding Python for every possible input (integer inputs for arithmetic/loop; string inputs for string; string inputs for http and http-static; n ≥ 0 pre-condition for the loop). The original Python tree at examples/content_pipeline/ was never modified — city is read-only over the source.

See tower_of_baml/docs/verification.md for how each proof actually works.

2. BAML Hyperframes — a fast hypermedia framework

Status: in progress. A port of Hygen Hyperframes re-tuned for throughput under BAML-powered backends. Expected to live at ./baml_hyperframes/ once the first milestone lands.

3. New BAML SDKs — Swift, first

Status: Swift SDK in progress. Upstream BAML ships Python, TypeScript, Ruby, and Go clients via baml-cli generate. This fork adds Swift as a first-class generator target, to be followed by other languages driven by community demand. The codegen crate will live alongside baml_codegen_python/ and friends inside the BAML workspace.

4. Live BEP & changelog viewer

Status: design phase. Upstream BAML uses beps/ for RFC-style BAML Enhancement Proposals and a root CHANGELOG.md for release notes. This fork ships a lightweight viewer that surfaces both streams in the developer's editor / browser with diffing, status filters, and cross- linking. Target: keep the authors and reviewers of language changes in closer contact with end users during proposal review.


Layout

baml-experimental/
├── baml/                     # upstream BAML language + compiler workspace (unchanged)
├── baml_language/            # upstream BAML language (unchanged)
├── baml-cli/                 # upstream CLI (unchanged)
├── beps/                     # upstream BAML Enhancement Proposals
├── CHANGELOG.md              # upstream changelog
├── ...                       # everything else from upstream
│
├── tower_of_baml/            # [NEW] Python→BAML transpiler + Z3 verifier
│   ├── README.md
│   ├── Cargo.toml
│   ├── baml_reference.md     # language reference embedded in LLM prompts
│   ├── docs/ARCHITECTURE.md  # pipeline, SMT methodology, extension points
│   ├── src/                  # Rust source
│   │   ├── main.rs
│   │   ├── prompts.rs
│   │   ├── baml_parse.rs
│   │   ├── python_parse.rs
│   │   └── verify/           # three SMT verifiers
│   ├── examples/             # Python sources + verified BAML outputs
│   └── tests/sample_codebase/ # fixture corpus exercised by city
│
├── city/                     # [NEW] codebase-scale runner over tower
│   ├── README.md
│   ├── Cargo.toml
│   └── src/                  # main.rs (CLI), classify.rs, runner.rs
│
├── baml_hyperframes/         # [NEW, in progress] hypermedia framework
├── codegen_swift/            # [NEW, in progress] Swift SDK generator
└── bep_viewer/               # [NEW, in progress] live BEP + changelog UI

Everything outside the [NEW] directories is upstream BAML; it should track github.com/BoundaryML/baml's canary branch and carry no local modifications. If you want to build or use core BAML, use the unchanged upstream directories exactly as the upstream README describes.


Relationship to upstream

The four features above are maintained here until one of them is ready to propose upstream as a BEP. This fork is not intended to diverge from BAML permanently — each experiment should graduate to the main project or be retired based on usage and feedback.

Contributions welcome; please open an issue describing which of the four tracks you want to work on.

License

Inherits the upstream BAML license. See LICENSE.

About

An experimental baml port: testing nondeterministic transpilation, hyperframe changelog, baml mini projects, etc

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • Rust 68.1%
  • TypeScript 15.0%
  • Python 4.2%
  • MDX 3.6%
  • JavaScript 3.3%
  • Go 2.3%
  • Other 3.5%