This tracks parity for `eval` only against:

- `braintrust-sdk/js` CLI (`npx braintrust eval`)
- `braintrust-sdk/py` CLI (`braintrust eval`)

`--push` parity is intentionally out of scope for now.

Recommendation: defer adding `--tsconfig` in `bt` for now.

Reasoning:

- With the current runner-first architecture (`--runner`, `BT_EVAL_JS_RUNNER`), we execute user code with their runtime (`tsx`, `bun`, `ts-node`, etc.), and that runtime already owns TS config discovery.
- Adding `--tsconfig` in `bt` only makes sense if `bt` itself compiles/bundles TS (like the legacy JS CLI bundling flow).
- If we later need explicit control, add a runner-agnostic pass-through (for example, `--runner-arg ...`) instead of hardcoding TS-specific behavior.

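The pass-through idea in the last bullet can be sketched as follows. This is a minimal illustration in Python, not `bt`'s implementation: `--runner-arg` is only a proposal in this document, and `build_runner_command` and its parameters are hypothetical names. The point is that `bt` forwards the extra arguments verbatim and leaves their interpretation to the runner.

```python
from typing import Sequence


def build_runner_command(
    runner: str,
    eval_files: Sequence[str],
    runner_args: Sequence[str] = (),
) -> list[str]:
    """Assemble the argv for a user-selected runner (hypothetical helper).

    runner_args are forwarded untouched, so the runner (not bt) decides what
    a TS-specific option such as a tsconfig path means.
    """
    if not runner:
        raise ValueError("runner must be a non-empty command name")
    if not eval_files:
        raise ValueError("at least one eval file is required")
    # Runner first, then the opaque pass-through args, then the eval files.
    return [runner, *runner_args, *eval_files]


# Example: ask tsx to use a specific tsconfig without bt knowing about TS.
cmd = build_runner_command(
    "tsx",
    ["evals/basic.eval.ts"],
    runner_args=["--tsconfig", "tsconfig.eval.json"],
)
print(cmd)
```

Because the arguments are opaque, the same mechanism would work for any runner without `bt` growing per-runtime flags.
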
Legend:

- `done`: implemented in `bt`
- `partial`: implemented but behavior differs
- `todo`: missing

| Feature / Flag | JS CLI | PY CLI | `bt` | Notes |
|---|---|---|---|---|
| Run eval files | yes | yes | done | Single-language per invocation currently enforced. |
| Local/no-upload mode | `--no-send-logs` | `--no-send-logs` | done (`--local`, alias `--no-send-logs`) | |
| Global auth/env passthrough | yes | yes | done | Via base args/env (`BRAINTRUST_API_KEY`, `BRAINTRUST_API_URL`, project). |
| Progress rendering | yes | yes | partial | `bt` consumes local SSE and renders Rust TUI/progress, but not full SDK parity yet. |
| `--list` (discover only) | yes | yes | todo | |
| `--filter` | yes | yes | todo | |
| `--jsonl` summaries | yes | yes | todo | |
| `--terminate-on-failure` | yes | yes | todo | |
| `--watch` | yes | yes | partial | Poll-based watcher with Node/Bun dependency hooks, Deno graph collection, and static JS import fallback. |
| `--verbose` | yes | parent flag | todo | |
| `--env-file` | yes | yes | todo | |
| `--dev` remote eval server | yes | yes | todo | Important for `test_remote_evals.py` parity. |
| `--dev-host` | yes | yes | todo | |
| `--dev-port` | yes | yes | todo | |
| `--dev-org-name` | yes | yes | todo | |
| `--num-workers` | n/a | yes | todo | Python-specific concurrency control. |
| Directory input expansion | yes | yes | todo | Today `bt` expects explicit files/extensions. |
| Mixed runtime selection | n/a | n/a | partial | Current `--runner` plus env vars; per-language runner matrix deferred. |

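Two rows above (single-language enforcement and explicit files/extensions) describe an input check that can be sketched as below. This is an illustrative Python sketch under stated assumptions, not `bt`'s code: the extension sets and the `detect_language` name are hypothetical, and the real list of accepted extensions may differ.

```python
from pathlib import Path

# Assumed extension sets, for illustration only.
JS_EXTENSIONS = {".js", ".mjs", ".cjs", ".ts", ".mts", ".cts", ".jsx", ".tsx"}
PY_EXTENSIONS = {".py"}


def detect_language(files: list[str]) -> str:
    """Return "js" or "py" for a list of explicit eval files.

    Raises ValueError when a path has no recognized extension (for example a
    directory) or when the files span more than one language.
    """
    languages = set()
    for name in files:
        suffix = Path(name).suffix.lower()
        if suffix in JS_EXTENSIONS:
            languages.add("js")
        elif suffix in PY_EXTENSIONS:
            languages.add("py")
        else:
            raise ValueError(f"unsupported eval input: {name}")
    if len(languages) != 1:
        found = ", ".join(sorted(languages)) or "none"
        raise ValueError(f"expected exactly one language, found: {found}")
    return languages.pop()


print(detect_language(["a.eval.ts", "b.eval.js"]))
```

Directory input expansion, once implemented, would run before a check like this and hand it the discovered file list.
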
Source repo scanned: `braintrust/tests/bt_services`

- `test_bundled_code.py`
  - `npx braintrust eval <file> --bundle --jsonl`
  - `npx braintrust eval <file> --bundle --jsonl --push`
- `test_function_hooks.py`
  - `npx braintrust eval <file> --bundle --jsonl --push`
- `test_remote_evals.py`
  - `npx braintrust eval <file> --dev --dev-port <port>`
  - `braintrust eval <file> --dev --dev-port <port>`
- `test_expect.py`
  - TS path: `npx braintrust eval --verbose --terminate-on-failure ... --env-file ... <file>`
  - PY path: `braintrust eval --verbose --num-workers 4 --terminate-on-failure ... --env-file ... <file>`

Current fixtures under `tests/evals/` now include:

- JS module system coverage: `eval-esm`, `eval-cjs`, `eval-ts-esm`, `eval-ts-cjs`
- JS execution mode coverage: `entrypoint-basic`, `direct-basic`
- JS runtime compatibility coverage:
  - `eval-ts-esm` runs with both `tsx` and `bun` runners from one fixture
  - `eval-bun` covers Bun-only APIs (`bun`, `bun:sqlite`, `Bun.file`)
- Python import behavior coverage: `basic`, `local_import`, `relative`, `absolute`

These cover the major interoperability scenarios from `braintrust-sdk/js/cli-tests`, plus Python import quirks that show up in the `braintrust` expect tests.

- Add CLI flags needed by `test_expect.py`: `--verbose`, `--terminate-on-failure`, `--env-file`, `--num-workers`.
- Add evaluation control flags: `--list`, `--filter`, `--jsonl`.
- Add remote mode: `--dev`, `--dev-host`, `--dev-port`, `--dev-org-name`.
- Add directory discovery and glob matching parity.
- Tighten parity output for progress/summary formatting.
- Add fixture cases that directly assert new flags as they are implemented (especially `--list`, `--filter`, `--jsonl`, `--dev`).
- Add negative fixtures for incompatible runtime/file combos (expected failures).
- Keep runtime matrix in one fixture via `runners` to avoid file duplication across runtimes.
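
The last item, keeping the runtime matrix in one fixture, might look like the sketch below. It is a Python illustration under assumptions, not the actual test harness: the fixture config shape (a mapping with an optional `runners` list) and the `expand_cases` name are hypothetical, and a fixture without `runners` is assumed to run once with the default runner.

```python
from typing import Optional


def expand_cases(fixtures: dict[str, dict]) -> list[tuple[str, Optional[str]]]:
    """Expand each fixture into one (fixture, runner) test case per runner."""
    cases: list[tuple[str, Optional[str]]] = []
    for name, config in fixtures.items():
        # No "runners" entry means a single case with the default runner (None).
        for runner in config.get("runners") or [None]:
            cases.append((name, runner))
    return cases


# Assumed fixture configs, mirroring the fixtures named in this document.
fixtures = {
    "eval-ts-esm": {"runners": ["tsx", "bun"]},
    "eval-bun": {"runners": ["bun"]},
    "eval-cjs": {},
}
cases = expand_cases(fixtures)
for fixture, runner in cases:
    print(fixture, runner or "default")
```

With this shape, adding a runtime to the matrix is a one-line change to the fixture's `runners` list rather than a copy of the fixture files.
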