The Official Benchmark Pack is a minimal, stable benchmark pack that external researchers can run and compare against. It uses a fixed set of tasks (A–H), scales (S/M/L), official baselines per task, official coordination methods, and security suite (smoke/full) definitions. Results semantics v0.2 remains canonical. The CLI is backward-compatible.
- v0.1 (benchmark_pack.v0.1.yaml): Default for deterministic and scripted runs.
- v0.2 (benchmark_pack.v0.2.yaml): Used automatically when --pipeline-mode llm_live is set. Adds the live coordination evaluation protocol (required metadata, cost accounting, reproducibility expectations) and writes TRANSPARENCY_LOG/llm_live.json and live_evaluation_metadata.json.
Pack definition (default): policy/official/benchmark_pack.v0.1.yaml
Pack definition (llm_live): policy/official/benchmark_pack.v0.2.yaml (loaded when --pipeline-mode llm_live)
- tasks: core (throughput_sla through insider_key_misuse), coordination (coord_scale, coord_risk), experimental (optional device_outage_surge, reagent_stockout)
- scale_configs: S (small), M (medium), L (large) with num_agents_total, horizon_steps, etc.
- baselines: For core tasks, baseline/agent ID (e.g. scripted_ops_v1, adversary_v1). For coord_scale/coord_risk, the value is the coordination method ID (e.g. kernel_scheduler_or_v0; the runner strips _v0 to get the method_id). See Glossary.
- coordination_methods: list of method_ids (centralized_planner, hierarchical_hub_rr, llm_constrained)
- security_suite: smoke (enabled by default), full (optional)
- required_reports: security, safety_case, transparency_log
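The suffix handling mentioned for baselines can be sketched as follows. This is only an illustration, not the runner's code: the source states that the runner strips _v0 from coordination method IDs, while the generalization to any trailing _v<digits> and the helper names strip_version_suffix and result_filename are assumptions made here to match the per-task filenames the pack writes.

```python
import re

# Assumption: version suffixes always look like "_v" followed by digits.
_VERSION_SUFFIX = re.compile(r"_v\d+$")


def strip_version_suffix(baseline_id: str) -> str:
    """Drop a trailing _v<digits> suffix if present (hypothetical helper)."""
    return _VERSION_SUFFIX.sub("", baseline_id)


def result_filename(task: str, baseline_id: str) -> str:
    """Build '<task>_<baseline without suffix>.json' (hypothetical helper)."""
    return f"{task}_{strip_version_suffix(baseline_id)}.json"


print(strip_version_suffix("kernel_scheduler_or_v0"))        # kernel_scheduler_or
print(result_filename("throughput_sla", "scripted_ops_v1"))  # throughput_sla_scripted_ops.json
```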
v0.2 adds live_coordination_evaluation_protocol with required_metadata_fields (model_id, temperature, tool_registry_fingerprint, allow_network), cost_accounting, and reproducibility_expectations (same seed/model/policy give comparable but not bit-identical results; attestation via hashes only).
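A quick way to confirm that an llm_live result carries the required metadata is sketched below. It assumes that live_evaluation_metadata.json is a JSON object with the four required fields as top-level keys; the function names are hypothetical and are not part of the labtrust package.

```python
import json
from pathlib import Path

# Field names taken from required_metadata_fields in the v0.2 pack policy.
REQUIRED_METADATA_FIELDS = (
    "model_id",
    "temperature",
    "tool_registry_fingerprint",
    "allow_network",
)


def missing_metadata_fields(metadata: dict) -> list[str]:
    """Return the required fields that are absent from the metadata object."""
    return [name for name in REQUIRED_METADATA_FIELDS if name not in metadata]


def check_live_metadata(pack_dir: str) -> list[str]:
    """Load <pack_dir>/live_evaluation_metadata.json and list missing fields.

    Assumption: the required fields are top-level keys of the JSON object.
    """
    path = Path(pack_dir) / "live_evaluation_metadata.json"
    with path.open(encoding="utf-8") as fh:
        return missing_metadata_fields(json.load(fh))
```

An empty list from check_live_metadata("./official_pack_result") would mean all four fields are present.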
    labtrust run-official-pack --out <dir> --seed-base N

- --out (required): Output directory. Single folder ready to upload as "official pack result".
- --seed-base (default: 100): Base seed for deterministic runs.
- --smoke: Use smoke settings (fewer episodes, security smoke-only). Default when LABTRUST_OFFICIAL_PACK_SMOKE=1 or LABTRUST_PAPER_SMOKE=1.
- --no-smoke: Disable smoke; run full pack (more episodes).
- --full: Run full security suite instead of smoke-only.
- --pipeline-mode: deterministic (default) or llm_live. With llm_live, the pack loads the v0.2 policy, runs baselines with a live LLM backend, and writes TRANSPARENCY_LOG/llm_live.json (prompt hashes, tool registry fingerprint, model identifiers, latency/cost stats; no sensitive prompt text) and live_evaluation_metadata.json (model_id, temperature, tool_registry_fingerprint, allow_network).
- --llm-backend: With --pipeline-mode llm_live, selects the backend: openai_live, anthropic_live, or ollama_live. Default is openai_live if not set. Requires the corresponding extra (e.g. .[llm_anthropic]) and env vars (e.g. ANTHROPIC_API_KEY).
- --allow-network: Allow network access (required for llm_live when using a remote API).
- --include-coordination-pack: Run the coordination security pack into coordination_pack/ and build the lab report there (pack_summary.csv, pack_gate.md, SECURITY/coordination_risk_matrix.csv and .md, LAB_COORDINATION_REPORT.md, COORDINATION_DECISION.v0.1.json and .md). Uses the pack policy's coordination_pack.matrix_preset (default hospital_lab) when set. Alternatively, enable the coordination pack in policy/official/benchmark_pack.v0.1.yaml with coordination_pack: { enabled: true, matrix_preset: hospital_lab }. See Coordination studies and Hospital lab full pipeline.
Example (smoke, fast):

    LABTRUST_OFFICIAL_PACK_SMOKE=1 labtrust run-official-pack --out ./official_pack_result --seed-base 100

Example (full pack):

    labtrust run-official-pack --out ./official_pack_result --seed-base 42 --no-smoke --full

Example (with coordination pack and lab report):

    labtrust run-official-pack --out ./official_pack_result --seed-base 100 --include-coordination-pack

This produces coordination_pack/ under the output directory with pack_summary.csv, pack_gate.md, SECURITY/coordination_risk_matrix.csv and .md, LAB_COORDINATION_REPORT.md, and COORDINATION_DECISION.v0.1.json and .md. See Coordination studies for the canonical flow.

Example (llm_live, external reproducibility):

    labtrust run-official-pack --out ./official_pack_result --seed-base 100 --pipeline-mode llm_live --allow-network

Report model_id and temperature (from live_evaluation_metadata.json or env) for reproducibility. The validator and risk-register exporter accept pack output that includes TRANSPARENCY_LOG/llm_live.json and live_evaluation_metadata.json; these are linked in the risk register when present.
Single entry point for cross-provider comparison: Run the same official pack with multiple LLM backends and get comparable outputs plus a merged summary. No separate matrix script is required.
    labtrust run-cross-provider-pack --out <dir> --providers openai_live,anthropic_live,ollama_live [--seed-base N] [--smoke | --no-smoke]

- --out (required): Base output directory. Each provider gets a subdirectory (<out>/openai_live/, <out>/anthropic_live/, <out>/ollama_live/) with the same tree as a single run-official-pack: baselines/, SECURITY/, SAFETY_CASE/, TRANSPARENCY_LOG/ (including llm_live.json), and live_evaluation_metadata.json.
- --providers: Comma-separated list of backend IDs (openai_live, anthropic_live, ollama_live). Each must be installed and configured (API keys or local URL).
- --seed-base: Base seed for deterministic runs (default 100).
- --smoke / --no-smoke: Use smoke settings (fewer episodes) or the full pack; the default follows LABTRUST_OFFICIAL_PACK_SMOKE / LABTRUST_PAPER_SMOKE.
The command also writes <out>/summary_cross_provider.json and summary_cross_provider.md with model_id and mean_latency_ms per provider run for quick comparison.
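If you want to inspect the per-provider runs yourself, one option is to read each provider's live_evaluation_metadata.json directly, as sketched below. The sketch relies only on the documented layout (<out>/<provider>/live_evaluation_metadata.json with model_id and temperature fields); the function name is hypothetical, and the schema of summary_cross_provider.json is deliberately not assumed.

```python
import json
from pathlib import Path


def collect_provider_metadata(out_dir: str, providers: list[str]) -> dict[str, dict]:
    """Gather model_id and temperature for each provider subdirectory.

    Assumption: model_id and temperature are top-level keys of the JSON file.
    Providers whose metadata file is absent are reported with status 'missing'.
    """
    rows: dict[str, dict] = {}
    for provider in providers:
        meta_path = Path(out_dir) / provider / "live_evaluation_metadata.json"
        if not meta_path.is_file():
            rows[provider] = {"status": "missing"}
            continue
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        rows[provider] = {
            "status": "ok",
            "model_id": meta.get("model_id"),
            "temperature": meta.get("temperature"),
        }
    return rows
```

For example, collect_provider_metadata("./cross_provider", ["openai_live", "ollama_live"]) returns one entry per provider, which makes it easy to spot a provider run that did not complete.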
After a successful run, <out> contains:
<out>/
  baselines/
    results/
      throughput_sla_scripted_ops.json
      stat_insertion_scripted_ops.json
      qc_cascade_scripted_ops.json
      adversarial_disruption_adversary.json
      multi_site_stat_scripted_ops.json
      insider_key_misuse_insider.json
      coord_scale_kernel_scheduler_or.json
      coord_risk_kernel_scheduler_or.json
  SECURITY/
    attack_results.json
    coverage.json
    coverage.md
    reason_codes.md
    deps_inventory_runtime.json
    (securitization packet)
  SAFETY_CASE/
    safety_case.json
    safety_case.md
  TRANSPARENCY_LOG/
    log.json
    root.txt
    proofs/
    (or README.txt if no receipts yet)
    llm_live.json (only when --pipeline-mode llm_live)
  coordination_pack/ (only when --include-coordination-pack or coordination_pack.enabled)
    pack_summary.csv
    pack_gate.md
    SECURITY/coordination_risk_matrix.csv, .md
    LAB_COORDINATION_REPORT.md
    COORDINATION_DECISION.v0.1.json, .md
    summary/
      sota_leaderboard.csv, sota_leaderboard.md (main: key metrics, optional run metadata)
      sota_leaderboard_full.csv, sota_leaderboard_full.md
      method_class_comparison.csv, method_class_comparison.md
  pack_manifest.json
  live_evaluation_metadata.json (only when --pipeline-mode llm_live)
  PACK_SUMMARY.md
Validation: Each results JSON under baselines/results/ conforms to policy/schemas/results.v0.2.schema.json. You can run labtrust summarize-results --in <out>/baselines/results/ --out <dir> and use the summarize module's validate_results_v02() (or the CLI validate path) to verify. The pack layout and filenames above are stable for tooling and scripts. When --in is a directory, summarization discovers: (1) any file matching **/results*.json, and (2) any *.json inside a subdirectory named results. So pointing --in at the pack output dir or at baselines/results/ will include all per-task result files.
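The two discovery rules can be approximated with pathlib as in the sketch below. It illustrates the rules as described and is not the summarize module's actual implementation; the function name is hypothetical, and treating the --in directory itself as a "results" directory when it is named results is an assumption made so that pointing at baselines/results/ behaves as documented.

```python
from pathlib import Path


def discover_result_files(in_dir: str) -> list[Path]:
    """Approximate the documented discovery rules for a directory input."""
    root = Path(in_dir).resolve()
    found: set[Path] = set()

    # Rule 1: any file matching **/results*.json
    found.update(p for p in root.rglob("results*.json") if p.is_file())

    # Rule 2: any *.json whose parent directory is named "results"
    for p in root.rglob("*.json"):
        if p.is_file() and p.parent.name == "results":
            found.add(p)

    # A set removes files matched by both rules; sort for stable output.
    return sorted(found)
```

Under these rules, both discover_result_files("<out>") and discover_result_files("<out>/baselines/results") pick up the per-task result files.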
- baselines/results/: One results.v0.2-semantics JSON per task (task name + baseline suffix). Produced by the same logic as generate-official-baselines but driven by the pack policy.
- SECURITY/: Security attack suite outputs (smoke or full) and securitization packet. Same layout as labtrust run-security-suite --out <dir>.
- SAFETY_CASE/: Safety case (claim → control → test → artifact → command). Same as labtrust safety-case --out <dir>.
- coordination_pack/ (when --include-coordination-pack or pack policy coordination_pack.enabled): Coordination security pack output plus lab report: pack_summary.csv, pack_gate.md, SECURITY/coordination_risk_matrix.csv and .md, LAB_COORDINATION_REPORT.md, COORDINATION_DECISION.v0.1.json and .md, and under summary/: sota_leaderboard (main), sota_leaderboard_full (full metrics), method_class_comparison (with blocks_mean, attack_success_rate_mean). See Coordination studies and Hospital lab key metrics.
- TRANSPARENCY_LOG/: Global transparency log over episode digests (when receipts or _repr exist). Otherwise a README.txt explains how to produce it (e.g. export-receipts then transparency-log).
- pack_manifest.json: Machine-readable manifest: version, pack_policy path, seed_base, timestamp, git_sha, smoke, pipeline_mode, allow_network, tasks, baselines, scale_configs, coordination_methods, required_reports, results_semantics.
- TRANSPARENCY_LOG/llm_live.json (llm_live only): Transparency log for llm_live: prompt_hashes, tool_registry_fingerprint, model_version_identifiers, latency_and_cost_statistics, per_task. Reviewable without exposing sensitive prompt text.
- live_evaluation_metadata.json (llm_live only): Required protocol fields: model_id, temperature, tool_registry_fingerprint, allow_network.
- PACK_SUMMARY.md: Human-readable pack summary table and output tree (as above).
To check that a pack result is valid and that receipts (if present) verify:
    labtrust verify-bundle --bundle <out>/receipts/<task>

Smoke tests (tests/test_official_pack_smoke.py) run the pack with LABTRUST_OFFICIAL_PACK_SMOKE=1 (or LABTRUST_PAPER_SMOKE=1), assert that the required folders exist, and run verify-bundle where applicable.
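A layout check in the spirit of the smoke test might look like the sketch below. The exact assertions in tests/test_official_pack_smoke.py are not shown here, so the lists of required directories and files are assumptions drawn from the output tree and from required_reports, and the function name is hypothetical.

```python
from pathlib import Path

# Assumption: these entries are always present after a successful run.
REQUIRED_DIRS = ("baselines/results", "SECURITY", "SAFETY_CASE", "TRANSPARENCY_LOG")
REQUIRED_FILES = ("pack_manifest.json", "PACK_SUMMARY.md")


def missing_pack_entries(out_dir: str) -> list[str]:
    """Return required directories and files that are absent under out_dir."""
    root = Path(out_dir)
    missing = [f"{d}/" for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing
```

An empty list indicates the expected layout is in place; it does not replace schema validation of the results or verify-bundle on receipts.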
The paper_v0.1 profile of labtrust package-release references the official pack policy and includes a pack summary table (e.g. in RELEASE_NOTES or PACK_SUMMARY) so that the paper artifact documents the same benchmark pack definition.
Existing CLI commands (run-benchmark, generate-official-baselines, run-security-suite, safety-case, transparency-log, package-release) are unchanged. run-official-pack composes them into one folder for the community to replicate.