Official Benchmark Pack (v0.1 and v0.2)

The Official Benchmark Pack is a minimal, stable benchmark pack that external researchers can run and compare against. It uses a fixed set of tasks (A–H), scales (S/M/L), official baselines per task, official coordination methods, and security suite (smoke/full) definitions. Results semantics v0.2 remains canonical. The CLI is backward-compatible.

v0.1 (benchmark_pack.v0.1.yaml): Default for deterministic and scripted runs.
v0.2 (benchmark_pack.v0.2.yaml): Used automatically when --pipeline-mode llm_live. Adds the live coordination evaluation protocol (required metadata, cost accounting, reproducibility expectations) and writes TRANSPARENCY_LOG/llm_live.json and live_evaluation_metadata.json.

Policy

Pack definition (default): policy/official/benchmark_pack.v0.1.yaml
Pack definition (llm_live): policy/official/benchmark_pack.v0.2.yaml (loaded when --pipeline-mode llm_live)

tasks: core (throughput_sla through insider_key_misuse), coordination (coord_scale, coord_risk), experimental (optional device_outage_surge, reagent_stockout)
scale_configs: S (small), M (medium), L (large) with num_agents_total, horizon_steps, etc.
baselines: For core tasks, baseline/agent ID (e.g. scripted_ops_v1, adversary_v1). For coord_scale/coord_risk, the value is the coordination method ID (e.g. kernel_scheduler_or_v0; the runner strips _v0 to get the method_id). See Glossary.
coordination_methods: list of method_ids (centralized_planner, hierarchical_hub_rr, llm_constrained)
security_suite: smoke (enabled by default), full (optional)
required_reports: security, safety_case, transparency_log

v0.2 adds live_coordination_evaluation_protocol with required_metadata_fields (model_id, temperature, tool_registry_fingerprint, allow_network), cost_accounting, and reproducibility_expectations (same seed/model/policy give comparable but not bit-identical results; attestation via hashes only).

Exact command

labtrust run-official-pack --out <dir> --seed-base N

--out (required): Output directory. Single folder ready to upload as "official pack result".
--seed-base (default: 100): Base seed for deterministic runs.
--smoke: Use smoke settings (fewer episodes, security smoke-only). Default when LABTRUST_OFFICIAL_PACK_SMOKE=1 or LABTRUST_PAPER_SMOKE=1.
--no-smoke: Disable smoke; run full pack (more episodes).
--full: Run full security suite instead of smoke-only.
--pipeline-mode: deterministic (default) or llm_live. With llm_live, the pack loads v0.2 policy, runs baselines with a live LLM backend, writes TRANSPARENCY_LOG/llm_live.json (prompt hashes, tool registry fingerprint, model identifiers, latency/cost stats; no sensitive prompt text) and live_evaluation_metadata.json (model_id, temperature, tool_registry_fingerprint, allow_network).
--llm-backend: When --pipeline-mode llm_live, which backend to use: openai_live, anthropic_live, or ollama_live. Default is openai_live if not set. Requires the corresponding extra (e.g. .[llm_anthropic]) and env vars (e.g. ANTHROPIC_API_KEY).
--allow-network: Allow network access (required for llm_live when using a remote API).
--include-coordination-pack: Run the coordination security pack into coordination_pack/ and build the lab report there (pack_summary.csv, pack_gate.md, SECURITY/coordination_risk_matrix., LAB_COORDINATION_REPORT.md, COORDINATION_DECISION.). Uses the pack policy coordination_pack.matrix_preset (default hospital_lab) when set; otherwise you can enable it in policy/official/benchmark_pack.v0.1.yaml with coordination_pack: { enabled: true, matrix_preset: hospital_lab }. See Coordination studies and Hospital lab full pipeline.

Example (smoke, fast):

LABTRUST_OFFICIAL_PACK_SMOKE=1 labtrust run-official-pack --out ./official_pack_result --seed-base 100

Example (full pack):

labtrust run-official-pack --out ./official_pack_result --seed-base 42 --no-smoke --full

Example (with coordination pack and lab report):

labtrust run-official-pack --out ./official_pack_result --seed-base 100 --include-coordination-pack

This produces coordination_pack/ under the output dir with pack_summary.csv, pack_gate.md, SECURITY/coordination_risk_matrix., LAB_COORDINATION_REPORT.md, and COORDINATION_DECISION.. See Coordination studies for the canonical flow.

Example (llm_live, external reproducibility):

labtrust run-official-pack --out ./official_pack_result --seed-base 100 --pipeline-mode llm_live --allow-network

Report model_id and temperature (from live_evaluation_metadata.json or env) for reproducibility. The validator and risk-register exporter accept pack output that includes TRANSPARENCY_LOG/llm_live.json and live_evaluation_metadata.json; these are linked in the risk register when present.

Cross-provider pack (run once per backend)

Single entry point for cross-provider comparison: Run the same official pack with multiple LLM backends and get comparable outputs plus a merged summary. No separate matrix script is required.

labtrust run-cross-provider-pack --out <dir> --providers openai_live,anthropic_live,ollama_live [--seed-base N] [--smoke | --no-smoke]

--out (required): Base output directory. Each provider gets a subdir: <out>/openai_live/, <out>/anthropic_live/, <out>/ollama_live/ with the same tree as a single run-official-pack (baselines/, SECURITY/, SAFETY_CASE/, TRANSPARENCY_LOG/, llm_live.json, live_evaluation_metadata.json).
--providers: Comma-separated list of backend IDs (openai_live, anthropic_live, ollama_live). Each must be installed and configured (API keys or local URL).
--seed-base: Base seed for deterministic runs (default 100).
--smoke / --no-smoke: Use smoke settings (fewer episodes) or full pack; default follows LABTRUST_OFFICIAL_PACK_SMOKE / LABTRUST_PAPER_SMOKE.

The command also writes <out>/summary_cross_provider.json and summary_cross_provider.md with model_id and mean_latency_ms per provider run for quick comparison.

Expected output tree

After a successful run, <out> contains:

<out>/
  baselines/
    results/
      throughput_sla_scripted_ops.json
      stat_insertion_scripted_ops.json
      qc_cascade_scripted_ops.json
      adversarial_disruption_adversary.json
      multi_site_stat_scripted_ops.json
      insider_key_misuse_insider.json
      coord_scale_kernel_scheduler_or.json
      coord_risk_kernel_scheduler_or.json
  SECURITY/
    attack_results.json
    coverage.json
    coverage.md
    reason_codes.md
    deps_inventory_runtime.json
    (securitization packet)
  SAFETY_CASE/
    safety_case.json
    safety_case.md
  TRANSPARENCY_LOG/
    log.json
    root.txt
    proofs/
    (or README.txt if no receipts yet)
    llm_live.json          (only when --pipeline-mode llm_live)
  coordination_pack/       (only when --include-coordination-pack or coordination_pack.enabled)
    pack_summary.csv
    pack_gate.md
    SECURITY/coordination_risk_matrix.csv, .md
    LAB_COORDINATION_REPORT.md
    COORDINATION_DECISION.v0.1.json, .md
    summary/
      sota_leaderboard.csv, sota_leaderboard.md     (main: key metrics, optional run metadata)
      sota_leaderboard_full.csv, sota_leaderboard_full.md
      method_class_comparison.csv, method_class_comparison.md
  pack_manifest.json
  live_evaluation_metadata.json   (only when --pipeline-mode llm_live)
  PACK_SUMMARY.md

Validation: Each results JSON under baselines/results/ conforms to policy/schemas/results.v0.2.schema.json. You can run labtrust summarize-results --in <out>/baselines/results/ --out <dir> and use the summarize module's validate_results_v02() (or the CLI validate path) to verify. The pack layout and filenames above are stable for tooling and scripts. When --in is a directory, summarization discovers: (1) any file matching **/results*.json, and (2) any *.json inside a subdirectory named results. So pointing --in at the pack output dir or at baselines/results/ will include all per-task result files.

baselines/results/
One results.v0.2-semantics JSON per task (task name + baseline suffix). Produced by the same logic as generate-official-baselines but driven by the pack policy.
SECURITY/
Security attack suite outputs (smoke or full) and securitization packet. Same layout as labtrust run-security-suite --out <dir>.
SAFETY_CASE/
Safety case (claim → control → test → artifact → command). Same as labtrust safety-case --out <dir>.
coordination_pack/ (when --include-coordination-pack or pack policy coordination_pack.enabled)
Coordination security pack output plus lab report: pack_summary.csv, pack_gate.md, SECURITY/coordination_risk_matrix., LAB_COORDINATION_REPORT.md, COORDINATION_DECISION., and under summary/: sota_leaderboard (main), sota_leaderboard_full (full metrics), method_class_comparison (with blocks_mean, attack_success_rate_mean). See Coordination studies and Hospital lab key metrics.
TRANSPARENCY_LOG/
Global transparency log over episode digests (when receipts or _repr exist). Otherwise a README.txt explains how to produce it (e.g. export-receipts then transparency-log).
pack_manifest.json
Machine-readable manifest: version, pack_policy path, seed_base, timestamp, git_sha, smoke, pipeline_mode, allow_network, tasks, baselines, scale_configs, coordination_methods, required_reports, results_semantics.
TRANSPARENCY_LOG/llm_live.json (llm_live only)
Transparency log for llm_live: prompt_hashes, tool_registry_fingerprint, model_version_identifiers, latency_and_cost_statistics, per_task. Reviewable without exposing sensitive prompt text.
live_evaluation_metadata.json (llm_live only)
Required protocol fields: model_id, temperature, tool_registry_fingerprint, allow_network.
PACK_SUMMARY.md
Human-readable pack summary table and output tree (as above).

Verification

To check that a pack result is valid and that receipts (if present) verify:

labtrust verify-bundle --bundle <out>/receipts/<task>

Smoke tests (tests/test_official_pack_smoke.py) run the pack with LABTRUST_OFFICIAL_PACK_SMOKE=1 (or LABTRUST_PAPER_SMOKE=1), assert required folders exist, and run verify-bundle where applicable.

Relation to package-release

The paper_v0.1 profile of labtrust package-release references the official pack policy and includes a pack summary table (e.g. in RELEASE_NOTES or PACK_SUMMARY) so that the paper artifact documents the same benchmark pack definition.

Backward compatibility

Existing CLI commands (run-benchmark, generate-official-baselines, run-security-suite, safety-case, transparency-log, package-release) are unchanged. run-official-pack composes them into one folder for the community to replicate.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Official Benchmark Pack (v0.1 and v0.2)

Policy

Exact command

Cross-provider pack (run once per backend)

Expected output tree

Verification

Relation to package-release

Backward compatibility

FilesExpand file tree

official_benchmark_pack.md

Latest commit

History

official_benchmark_pack.md

File metadata and controls

Official Benchmark Pack (v0.1 and v0.2)

Policy

Exact command

Cross-provider pack (run once per backend)

Expected output tree

Verification

Relation to package-release

Backward compatibility