Benchmark for statistically valid AI-scientist systems. It uses audit-closed protocols, transparency logs, and sequential inference to prevent false discoveries by autonomous research agents.
reproducible-science audit-log ai-agents p-hacking scientific-machine-learning scientific-discovery autonomous-research transparency-log statistical-validity e-values self-driving-lab ai-governance automated-science research-automation agentic-ai ai-scientist sequential-inference deterministic-replay e-process optional-stopping
Updated Mar 5, 2026 - Python