35 changes: 35 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,35 @@
name: CI

on:
  push:
  pull_request:

jobs:
  test:
    name: Python tests
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.11", "3.12"]

    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: pip

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -r requirements.txt

      - name: Run tests
        run: python -m pytest -q

      - name: Compile Python sources
        run: python -m compileall scripts src
34 changes: 29 additions & 5 deletions README.md
@@ -14,7 +14,7 @@ The project asks two questions in parallel:
2. What does a defensible, auditable AI-assisted evaluation pipeline look like for
legal and institutional text research?

**Status (as of April 2026):** manuscript draft in preparation; 20,768 LLM response
**Status (as of May 3, 2026):** manuscript draft in preparation; 20,768 LLM response
cells collected across 22 model configurations; human-validation pilot underway; full
data release pending licensing, privacy, and source-distribution review.

@@ -110,6 +110,29 @@ model outputs are treated as evidence to be validated rather than accepted, and
decisions are documented through schemas, rubrics, provenance notes, and rerunnable
scripts.

## Relevance To Human-AI Oversight

Although the current public snapshot focuses on legal and institutional text,
LegalBenchPro is also a compact example of the research infrastructure needed for
human-AI oversight work. The project turns domain-heavy tasks into auditable model
evaluation artifacts, then separates model outputs, scoring rubrics, reviewer-facing
protocols, and validation samples so that human judgments can be compared against AI
judgments rather than hidden inside a single aggregate score.
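
For a concrete sense of what that comparison can look like, here is a minimal sketch of
per-criterion agreement between human and model scores. The CSV path and the
`criterion`, `human_score`, and `model_score` column names are hypothetical stand-ins;
the project's own validation tables are not part of this public snapshot.

```python
# Minimal sketch (not part of the repository): per-criterion agreement between
# human and model scores. The file name and column names are hypothetical.
import csv
from collections import defaultdict

def agreement_by_criterion(path="validation_pilot.csv"):
    """Return the fraction of items where human and model scores match, per criterion."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            criterion = row["criterion"]  # e.g. "factual_grounding"
            totals[criterion] += 1
            if row["human_score"] == row["model_score"]:
                matches[criterion] += 1
    return {criterion: matches[criterion] / totals[criterion] for criterion in totals}

if __name__ == "__main__":
    for criterion, rate in sorted(agreement_by_criterion().items()):
        print(f"{criterion}: {rate:.1%} agreement")
```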

The pieces most relevant to human-AI complementarity and scalable oversight are:

- **annotation protocol design:** task instructions, score anchors, and reviewer notes
are documented in `docs/ANNOTATION_PROTOCOL.md` and `docs/SCORING_RUBRIC.md`;
- **human validation staging:** pilot rows are selected for expert review before
benchmark-level claims are made;
- **audit trails:** AI-assisted coding, scoring, and release decisions are separated in
`docs/AI_WORKFLOW.md`, metadata files, and reproducible scripts;
- **model failure analysis:** the workflow tracks answer consistency, factual grounding,
citation relevance, and cross-setting transfer failures across model configurations
(see the consistency-check sketch after this list);
- **reproducible collaboration:** public samples, metadata, tests, and manuscript-status
notes make it easier for collaborators to inspect what is complete, what is private,
and what still needs validation.
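
The consistency check mentioned in the failure-analysis item above can be sketched in a
few lines: group answers for the same prompt across model configurations and flag the
prompts where configurations disagree. The record fields used here are hypothetical
stand-ins, since the full response tables are not in the public snapshot.

```python
# Minimal sketch, assuming records with hypothetical "prompt_id" and "answer" fields;
# the real response tables are not part of this public snapshot.
from collections import defaultdict

def inconsistent_prompts(records):
    """Return prompt IDs whose normalized answers differ across model configurations."""
    answers_by_prompt = defaultdict(set)
    for record in records:
        answers_by_prompt[record["prompt_id"]].add(record["answer"].strip().lower())
    return sorted(pid for pid, answers in answers_by_prompt.items() if len(answers) > 1)

records = [
    {"prompt_id": "q1", "model_config": "a", "answer": "Yes"},
    {"prompt_id": "q1", "model_config": "b", "answer": "No"},
    {"prompt_id": "q2", "model_config": "a", "answer": "2019"},
    {"prompt_id": "q2", "model_config": "b", "answer": "2019"},
]
print(inconsistent_prompts(records))  # ['q1']
```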

## Where To Start

For a quick review of the project, start with:
@@ -203,19 +226,20 @@ The repository includes a small test suite:
macOS/Linux:

```bash
export PYTHONPATH="$PWD/src"
python -m unittest discover -s tests
python -m pytest -q
python -m compileall scripts src
```

Windows PowerShell:

```powershell
$env:PYTHONPATH = "$PWD\src"
python -m unittest discover -s tests
python -m pytest -q
python -m compileall scripts src
```

The `pyproject.toml` test configuration points pytest at the `src/` package layout, so
manual `PYTHONPATH` setup is not required for local validation.

## Research Software Signals

This repository is intentionally organized as a research-engineering artifact, not only
21 changes: 21 additions & 0 deletions pyproject.toml
@@ -0,0 +1,21 @@
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "legalbenchpro"
version = "0.1.0"
description = "Reproducible public utilities for the LegalBenchPro benchmark preview."
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
"matplotlib>=3.8",
"openpyxl>=3.1.2",
]

[tool.setuptools.packages.find]
where = ["src"]

[tool.pytest.ini_options]
pythonpath = ["src"]
testpaths = ["tests"]
1 change: 1 addition & 0 deletions requirements.txt
@@ -1,2 +1,3 @@
matplotlib>=3.8
openpyxl>=3.1.2
pytest>=8