# LENS

LENS is a role-aware multi-agent grading pipeline for clinical summaries. The same summary is scored in parallel by three role-specific agents:
- Physician
- Triage Nurse
- Bedside Nurse
Each role scores the summary across 8 rubric dimensions on a 1-5 scale. The system then computes a role-level weighted overall score, a cross-role overall score, and an Orchestrator Disagreement view that shows how far the three role scores differ on each dimension.
## Features

- Parallel scoring by three role-specific agents
- Shared 8-dimension LENS rubric
- Two scoring modes:
  - `llm`: OpenAI model-based scoring
  - `heuristic`: local baseline scoring without API calls
- Per-role weighted overall scoring based on questionnaire-derived role priors
- Orchestrator validation, disagreement mapping, and score aggregation
- Human-readable and JSON outputs
## How it works

- Input a clinical summary.
- Load rubric definitions and role configurations.
- Run the three role agents in parallel.
- Validate each role scorecard.
- Build an `Orchestrator Disagreement` map for all 8 dimensions.
- Aggregate the role outputs into:
  - per-role scores
  - per-role overall scores
  - a final overall score across roles
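The parallel fan-out in the steps above can be sketched with the standard library. This is a minimal illustration: the function names, the stand-in agent logic, and the two-dimension rubric subset are assumptions, not the package's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

ROLES = ["Physician", "Triage Nurse", "Bedside Nurse"]
# Two of the 8 rubric dimensions, for brevity
DIMENSIONS = ["Factual Accuracy", "Relevant Chronic Problem Coverage"]

def score_role(role: str, summary: str) -> dict[str, float]:
    """Stand-in for one role agent: returns a 1-5 score per dimension."""
    return {dim: 4.0 for dim in DIMENSIONS}

def run_pipeline(summary: str) -> dict[str, dict[str, float]]:
    """Fan the same summary out to all three role agents in parallel."""
    with ThreadPoolExecutor(max_workers=len(ROLES)) as pool:
        futures = {role: pool.submit(score_role, role, summary) for role in ROLES}
        return {role: f.result() for role, f in futures.items()}
```

The orchestrator then validates each returned scorecard before building the disagreement map and aggregates.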
## Project structure

- `src/grading_pipeline/` — Python package (published on PyPI as `edlens`)
  - `cli.py` — command-line entrypoint and human-readable output formatting
  - `orchestrator.py` — multi-agent pipeline, validation, disagreement mapping, and aggregation
  - `llm_scoring.py` — LLM-based scoring logic
  - `scoring.py` — heuristic baseline scoring and score utilities
  - `openai_client.py` — minimal OpenAI Responses API client
  - `config.py` — rubric/role configuration loaders
  - `validation.py` — input validation
- `config/`
  - `lens_rubric.json` — 8 rubric dimensions and evaluation focus
  - `roles.json` — role agents, persona metadata, and `w_prior` weights
  - `role_profiles/` — role-specific LLM scoring profiles
- `schemas/agent_output.schema.json` — JSON Schema for structured agent output
- `docs/` — API reference (mkdocs + mkdocs-material)
- `tests/` — input-validation and orchestrator tests
- `Dockerfile` — container image for running the pipeline
## Requirements

- Python 3.12+
- OpenAI API key for `llm` mode
## Installation

```bash
pip install edlens
```

Or install from source with dev and docs extras:

```bash
pip install -e ".[dev,docs]"
```

The package has no runtime dependencies beyond the Python standard library.
## API key setup

If you want to run the LLM pipeline, you must use your own OpenAI API key.

Create a file named `.env` in the project root, which is the same folder as:

- `README.md`
- `config/`
- `grading_pipeline/`

Expected file location:

```
LENS Project/.env
```

Add the following line to `.env`:

```
OPENAI_API_KEY=your_openai_api_key_here
```

Optional override:

```
OPENAI_BASE_URL=https://api.openai.com/v1/responses
```

You can use `.env.example` as the template:

```bash
cp .env.example .env
```

Important notes:

- `.env` is already ignored by git and should not be committed.
- If you run with `--engine heuristic`, no API key is required.
- The code reads `OPENAI_API_KEY` from `.env` first, then falls back to your shell environment.
## Usage

Run with the default LLM mode:

```bash
python -m grading_pipeline --summary "Your summary here"
# or, using the installed CLI entry point:
lens --summary "Your summary here"
```

Run with the heuristic baseline:

```bash
lens --engine heuristic --summary "Your summary here"
```

Use a summary file:

```bash
lens --summary-file path/to/summary.txt
```

Output JSON instead of the human-readable report:

```bash
lens --summary "Your summary here" --format json --pretty
```

Select a specific model:

```bash
lens --model gpt-4o-mini --summary "Your summary here"
```

Adjust the disagreement threshold:

```bash
lens --gap-threshold 0.5 --summary "Your summary here"
```

## Docker

```bash
docker build -t lens .
docker run lens --summary "Your summary here" --engine heuristic
```

## Input validation

The CLI validates summary input before the scoring pipeline runs.
The summary must:
- be provided through `--summary` or `--summary-file`
- not be empty
- not be whitespace only
- be at least 30 characters after trimming whitespace
If the summary is invalid, the CLI exits with a non-zero code and no scoring call is made.
## Output

The human-readable output includes:

- role-by-role scores for all 8 dimensions
- a weighted `Overall` score for each role
- `Orchestrator Disagreement` showing score gaps per dimension
- a final `Overall Score` across all three roles
Example output shape:

```
----------------------------------------
Role-Aware Multi-Agent Grading Pipeline:
----------------------------------------
Physician:
Factual Accuracy: 5.0
Relevant Chronic Problem Coverage: 4.0
...
Overall: 4.12
----------------------------------------
Triage Nurse:
...
----------------------------------------
Bedside Nurse:
...
----------------------------------------
----------------------------------------
Orchestrator Disagreement:
----------------------------------------
Factual Accuracy: 1.0
Relevant Chronic Problem Coverage: 0.0
...
----------------------------------------
Overall Score: 4.0
```
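A report section of that shape can be rendered with plain string formatting. This is a sketch only; the actual formatter lives in `cli.py`, and the function name here is hypothetical.

```python
SEPARATOR = "-" * 40

def format_role_section(role: str, scores: dict[str, float], overall: float) -> str:
    """Render one role's block of the human-readable report."""
    lines = [SEPARATOR, f"{role}:"]
    lines += [f"{dim}: {score:.1f}" for dim, score in scores.items()]
    lines.append(f"Overall: {overall:.2f}")
    return "\n".join(lines)
```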
## Scoring

Each role has its own prior weights in `config/roles.json`.

Role-level overall score:

```
Role Overall = weighted average of the 8 dimension scores
```

Cross-role overall score:

```
Overall Score = average of the 3 role overall scores
```

Disagreement per dimension:

```
Gap = highest agent score - lowest agent score
```
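The three formulas above can be written directly. The function names are illustrative, not the package's actual API; the weights correspond to the `w_prior` values in `config/roles.json`.

```python
def role_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of a role's dimension scores (weights from w_prior)."""
    total_weight = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight

def overall_score(role_overalls: list[float]) -> float:
    """Unweighted average of the role-level overall scores."""
    return sum(role_overalls) / len(role_overalls)

def disagreement_gap(agent_scores: list[float]) -> float:
    """Per-dimension gap: highest agent score minus lowest agent score."""
    return max(agent_scores) - min(agent_scores)
```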
## Testing

Run the test suite:

```bash
pytest -q
```

Current tests cover:
- CLI summary input validation
- disagreement-map correctness
- validation and repair behavior
- conditional adjudication behavior
- weighted aggregation behavior
## Status

The current implementation includes:
- parallel three-role scoring
- role-aware weighting
- strict input validation
- orchestrator disagreement reporting
- weighted final score aggregation
- human-readable report formatting for demo and presentation use
