Merged
30 commits
4727f1c
Unify winrate HF upload to hf repo with winrate_dir
warner-benjamin Feb 27, 2026
526078e
Remove process manifest preflight
warner-benjamin Feb 28, 2026
2eac1af
Simplify process selection and delta handling
warner-benjamin Feb 28, 2026
0592eac
Add explicit process replace filters
warner-benjamin Feb 28, 2026
958df70
Split process CLI orchestration
warner-benjamin Feb 28, 2026
8928f33
Break up process row loading
warner-benjamin Feb 28, 2026
d31bcb2
Separate process workspace preparation
warner-benjamin Feb 28, 2026
11ad919
Harden process selection against stale broken runs
warner-benjamin Mar 1, 2026
10079ee
Use row identities in process aggregation
warner-benjamin Mar 1, 2026
c4a3855
check in files agents missed
warner-benjamin Mar 1, 2026
29fe167
Gate process runs by manifest missing pct
warner-benjamin Mar 1, 2026
8b4dc75
Report winrate dataset missingness
warner-benjamin Mar 1, 2026
e9b36aa
Tighten process missing-pct selection
warner-benjamin Mar 1, 2026
75f7d31
small fixes
warner-benjamin Mar 1, 2026
9c69916
improved results missing percent logic
warner-benjamin Mar 1, 2026
0a8f5bc
fixes and use hf_token env
warner-benjamin Mar 1, 2026
3ae2d98
bug fixes and improvements
warner-benjamin Mar 2, 2026
ab661a2
restart interupted uploads
warner-benjamin Mar 4, 2026
7971a94
Improve grading and xml parsing
warner-benjamin Mar 12, 2026
683fe4a
Improve negation handling in MCQ grading
warner-benjamin Mar 12, 2026
93f7894
Cleaner reasoning stripping
warner-benjamin Mar 12, 2026
60360c5
Improve MCQ answer text normalization
warner-benjamin Mar 12, 2026
b1d22f4
Tighten compact multi-answer detection
warner-benjamin Mar 16, 2026
7aab7a5
catch more edge cases
warner-benjamin Mar 17, 2026
97d3282
performance fix for long answers
warner-benjamin Mar 17, 2026
044f0b6
only use tex stripping if we detect tex-like text
warner-benjamin Mar 17, 2026
ac4fcb1
temp performance checking code
warner-benjamin Mar 17, 2026
62a5629
refactor
warner-benjamin Mar 18, 2026
2f057cf
refactor again
warner-benjamin Apr 19, 2026
21dd099
Merge remote-tracking branch 'origin/main' into improve_grading_parsing
warner-benjamin Apr 27, 2026
180 changes: 151 additions & 29 deletions docs/medarc-eval-process.md
@@ -5,7 +5,7 @@ Convert raw benchmark outputs into analysis-ready parquet files. This step prepa
## Quick Start

```bash
# Process all completed runs (uses defaults)
# Process all completed jobs (uses defaults)
medarc-eval process

# Specify directories explicitly
@@ -17,33 +17,35 @@ medarc-eval process --dry-run

## What Processing Does

1. **Discovers** completed jobs in `runs/raw/`
1. **Discovers** jobs in `runs/raw/` and filters by manifest status (default: `completed`)
2. **Extracts** results from each job's output files
3. **Normalizes** data into a consistent schema
4. **Writes** parquet files organized by environment and model
3. **Normalizes** data into a fixed output schema
4. **Writes** parquet files organized by model and environment
5. **Creates** an index (`env_index.json`) for downstream tools

### Output Structure

```
runs/processed/
├── env_index.json # Dataset inventory for winrate/analysis
├── medqa/
│ ├── gpt-4o.parquet
│ └── gpt-4o-mini.parquet
├── pubmedqa/
│ ├── gpt-4o.parquet
│ └── gpt-4o-mini.parquet
├── gpt-4o/
│ ├── medqa.parquet
│ └── pubmedqa.parquet
├── gpt-4o-mini/
│ ├── medqa.parquet
│ └── pubmedqa.parquet
└── ...
```

On-disk model and env path components are slugified, so filenames may not exactly match raw ids.

## Common Options

| Flag | Description | Default |
|------|-------------|---------|
| `--runs-dir PATH` | Directory containing raw runs | `runs/raw` |
| `--output-dir PATH` | Where to write processed files | `runs/processed` |
| `--max-workers N` | Parallel processing threads | 4 |
| `--max-workers N` | Parallel worker processes | 4 |
| `--dry-run` | Show what would be processed | - |
| `--yes` | Skip confirmation prompts | - |
| `--exclude-dataset NAME` | Skip processing specific datasets/env ids (repeatable) | - |
@@ -53,16 +55,35 @@ runs/processed/

### By Completion Status

By default, only completed jobs are processed:
By default, `medarc-eval process` only selects jobs whose manifest status is `completed`.

```bash
# Include incomplete runs
medarc-eval process --process-incomplete
Note: successful jobs are written to `run_manifest.json` with `status: completed`.

# Filter by specific status
To override that default, pass one or more explicit status filters:

```bash
medarc-eval process --status completed --status failed
```

You can also gate partially complete outputs by missing `results.jsonl` rows:

```bash
# Default tolerance is 2.5 percent missing
medarc-eval process --max-results-missing-pct 2.5

# Effectively disable the gate
medarc-eval process --max-results-missing-pct 100
```

This gate uses manifest job metadata only:

- `expected_rows = num_examples * rollouts_per_example`
- `observed_rows = row_count`

It is computed per selected job record and enforced only on the latest selected run for each processed model/environment output. It does not use manifest `summary.completed` / `summary.total`, and it does not fall back to older runs if the latest one is too incomplete.

Selected records with missing `results.jsonl` fail processing immediately.
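The gate arithmetic can be sketched in a few lines of Python. This is an illustration, not the tool's implementation: the function names, the `<=` comparison against the threshold, and the clamping of surplus rows to zero are assumptions. The field names and the 2.5 default come from the text above.

```python
from typing import Optional

DEFAULT_MAX_MISSING_PCT = 2.5  # default tolerance stated in the docs


def missing_pct(num_examples: Optional[int],
                rollouts_per_example: Optional[int],
                row_count: Optional[int]) -> Optional[float]:
    """Percent of expected rows that are missing, or None when unknown."""
    if num_examples is None or rollouts_per_example is None or row_count is None:
        return None
    expected_rows = num_examples * rollouts_per_example
    if expected_rows <= 0:
        return None
    # Assumption: extra rows beyond the expected count are not counted as missing.
    missing = max(expected_rows - row_count, 0)
    return 100.0 * missing / expected_rows


def passes_gate(job: dict, max_missing_pct: float = DEFAULT_MAX_MISSING_PCT) -> bool:
    """Hypothetical check: records with unknown counts are treated as not gated."""
    pct = missing_pct(job.get("num_examples"),
                      job.get("rollouts_per_example"),
                      job.get("row_count"))
    return pct is None or pct <= max_missing_pct


job = {"num_examples": 100, "rollouts_per_example": 2, "row_count": 196}
print(missing_pct(100, 2, 196))  # 2.0
print(passes_gate(job))          # True
```

With 100 examples and 2 rollouts each, 200 rows are expected, so 196 observed rows are 2% missing and pass the default gate, while 190 rows (5% missing) would not.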

### Latest Runs Only

When multiple runs exist for the same (model, environment) pair, processing uses the latest by default.
@@ -86,13 +107,19 @@ Store common options in a YAML file:
```yaml
# process-config.yaml
runs_dir: runs/raw
output_dir: runs/processed
max_workers: 8
process_incomplete: false
exclude_datasets:
- med_dialog
exclude_models:
- deprecated-v1

process:
dir: processed
max_workers: 8
max_results_missing_pct: 2.5
exclude_datasets:
- med_dialog
exclude_models:
- deprecated-v1

winrate:
enabled: true
dir: winrate
```

```bash
@@ -101,14 +128,44 @@ medarc-eval process --config process-config.yaml

CLI flags override config values.

Supported config schema for `medarc-eval process`:

- Top-level `runs_dir`: raw run root.
- Top-level `process:`: process-specific defaults.
- Optional top-level `winrate:`: embedded post-process winrate step.
- Optional top-level `hf:`: shared HF settings. For embedded winrate uploads, use `hf.winrate_dir`.

Path shortcuts:

- `process.dir` is shorthand for `process.output_dir`, resolved relative to the parent of `runs_dir`.
- `winrate.dir` is shorthand for the embedded winrate output directory, resolved under the processed output dir.

Example:

```yaml
runs_dir: runs/raw

process:
dir: processed
max_workers: 8

winrate:
dir: scorecards

hf:
repo: your-org/medical-benchmarks
winrate_dir: scorecards/latest
```
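To show how the two path shortcuts would resolve for this example, here is a small Python sketch. `resolve_dirs` is a hypothetical helper, not part of `medarc-eval`, and it only covers relative paths. How absolute paths are handled is not described here.

```python
from pathlib import Path


def resolve_dirs(runs_dir: str, process_dir: str, winrate_dir: str):
    """Resolve process.dir and winrate.dir as described above (relative paths only)."""
    runs = Path(runs_dir)
    # process.dir is resolved relative to the parent of runs_dir.
    processed = runs.parent / process_dir
    # winrate.dir is resolved under the processed output dir.
    winrate = processed / winrate_dir
    return processed, winrate


processed, winrate = resolve_dirs("runs/raw", "processed", "scorecards")
print(processed.as_posix())  # runs/processed
print(winrate.as_posix())    # runs/processed/scorecards
```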

## Hugging Face Integration

Sync processed datasets to/from the Hugging Face Hub:

```yaml
# process-config.yaml
runs_dir: runs/raw
output_dir: runs/processed
process:
dir: processed

hf:
repo: your-org/medical-benchmarks
@@ -117,6 +174,8 @@ hf:
private: true
```

`hf.token` accepts either a literal token string or an environment reference like `$HF_TOKEN` / `${HF_TOKEN}`.
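One way such a value could be interpreted is sketched below. `resolve_token` and its regular expression are illustrative assumptions, as is returning `None` for an unset variable. The tool's actual behavior for unset variables is not documented here.

```python
import os
import re
from typing import Mapping, Optional

# Matches a whole-string reference of the form $NAME or ${NAME}.
_ENV_REF = re.compile(
    r"^\$(?:\{(?P<braced>[A-Za-z_][A-Za-z0-9_]*)\}|(?P<bare>[A-Za-z_][A-Za-z0-9_]*))$"
)


def resolve_token(value: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the literal token, or look up the referenced environment variable."""
    match = _ENV_REF.match(value)
    if match is None:
        return value  # treated as a literal token string
    name = match.group("braced") or match.group("bare")
    # Assumption: an unset variable resolves to None rather than raising.
    return env.get(name)


fake_env = {"HF_TOKEN": "example-secret"}
print(resolve_token("$HF_TOKEN", fake_env))    # example-secret
print(resolve_token("${HF_TOKEN}", fake_env))  # example-secret
print(resolve_token("literal-token", fake_env))  # literal-token
```

Referencing the environment keeps the secret out of the YAML file, which matters when the config is checked into version control.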

### Pull Before Processing

```bash
@@ -128,8 +187,24 @@ medarc-eval process --hf-repo your-org/data --hf-pull-policy pull

# Start fresh (ignore remote)
medarc-eval process --hf-repo your-org/data --hf-pull-policy clean

# Resume a previously failed HF upload without pulling or cleaning
medarc-eval process --hf-repo your-org/data --hf-pull-policy continue-upload
```

`prompt` only prompts when the local processed dir is already non-empty. If the output dir is empty, process pulls the HF baseline immediately.

When `prompt` is used with a non-empty local processed dir, the menu may show:

- `pull`: download missing baseline data without deleting local files
- `clean`: redownload everything after deleting local files
- `upload`: keep local processed outputs and resume/upload pending HF artifacts

`upload` is shown only when local parquet files appear to be missing remotely or have a different remote `lfs.sha256`. Recovery uploads the union of:

- parquet files that were already pending before the current run started
- files touched by the current process run, including `env_index.json` and `dataset_infos.json` when rewritten

### Push After Processing

When `--hf-repo` is set, processed files are automatically uploaded after completion.
@@ -139,10 +214,10 @@ When `--hf-repo` is set, processed files are automatically uploaded after comple
Process and compute win rates in one step:

```bash
medarc-eval process --winrate winrate-config.yaml
medarc-eval process --config process-config.yaml
```

This runs `medarc-eval winrate` automatically after processing completes.
This runs `medarc-eval winrate` automatically after processing completes when the config contains a `winrate:` section.

## Example Workflows

Expand Down Expand Up @@ -180,18 +255,65 @@ medarc-eval process
# env_index.json tracks what's already processed
```

Incremental skipping only reuses an existing parquet when its footer metadata `source_runs` still matches the newly selected run ids and the existing row count still matches `env_index.json`.
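The reuse condition reduces to two equality checks. The sketch below is hypothetical: `can_reuse` is an illustrative name, and comparing run ids as unordered sets is an assumption about how "matches" is defined.

```python
from typing import Iterable


def can_reuse(existing_source_runs: Iterable[str],
              selected_run_ids: Iterable[str],
              parquet_rows: int,
              indexed_rows: int) -> bool:
    """True only if both the run ids and the row counts still agree."""
    # Assumption: order of run ids does not matter.
    same_runs = set(existing_source_runs) == set(selected_run_ids)
    same_rows = parquet_rows == indexed_rows
    return same_runs and same_rows


print(can_reuse(["run-a"], ["run-a"], 500, 500))  # True
print(can_reuse(["run-a"], ["run-b"], 500, 500))  # False
```

If either check fails, the existing parquet cannot be treated as current for the newly selected runs.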

### Replace Existing Outputs

Rebuild existing outputs for specific models or datasets without using `--clean`:

```bash
# Rebuild every processed dataset for one model
medarc-eval process --replace-model gpt-4o

# Rebuild every model for one dataset
medarc-eval process --replace-env medqa

# Rebuild only the intersection
medarc-eval process --replace-model gpt-4o --replace-env medqa
```

When both flags are present, processing only rebuilds outputs that match both filters.
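The filter semantics can be illustrated with a short Python sketch. `should_rebuild` is a hypothetical helper, and treating an empty filter as "match everything on that axis" is inferred from the three commands above rather than taken from the implementation.

```python
from typing import Collection


def should_rebuild(model: str, env: str,
                   replace_models: Collection[str],
                   replace_envs: Collection[str]) -> bool:
    """Decide whether one (model, env) output is selected for replacement."""
    if not replace_models and not replace_envs:
        return False  # no replace filters given, nothing is force-rebuilt
    model_ok = not replace_models or model in replace_models
    env_ok = not replace_envs or env in replace_envs
    # With both filters present, an output must match both.
    return model_ok and env_ok


print(should_rebuild("gpt-4o", "medqa", {"gpt-4o"}, set()))         # True
print(should_rebuild("gpt-4o", "pubmedqa", {"gpt-4o"}, {"medqa"}))  # False
print(should_rebuild("gpt-4o", "medqa", {"gpt-4o"}, {"medqa"}))     # True
```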

## Troubleshooting

### "No runs found"

Check that:
1. `--runs-dir` points to the correct location
2. Runs have completed (check `run_manifest.json` status)
3. Use `--process-incomplete` if runs are still in progress
2. Runs have completed (check `run_manifest.json` `jobs[*].status`)
3. Use `--status pending` or `--status running` to include non-completed jobs

### Missing data in output

By default, only jobs with `completed` status are included. Use `--process-incomplete` to include partial results.
By default, only jobs with `completed` status are included. In addition, the `--max-results-missing-pct` gate fails processing if the latest selected job record is missing more than the allowed share (2.5% by default) of its expected `results.jsonl` rows, using manifest job fields:

- `row_count`
- `num_examples`
- `rollouts_per_example`

The gate is per selected record, not per whole run manifest. If the latest selected run for a model/dataset is too incomplete, processing fails fast instead of silently falling back to an older run. Records with unknown expected rows or unknown `row_count` are not gated.

Use `--max-results-missing-pct 100` to disable the gate, or pass explicit `--status` values to include other statuses.

### Integrity-check failures for existing parquet files

If processing stops with an error like:

```text
Existing processed output ... has N parquet rows but env_index.json records M.
```

the local processed snapshot is inconsistent. Fix it by rebuilding the affected output:

```bash
medarc-eval process --replace-model gpt-4o --replace-env medqa
```

Or rebuild everything:

```bash
medarc-eval process --clean --yes
```

## Next Steps
