MedARC-AI · warner-benjamin · Apr 27, 2026 · Feb 27, 2026 · Feb 28, 2026 · Feb 28, 2026
diff --git a/docs/medarc-eval-process.md b/docs/medarc-eval-process.md
@@ -5,7 +5,7 @@ Convert raw benchmark outputs into analysis-ready parquet files. This step prepa
 ## Quick Start
 
 ```bash
-# Process all completed runs (uses defaults)
+# Process all completed jobs (uses defaults)
 medarc-eval process
 
 # Specify directories explicitly
@@ -17,33 +17,35 @@ medarc-eval process --dry-run
 
 ## What Processing Does
 
-1. **Discovers** completed jobs in `runs/raw/`
+1. **Discovers** jobs in `runs/raw/` and filters by manifest status (default: `completed`)
 2. **Extracts** results from each job's output files
-3. **Normalizes** data into a consistent schema
-4. **Writes** parquet files organized by environment and model
+3. **Normalizes** data into a fixed output schema
+4. **Writes** parquet files organized by model and environment
 5. **Creates** an index (`env_index.json`) for downstream tools
 
 ### Output Structure
 
 ```
 runs/processed/
 ├── env_index.json              # Dataset inventory for winrate/analysis
-├── medqa/
-│   ├── gpt-4o.parquet
-│   └── gpt-4o-mini.parquet
-├── pubmedqa/
-│   ├── gpt-4o.parquet
-│   └── gpt-4o-mini.parquet
+├── gpt-4o/
+│   ├── medqa.parquet
+│   └── pubmedqa.parquet
+├── gpt-4o-mini/
+│   ├── medqa.parquet
+│   └── pubmedqa.parquet
 └── ...
 ```
 
+On-disk model and env path components are slugified, so directory and file names may not exactly match the raw model and environment ids.
+
 ## Common Options
 
 | Flag | Description | Default |
 |------|-------------|---------|
 | `--runs-dir PATH` | Directory containing raw runs | `runs/raw` |
 | `--output-dir PATH` | Where to write processed files | `runs/processed` |
-| `--max-workers N` | Parallel processing threads | 4 |
+| `--max-workers N` | Parallel worker processes | 4 |
 | `--dry-run` | Show what would be processed | - |
 | `--yes` | Skip confirmation prompts | - |
 | `--exclude-dataset NAME` | Skip processing specific datasets/env ids (repeatable) | - |
@@ -53,16 +55,35 @@ runs/processed/
 
 ### By Completion Status
 
-By default, only completed jobs are processed:
+By default, `medarc-eval process` only selects jobs whose manifest status is `completed`.
 
-```bash
-# Include incomplete runs
-medarc-eval process --process-incomplete
+Note: successful jobs are written to `run_manifest.json` with `status: completed`.
 
-# Filter by specific status
+To override that default, pass one or more explicit status filters:
+
+```bash
 medarc-eval process --status completed --status failed
 ```
 
+You can also reject partially complete outputs based on the percentage of missing `results.jsonl` rows:
+
+```bash
+# Default tolerance is 2.5 percent missing
+medarc-eval process --max-results-missing-pct 2.5
+
+# Effectively disable the gate
+medarc-eval process --max-results-missing-pct 100
+```
+
+This gate uses manifest job metadata only:
+
+- `expected_rows = num_examples * rollouts_per_example`
+- `observed_rows = row_count`
+
+The missing-row percentage is computed per selected job record, and the gate is enforced only on the latest selected run for each processed model/environment output. The gate does not use manifest `summary.completed` / `summary.total`, and it does not fall back to older runs if the latest one is too incomplete.
+
+Selected records whose `results.jsonl` file is missing fail processing immediately.
+
 ### Latest Runs Only
 
 When multiple runs exist for the same (model, environment) pair, processing uses the latest by default.
@@ -86,13 +107,19 @@ Store common options in a YAML file:
 ```yaml
 # process-config.yaml
 runs_dir: runs/raw
-output_dir: runs/processed
-max_workers: 8
-process_incomplete: false
-exclude_datasets:
-  - med_dialog
-exclude_models:
-  - deprecated-v1
+
+process:
+  dir: processed
+  max_workers: 8
+  max_results_missing_pct: 2.5
+  exclude_datasets:
+    - med_dialog
+  exclude_models:
+    - deprecated-v1
+
+winrate:
+  enabled: true
+  dir: winrate
 ```
 
 ```bash
@@ -101,14 +128,44 @@ medarc-eval process --config process-config.yaml
 
 CLI flags override config values.
 
+Supported config schema for `medarc-eval process`:
+
+- Top-level `runs_dir`: raw run root.
+- Top-level `process:`: process-specific defaults.
+- Optional top-level `winrate:`: embedded post-process winrate step.
+- Optional top-level `hf:`: shared HF settings. For embedded winrate uploads, use `hf.winrate_dir`.
+
+Path shortcuts:
+
+- `process.dir` is shorthand for `process.output_dir`, resolved relative to the parent of `runs_dir`.
+- `winrate.dir` is shorthand for the embedded winrate output directory, resolved under the processed output dir.
+
+Example:
+
+```yaml
+runs_dir: runs/raw
+
+process:
+  dir: processed
+  max_workers: 8
+
+winrate:
+  dir: scorecards
+
+hf:
+  repo: your-org/medical-benchmarks
+  winrate_dir: scorecards/latest
+```
+
 ## Hugging Face Integration
 
 Sync processed datasets to/from the Hugging Face Hub:
 
 ```yaml
 # process-config.yaml
 runs_dir: runs/raw
-output_dir: runs/processed
+process:
+  dir: processed
 
 hf:
   repo: your-org/medical-benchmarks
@@ -117,6 +174,8 @@ hf:
   private: true
 ```
 
+`hf.token` accepts either a literal token string or an environment reference like `$HF_TOKEN` / `${HF_TOKEN}`.
+
 ### Pull Before Processing
 
 ```bash
@@ -128,8 +187,24 @@ medarc-eval process --hf-repo your-org/data --hf-pull-policy pull
 
 # Start fresh (ignore remote)
 medarc-eval process --hf-repo your-org/data --hf-pull-policy clean
+
+# Resume a previously failed HF upload without pulling or cleaning
+medarc-eval process --hf-repo your-org/data --hf-pull-policy continue-upload
 ```
 
+`prompt` prompts only when the local processed dir is already non-empty. If the output dir is empty, `medarc-eval process` pulls the HF baseline immediately.
+
+When `prompt` is used with a non-empty local processed dir, the menu may show:
+
+- `pull`: download missing baseline data without deleting local files
+- `clean`: redownload everything after deleting local files
+- `upload`: keep local processed outputs and resume/upload pending HF artifacts
+
+`upload` is shown only when local parquet files appear to be missing remotely or have a different remote `lfs.sha256`. Recovery uploads the union of:
+
+- parquet files that were already pending before the current run started
+- files touched by the current process run, including `env_index.json` and `dataset_infos.json` when rewritten
+
 ### Push After Processing
 
 When `--hf-repo` is set, processed files are automatically uploaded after completion.
@@ -139,10 +214,10 @@ When `--hf-repo` is set, processed files are automatically uploaded after comple
 Process and compute win rates in one step:
 
 ```bash
-medarc-eval process --winrate winrate-config.yaml
+medarc-eval process --config process-config.yaml
 ```
 
-This runs `medarc-eval winrate` automatically after processing completes.
+This runs `medarc-eval winrate` automatically after processing completes when the config contains a `winrate:` section.
 
 ## Example Workflows
 
@@ -180,18 +255,65 @@ medarc-eval process
 # env_index.json tracks what's already processed
 ```
 
+Incremental skipping only reuses an existing parquet when its footer metadata `source_runs` still matches the newly selected run ids and the existing row count still matches `env_index.json`.
+
+### Replace Existing Outputs
+
+Rebuild existing outputs for specific models or datasets without using `--clean`:
+
+```bash
+# Rebuild every processed dataset for one model
+medarc-eval process --replace-model gpt-4o
+
+# Rebuild every model for one dataset
+medarc-eval process --replace-env medqa
+
+# Rebuild only the intersection
+medarc-eval process --replace-model gpt-4o --replace-env medqa
+```
+
+When both flags are present, processing only rebuilds outputs that match both filters.
+
 ## Troubleshooting
 
 ### "No runs found"
 
 Check that:
 1. `--runs-dir` points to the correct location
-2. Runs have completed (check `run_manifest.json` status)
-3. Use `--process-incomplete` if runs are still in progress
+2. Runs have completed (check `run_manifest.json` `jobs[*].status`)
+3. You pass `--status pending` or `--status running` if you want to include non-completed jobs
 
 ### Missing data in output
 
-By default, only jobs with `completed` status are included. Use `--process-incomplete` to include partial results.
+By default, only jobs with `completed` status are included. In addition, the `--max-results-missing-pct` gate fails processing if the latest selected job record is missing more than the allowed percentage (default 2.5%) of its expected `results.jsonl` rows. The percentage is computed from these manifest job fields:
+
+- `row_count`
+- `num_examples`
+- `rollouts_per_example`
+
+The gate is per selected record, not per whole run manifest. If the latest selected run for a model/dataset is too incomplete, processing fails fast instead of silently falling back to an older run. Records with unknown expected rows or unknown `row_count` are not gated.
+
+Use `--max-results-missing-pct 100` to disable the gate, or pass explicit `--status` values to include other statuses.
+
+### Integrity-check failures for existing parquet files
+
+If processing stops with an error like:
+
+```text
+Existing processed output ... has N parquet rows but env_index.json records M.
+```
+
+the local processed snapshot is inconsistent. Fix it by rebuilding the affected output:
+
+```bash
+medarc-eval process --replace-model gpt-4o --replace-env medqa
+```
+
+Or rebuild everything:
+
+```bash
+medarc-eval process --clean --yes
+```
 
 ## Next Steps
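
Reading aid for the `--max-results-missing-pct` gate documented in this diff: the sketch below works through the arithmetic from the manifest job fields `num_examples`, `rollouts_per_example`, and `row_count`. It is not the tool's implementation. The function names, the plain-dict job record, the clamping of negative missing counts to zero, and the handling of a non-positive expected row count are all assumptions made for illustration.

```python
from typing import Optional


def missing_pct(job: dict) -> Optional[float]:
    """Percentage of expected results.jsonl rows that are missing.

    Returns None when expected rows or row_count are unknown, matching the
    documented rule that such records are not gated.
    """
    num_examples = job.get("num_examples")
    rollouts = job.get("rollouts_per_example")
    row_count = job.get("row_count")
    if num_examples is None or rollouts is None or row_count is None:
        return None

    # expected_rows = num_examples * rollouts_per_example (from the docs)
    expected_rows = num_examples * rollouts
    if expected_rows <= 0:
        # Assumption: a non-positive expectation is treated as unknown.
        return None

    # Assumption: extra rows never count as "negative missing".
    missing = max(expected_rows - row_count, 0)
    return 100.0 * missing / expected_rows


def passes_gate(job: dict, max_missing_pct: float = 2.5) -> bool:
    """True when the record is ungated or within the allowed tolerance."""
    pct = missing_pct(job)
    return pct is None or pct <= max_missing_pct
```

For example, a job with 100 examples and 2 rollouts per example expects 200 rows. With 196 observed rows it is missing 2.0% and passes the default gate, while 190 observed rows (5.0% missing) fails it. Passing a tolerance of 100 lets every record through, which is consistent with the documented way to disable the gate.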