diff --git a/docs/medarc-eval-process.md b/docs/medarc-eval-process.md
index 7e3ca3c6..1b57c7ba 100644
--- a/docs/medarc-eval-process.md
+++ b/docs/medarc-eval-process.md
@@ -5,7 +5,7 @@ Convert raw benchmark outputs into analysis-ready parquet files. This step prepa
 ## Quick Start
 
 ```bash
-# Process all completed runs (uses defaults)
+# Process all completed jobs (uses defaults)
 medarc-eval process
 
 # Specify directories explicitly
@@ -17,10 +17,10 @@ medarc-eval process --dry-run
 
 ## What Processing Does
 
-1. **Discovers** completed jobs in `runs/raw/`
+1. **Discovers** jobs in `runs/raw/` and filters by manifest status (default: `completed`)
 2. **Extracts** results from each job's output files
-3. **Normalizes** data into a consistent schema
-4. **Writes** parquet files organized by environment and model
+3. **Normalizes** data into a fixed output schema
+4. **Writes** parquet files organized by model and environment
 5. **Creates** an index (`env_index.json`) for downstream tools
 
 ### Output Structure
@@ -28,22 +28,24 @@ medarc-eval process --dry-run
 ```
 runs/processed/
 ├── env_index.json              # Dataset inventory for winrate/analysis
-├── medqa/
-│   ├── gpt-4o.parquet
-│   └── gpt-4o-mini.parquet
-├── pubmedqa/
-│   ├── gpt-4o.parquet
-│   └── gpt-4o-mini.parquet
+├── gpt-4o/
+│   ├── medqa.parquet
+│   └── pubmedqa.parquet
+├── gpt-4o-mini/
+│   ├── medqa.parquet
+│   └── pubmedqa.parquet
 └── ...
 ```
 
+On-disk model and env path components are slugified, so filenames may not exactly match raw ids.
+
 ## Common Options
 
 | Flag | Description | Default |
 |------|-------------|---------|
 | `--runs-dir PATH` | Directory containing raw runs | `runs/raw` |
 | `--output-dir PATH` | Where to write processed files | `runs/processed` |
-| `--max-workers N` | Parallel processing threads | 4 |
+| `--max-workers N` | Parallel worker processes | 4 |
 | `--dry-run` | Show what would be processed | - |
 | `--yes` | Skip confirmation prompts | - |
 | `--exclude-dataset NAME` | Skip processing specific datasets/env ids (repeatable) | - |
@@ -53,16 +55,35 @@ runs/processed/
 
 ### By Completion Status
 
-By default, only completed jobs are processed:
+By default, `medarc-eval process` only selects jobs whose manifest status is `completed`.
 
-```bash
-# Include incomplete runs
-medarc-eval process --process-incomplete
+Note: successful jobs are written to `run_manifest.json` with `status: completed`.
 
-# Filter by specific status
+To override that default, pass one or more explicit status filters:
+
+```bash
 medarc-eval process --status completed --status failed
 ```
 
+You can also reject partially complete outputs based on the percentage of missing `results.jsonl` rows:
+
+```bash
+# Default tolerance is 2.5 percent missing
+medarc-eval process --max-results-missing-pct 2.5
+
+# Effectively disable the gate
+medarc-eval process --max-results-missing-pct 100
+```
+
+This gate uses manifest job metadata only:
+
+- `expected_rows = num_examples * rollouts_per_example`
+- `observed_rows = row_count`
+
+It is computed per selected job record and enforced only on the latest selected run for each processed model/environment output. It does not use manifest `summary.completed` / `summary.total`, and it does not fall back to older runs if the latest one is too incomplete.
+
+Selected records with missing `results.jsonl` fail processing immediately.
+
 ### Latest Runs Only
 
 When multiple runs exist for the same (model, environment) pair, processing uses the latest by default.
@@ -86,13 +107,19 @@ Store common options in a YAML file:
 ```yaml
 # process-config.yaml
 runs_dir: runs/raw
-output_dir: runs/processed
-max_workers: 8
-process_incomplete: false
-exclude_datasets:
-  - med_dialog
-exclude_models:
-  - deprecated-v1
+
+process:
+  dir: processed
+  max_workers: 8
+  max_results_missing_pct: 2.5
+  exclude_datasets:
+    - med_dialog
+  exclude_models:
+    - deprecated-v1
+
+winrate:
+  enabled: true
+  dir: winrate
 ```
 
 ```bash
@@ -101,6 +128,35 @@ medarc-eval process --config process-config.yaml
 
 CLI flags override config values.
 
+Supported config schema for `medarc-eval process`:
+
+- Top-level `runs_dir`: raw run root.
+- Top-level `process:`: process-specific defaults.
+- Optional top-level `winrate:`: embedded post-process winrate step.
+- Optional top-level `hf:`: shared HF settings. For embedded winrate uploads, use `hf.winrate_dir`.
+
+Path shortcuts:
+
+- `process.dir` is shorthand for `process.output_dir`, resolved relative to the parent of `runs_dir`.
+- `winrate.dir` is shorthand for the embedded winrate output directory, resolved under the processed output dir.
+
+Example:
+
+```yaml
+runs_dir: runs/raw
+
+process:
+  dir: processed
+  max_workers: 8
+
+winrate:
+  dir: scorecards
+
+hf:
+  repo: your-org/medical-benchmarks
+  winrate_dir: scorecards/latest
+```
+
 ## Hugging Face Integration
 
 Sync processed datasets to/from the Hugging Face Hub:
@@ -108,7 +164,8 @@ Sync processed datasets to/from the Hugging Face Hub:
 ```yaml
 # process-config.yaml
 runs_dir: runs/raw
-output_dir: runs/processed
+process:
+  dir: processed
 
 hf:
   repo: your-org/medical-benchmarks
@@ -117,6 +174,8 @@ hf:
   private: true
 ```
 
+`hf.token` accepts either a literal token string or an environment reference like `$HF_TOKEN` / `${HF_TOKEN}`.
+
 ### Pull Before Processing
 
 ```bash
@@ -128,8 +187,24 @@ medarc-eval process --hf-repo your-org/data --hf-pull-policy pull
 
 # Start fresh (ignore remote)
 medarc-eval process --hf-repo your-org/data --hf-pull-policy clean
+
+# Resume a previously failed HF upload without pulling or cleaning
+medarc-eval process --hf-repo your-org/data --hf-pull-policy continue-upload
 ```
 
+`prompt` only prompts when the local processed dir is already non-empty. If the output dir is empty, process pulls the HF baseline immediately.
+
+When `prompt` is used with a non-empty local processed dir, the menu may show:
+
+- `pull`: download missing baseline data without deleting local files
+- `clean`: redownload everything after deleting local files
+- `upload`: keep local processed outputs and resume/upload pending HF artifacts
+
+`upload` is shown only when local parquet files appear to be missing remotely or have a different remote `lfs.sha256`. Recovery uploads the union of:
+
+- parquet files that were already pending before the current run started
+- files touched by the current process run, including `env_index.json` and `dataset_infos.json` when rewritten
+
 ### Push After Processing
 
 When `--hf-repo` is set, processed files are automatically uploaded after completion.
@@ -139,10 +214,10 @@ When `--hf-repo` is set, processed files are automatically uploaded after comple
 Process and compute win rates in one step:
 
 ```bash
-medarc-eval process --winrate winrate-config.yaml
+medarc-eval process --config process-config.yaml
 ```
 
-This runs `medarc-eval winrate` automatically after processing completes.
+This runs `medarc-eval winrate` automatically after processing completes when the config contains a `winrate:` section.
 
 ## Example Workflows
 
@@ -180,18 +255,65 @@ medarc-eval process
 # env_index.json tracks what's already processed
 ```
 
+Incremental skipping only reuses an existing parquet when its footer metadata `source_runs` still matches the newly selected run ids and the existing row count still matches `env_index.json`.
+
+### Replace Existing Outputs
+
+Rebuild existing outputs for specific models or datasets without using `--clean`:
+
+```bash
+# Rebuild every processed dataset for one model
+medarc-eval process --replace-model gpt-4o
+
+# Rebuild every model for one dataset
+medarc-eval process --replace-env medqa
+
+# Rebuild only the intersection
+medarc-eval process --replace-model gpt-4o --replace-env medqa
+```
+
+When both flags are present, processing only rebuilds outputs that match both filters.
+
 ## Troubleshooting
 
 ### "No runs found"
 
 Check that:
 1. `--runs-dir` points to the correct location
-2. Runs have completed (check `run_manifest.json` status)
-3. Use `--process-incomplete` if runs are still in progress
+2. Runs have completed (check `run_manifest.json` `jobs[*].status`)
+3. Use `--status pending` or `--status running` to include non-completed jobs
 
 ### Missing data in output
 
-By default, only jobs with `completed` status are included. Use `--process-incomplete` to include partial results.
+By default, only jobs with `completed` status are included. In addition, processing fails if the latest selected job record is missing more than the `--max-results-missing-pct` threshold (default 2.5%) of its expected `results.jsonl` rows. The check uses these manifest job fields:
+
+- `row_count`
+- `num_examples`
+- `rollouts_per_example`
+
+The gate is per selected record, not per whole run manifest. If the latest selected run for a model/dataset is too incomplete, processing fails fast instead of silently falling back to an older run. Records with unknown expected rows or unknown `row_count` are not gated.
+
+Use `--max-results-missing-pct 100` to disable the gate, or pass explicit `--status` values to include other statuses.
+
+### Integrity-check failures for existing parquet files
+
+If processing stops with an error like:
+
+```text
+Existing processed output ... has N parquet rows but env_index.json records M.
+```
+
+the local processed snapshot is inconsistent. Fix it by rebuilding the affected output:
+
+```bash
+medarc-eval process --replace-model gpt-4o --replace-env medqa
+```
+
+Or rebuild everything:
+
+```bash
+medarc-eval process --clean --yes
+```
 
 ## Next Steps
 
diff --git a/docs/medarc-eval-winrate.md b/docs/medarc-eval-winrate.md
index d1f50e99..47c28f92 100644
--- a/docs/medarc-eval-winrate.md
+++ b/docs/medarc-eval-winrate.md
@@ -12,7 +12,7 @@ medarc-eval winrate
 medarc-eval winrate --list-models
 
 # Specify directories
-medarc-eval winrate --processed-dir runs/processed --output-dir runs/winrate
+medarc-eval winrate --processed-dir runs/processed --output-dir runs/processed/winrate
 ```
 
 ## Prerequisites
@@ -27,30 +27,35 @@ medarc-eval process
 ## How Win Rates Work
 
 For each pair of models (A, B) on each benchmark:
-1. Find questions both models answered
-2. Compare scores on each question
-3. Count: A wins, B wins, ties
-4. Win rate = (A wins + 0.5 × ties) / total
+1. Average rollouts per `(example_id, model_id)`
+2. Compare questions where at least one model has a reward
+3. If one side is missing, fill it according to `--missing-policy` (`neg-inf` or `zero`)
+4. Count: A wins, B wins, ties
+5. Win rate = (A wins + 0.5 × ties) / total used questions
 
 The final win rate aggregates across all benchmarks using configurable weighting.
 
+Winrate also emits a missingness summary so partial dataset coverage is visible. The report counts missing
+`(dataset, model)` pairs after rollout averaging, including both absent rows and null reward values.
+
 ## Output Files
 
 ```
-runs/winrate/
-├── winrates-2026-01-14T12-00-00.json    # Timestamped results
-├── winrates-2026-01-14T12-00-00.csv     # Spreadsheet-friendly
+runs/processed/winrate/
+├── winrates-20260114T120000Z.json       # Timestamped results
+├── winrates-20260114T120000Z.csv        # Spreadsheet-friendly
 ├── latest.json                           # Always points to newest
 └── latest.csv
 ```
 
+If you pass `--output /path/to/file.json`, winrate writes only that JSON file and skips `latest.json` plus all CSV outputs.
+
 ### Output Format
 
 The JSON output includes:
 - Per-model aggregate win rates
-- Pairwise comparison matrices
-- Per-benchmark breakdowns
-- Computation metadata
+- Per-opponent `vs` breakdowns
+- Per-dataset average rewards and question counts
 
 ## Common Options
 
@@ -92,33 +97,40 @@ The JSON output includes:
 | `--partial-datasets strict` | When `--include-model` is set, drop datasets missing any included model |
 | `--partial-datasets include` | When `--include-model` is set, keep datasets and treat missing models as all-missing |
 
+`--partial-datasets include` is usually paired with `--dataset-coverage per-model`. With the default `all-models` coverage, datasets missing any required model are still dropped later.
+
 ## Using a Config File
 
 ```yaml
-# winrate-config.yaml
-processed_dir: runs/processed
-output_dir: runs/winrate
-
-# Calculation settings
-missing_policy: neg-inf
-epsilon: 1.0e-9
-min_common: 10
-weight_policy: ln
-
-# Model filtering
-exclude_model:
-  - baseline-model
-  - deprecated-v1
-
-# Dataset filtering
-exclude_datasets:
-  - med_dialog
+# process-config.yaml
+runs_dir: runs/raw
+
+process:
+  dir: processed
+
+winrate:
+  dir: winrate
+  missing_policy: neg-inf
+  epsilon: 1.0e-9
+  min_common: 10
+  weight_policy: ln
+  exclude_model:
+    - baseline-model
+    - deprecated-v1
+  exclude_datasets:
+    - med_dialog
 ```
 
 ```bash
-medarc-eval winrate --config winrate-config.yaml
+medarc-eval winrate --config process-config.yaml
 ```
 
+Supported config schema for `medarc-eval winrate`:
+
+- Top-level `process:` can provide `dir` or `output_dir`; this becomes the default `processed_dir`.
+- Top-level `winrate:` provides winrate-specific defaults.
+- Top-level `hf:` provides shared HF settings. Use `hf.winrate_dir` to control where winrate artifacts upload inside the repo.
+
 ## Example Workflows
 
 ### Compare Specific Models
@@ -161,7 +173,7 @@ medarc-eval winrate --weight-policy ln
 
 ```bash
 medarc-eval winrate \
-  --hf-processed-repo your-org/processed-benchmarks \
+  --hf-repo your-org/processed-benchmarks \
   --hf-processed-pull \
   --hf-token $HF_TOKEN
 ```
@@ -170,7 +182,8 @@ medarc-eval winrate \
 
 ```bash
 medarc-eval winrate \
-  --hf-winrate-repo your-org/winrate-results \
+  --hf-repo your-org/processed-benchmarks \
+  --hf-winrate-dir winrate \
   --hf-token $HF_TOKEN \
   --hf-private
 ```
@@ -178,52 +191,81 @@ medarc-eval winrate \
 ### Full Config with HF
 
 ```yaml
-# winrate-config.yaml
-processed_dir: runs/processed
-output_dir: runs/winrate
+# process-config.yaml
+runs_dir: runs/raw
+
+process:
+  dir: processed
 
-missing_policy: neg-inf
-weight_policy: ln
+winrate:
+  dir: winrate
+  missing_policy: neg-inf
+  weight_policy: ln
 
 hf:
-  repo: your-org/processed-data          # Pull processed from here
-  winrate_repo: your-org/winrate-results # Upload results here
+  repo: your-org/processed-data # Pull processed from here; upload winrate here
+  winrate_dir: winrate          # Subdirectory in repo for winrate artifacts (default: winrate)
   branch: main
   token: ${HF_TOKEN}
   private: true
 ```
 
+`hf.token` accepts either a literal token string or an environment reference like `$HF_TOKEN` / `${HF_TOKEN}`.
+
+`hf.winrate_dir` and `--hf-winrate-dir` both set the path inside the HF repo where `latest.json`, `latest.csv`, and timestamped winrate outputs are uploaded.
+
 ## Interpreting Results
 
 ### Win Rate Table (CSV)
 
-| model | win_rate | vs_gpt-4o | vs_gpt-4o-mini | vs_claude |
-|-------|----------|-----------|----------------|-----------|
-| gpt-4o | 0.72 | - | 0.85 | 0.58 |
-| gpt-4o-mini | 0.45 | 0.15 | - | 0.32 |
-| claude-3-5-sonnet | 0.68 | 0.42 | 0.68 | - |
+| model | weighted_winrate | simple_winrate | medqa | pubmedqa | num_datasets |
+|-------|------------------|----------------|-------|-----------|--------------|
+| gpt-4o | 0.72 | 0.70 | 0.84 | 0.77 | 2 |
+| gpt-4o-mini | 0.45 | 0.43 | 0.61 | 0.39 | 2 |
 
-- **win_rate**: Aggregate win rate across all models
-- **vs_X columns**: Pairwise win rate against model X
-- Values > 0.5 mean the row model wins more often
+- **weighted_winrate** / **simple_winrate**: Aggregate mean winrate across retained datasets
+- Dataset columns: Average reward on that dataset, not pairwise winrate columns
+- `num_datasets`: Number of datasets retained for that model after filtering/coverage rules
 
 ### JSON Structure
 
 ```json
 {
-  "models": ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"],
-  "aggregate_winrates": {
-    "gpt-4o": 0.72,
-    "gpt-4o-mini": 0.45,
-    "claude-3-5-sonnet": 0.68
-  },
-  "pairwise": {
+  "models": {
     "gpt-4o": {
-      "gpt-4o-mini": {"win_rate": 0.85, "wins": 850, "losses": 150, "ties": 0},
-      "claude-3-5-sonnet": {"win_rate": 0.58, ...}
+      "mean_winrate": {
+        "simple_mean": 0.72,
+        "weighted_mean": 0.74,
+        "n_datasets": 2
+      },
+      "vs": {
+        "gpt-4o-mini": {
+          "mean_winrate": {
+            "simple_mean": 0.85,
+            "weighted_mean": 0.84
+          },
+          "per_dataset": {
+            "medqa": 0.90,
+            "pubmedqa": 0.80
+          },
+          "n_datasets": 2
+        }
+      },
+      "avg_reward_per_dataset": {
+        "medqa": 0.84,
+        "pubmedqa": 0.77
+      }
+    }
+  },
+  "datasets": {
+    "medqa": {
+      "avg_reward_per_model": {
+        "gpt-4o": 0.84,
+        "gpt-4o-mini": 0.61
+      },
+      "n_questions": 1273
     }
-  },
+  }
-  "per_benchmark": { ... }
 }
 ```
 
@@ -239,3 +281,4 @@ hf:
 - Check `--min-common` isn't filtering out comparisons
 - Review `--missing-policy` (use `neg-inf` to penalize missing answers)
 - Verify models were evaluated on the same benchmark variants
+- If using `--partial-datasets include`, also consider `--dataset-coverage per-model`
diff --git a/docs/medarc-eval.md b/docs/medarc-eval.md
index a9e48a39..395d251f 100644
--- a/docs/medarc-eval.md
+++ b/docs/medarc-eval.md
@@ -27,7 +27,7 @@ medarc-eval winrate
    (bench or single)       (process)               (winrate)
         |                      |                        |
         v                      v                        v
-    runs/raw/           runs/processed/          runs/winrate/
+    runs/raw/           runs/processed/    runs/processed/winrate/
 ```
 
 ## Commands
diff --git a/docs/medarc-verifiers-architecture.md b/docs/medarc-verifiers-architecture.md
index 7eddf092..d9f25cd2 100644
--- a/docs/medarc-verifiers-architecture.md
+++ b/docs/medarc-verifiers-architecture.md
@@ -16,7 +16,7 @@ At a high level, everything funnels into a three-stage workflow:
 
 1. **Run** evals (single or batch) → `runs/raw/<run_id>/...`
 2. **Process** raw outputs → `runs/processed/<model>/<env>.parquet` + `env_index.json`
-3. **Winrate** on processed outputs → `runs/winrate/*.json` and `*.csv`
+3. **Winrate** on processed outputs → `runs/processed/winrate/*.json` and `*.csv`
 
 ## Important side effects (auto-installed patches)
 
@@ -173,7 +173,7 @@ Entry point: `medarc_verifiers/cli/process/pipeline.py` (via `run_process()`).
    - This suffix-derived rollout index is only used when rollouts are faked this way. Native verifiers rollouts (below) use the per-row JSONL field.
    - `medarc_verifiers/cli/process/rollout.py`
 4. **Load rows from `results.jsonl`**:
-   - Drops large fields (`prompt`, `completion`) by default.
+   - Always drops large fields (`prompt`, `completion`).
    - Allows selecting extra per-env columns into a JSON-encoded `extras` column.
    - If the JSONL provides a per-row `rollout_index` (native verifiers multi-rollout runs), it is treated as authoritative and preserved.
    - If `rollout_index` is missing but the JSONL contains multiple rows per `example_id`, computes a data-driven `rollout_index` based on occurrence count.
@@ -184,7 +184,8 @@ Entry point: `medarc_verifiers/cli/process/pipeline.py` (via `run_process()`).
    - When aggregating fake rollouts (manifest env ids include rollout suffixes), ensures every row has a `rollout_index` (derived from the suffix if missing) and normalizes indices to `0..K-1` within the dataset.
    - When aggregating native verifiers rollouts (no rollout suffixes), preserves `rollout_index` values as provided by `results.jsonl` (no normalization).
 6. **Write Parquet**:
-   - Output path is `<processed_dir>/<model_id>/<env_id>.parquet`.
+   - Output path is `<processed_dir>/<slug(model_id)>/<slug(env_id)>.parquet`.
+   - Output columns are restricted to a fixed allowlist schema for downstream compatibility.
    - Adds exporter metadata under a Parquet schema metadata key.
    - Writes `env_index.json` (v2) and `dataset_infos.json` for HF datasets UX.
    - `medarc_verifiers/cli/process/writer.py`, `medarc_verifiers/cli/process/env_index.py`
@@ -200,13 +201,15 @@ Processing can use `env_index.json` to do incremental updates (delta processing)
 
 Docs: `docs/medarc-eval-winrate.md`.
 
-`medarc-eval winrate` reads dataset inventory from `env_index.json`, then computes pairwise model comparisons.
+`medarc-eval winrate` reads dataset inventory from `env_index.json`, averages rollouts per `(example_id, model_id)`, then computes pairwise model comparisons.
 
 - Dataset discovery via `env_index.json`: `medarc_verifiers/cli/winrate/runner.py`
 - Core math + weighting policies: `medarc_verifiers/cli/winrate/api.py`
 - Outputs:
   - timestamped `winrates-<timestamp>.json` and `.csv`
   - `latest.json` and `latest.csv`
+  - JSON shape is model-centric: top-level `models` and `datasets`
+  - CSV contains aggregate winrates plus per-dataset average rewards, not pairwise `vs_*` columns
 
 ## Shared building blocks used by environments
 
diff --git a/medarc_verifiers/cli/_constants.py b/medarc_verifiers/cli/_constants.py
index 41a840dd..a466e47b 100644
--- a/medarc_verifiers/cli/_constants.py
+++ b/medarc_verifiers/cli/_constants.py
@@ -20,4 +20,4 @@
 DEFAULT_ENV_CONFIG_ROOT = Path("configs") / "envs"
 DEFAULT_RUNS_RAW_DIR = Path("runs") / "raw"
 DEFAULT_PROCESSED_DIR = Path("runs") / "processed"
-DEFAULT_WINRATE_DIR = Path("runs") / "winrate"
+DEFAULT_WINRATE_DIR = DEFAULT_PROCESSED_DIR / "winrate"
diff --git a/medarc_verifiers/cli/_manifest_tools.py b/medarc_verifiers/cli/_manifest_tools.py
index 5ba9effc..836fd9d2 100644
--- a/medarc_verifiers/cli/_manifest_tools.py
+++ b/medarc_verifiers/cli/_manifest_tools.py
@@ -2,14 +2,16 @@
 
 from __future__ import annotations
 
 import json
 import logging
+import os
+import sys
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from dataclasses import dataclass
 from pathlib import Path
-from typing import Sequence
+from typing import Any, Mapping, Sequence
 
 from medarc_verifiers.cli._manifest import MANIFEST_FILENAME, RunManifestModel, SUPPORTED_MANIFEST_VERSIONS
-from medarc_verifiers.cli.utils.shared import count_jsonl_rows
 
 logger = logging.getLogger(__name__)
 
@@ -41,90 +43,179 @@ def validate_manifests_in_runs(runs_dir: Path | str, *, strict: bool = False) ->
     if not runs_path.exists():
         return ManifestValidationResult(manifests_checked=0, jobs_checked=0, issues=[])
 
-    for run_dir in sorted(path for path in runs_path.iterdir() if path.is_dir()):
-        manifest_path = run_dir / MANIFEST_FILENAME
-        if not manifest_path.exists():
-            continue
-        manifests_checked += 1
-        try:
-            payload = json.loads(manifest_path.read_text(encoding="utf-8"))
-        except Exception as exc:  # noqa: BLE001
-            issues.append(
+    run_dirs = sorted(path for path in runs_path.iterdir() if path.is_dir())
+    logger.info("Scanning manifests under %s...", runs_path)
+
+    manifest_run_dirs = [run_dir for run_dir in run_dirs if (run_dir / MANIFEST_FILENAME).exists()]
+    if not manifest_run_dirs:
+        return ManifestValidationResult(manifests_checked=0, jobs_checked=0, issues=[])
+
+    max_workers = min(len(manifest_run_dirs), max(1, (os.cpu_count() or 4) * 4))
+    if max_workers <= 1:
+        results = [_validate_run_dir(run_dir, strict=strict) for run_dir in manifest_run_dirs]
+    else:
+        results = list(_validate_run_dirs_parallel(manifest_run_dirs, strict=strict, max_workers=max_workers))
+
+    for result in results:
+        manifests_checked += result.manifests_checked
+        jobs_checked += result.jobs_checked
+        issues.extend(result.issues)
+
+    issues.sort(key=lambda item: (item.run_id, item.job_id, item.kind, item.message))
+    return ManifestValidationResult(manifests_checked=manifests_checked, jobs_checked=jobs_checked, issues=issues)
+
+
+def _validate_run_dirs_parallel(
+    run_dirs: Sequence[Path],
+    *,
+    strict: bool,
+    max_workers: int,
+) -> list[ManifestValidationResult]:
+    results: list[ManifestValidationResult] = []
+    progress, task_id = _create_manifest_scan_progress(len(run_dirs))
+    executor: ThreadPoolExecutor | None = None
+    futures = []
+    try:
+        executor = ThreadPoolExecutor(max_workers=max_workers)
+        futures = [executor.submit(_validate_run_dir, run_dir, strict=strict) for run_dir in run_dirs]
+        if progress is not None and task_id is not None:
+            with progress:
+                for future in as_completed(futures):
+                    results.append(future.result())
+                    progress.update(task_id, advance=1)
+        else:
+            for future in as_completed(futures):
+                results.append(future.result())
+    except KeyboardInterrupt:
+        logger.warning("Manifest scanning interrupted; cancelling validation workers.")
+        for future in futures:
+            future.cancel()
+        if executor is not None:
+            executor.shutdown(wait=False, cancel_futures=True)
+            executor = None
+        raise
+    finally:
+        if executor is not None:
+            executor.shutdown(wait=True, cancel_futures=False)
+    return results
+
+
+def _create_manifest_scan_progress(total: int) -> tuple[object | None, object | None]:
+    if total <= 0 or not sys.stderr.isatty():
+        return None, None
+    try:
+        from rich.progress import BarColumn, Progress, SpinnerColumn, TaskProgressColumn, TextColumn, TimeElapsedColumn
+
+        progress = Progress(
+            SpinnerColumn(),
+            TextColumn("[progress.description]{task.description}"),
+            BarColumn(),
+            TaskProgressColumn(),
+            TimeElapsedColumn(),
+            transient=True,
+        )
+        task_id = progress.add_task("Scanning manifests", total=total)
+        return progress, task_id
+    except Exception:  # noqa: BLE001 - progress display is optional
+        return None, None
+
+
+def _validate_run_dir(run_dir: Path, *, strict: bool) -> ManifestValidationResult:
+    issues: list[ManifestValidationIssue] = []
+    manifest_path = run_dir / MANIFEST_FILENAME
+    if not manifest_path.exists():
+        return ManifestValidationResult(manifests_checked=0, jobs_checked=0, issues=[])
+
+    try:
+        payload = json.loads(manifest_path.read_text(encoding="utf-8"))
+    except Exception as exc:  # noqa: BLE001
+        return ManifestValidationResult(
+            manifests_checked=1,
+            jobs_checked=0,
+            issues=[
                 ManifestValidationIssue(
                     run_id=run_dir.name,
                     job_id="",
                     kind="error",
                     message=f"Failed to parse manifest: {exc}",
                 )
-            )
-            continue
+            ],
+        )
 
-        version = payload.get("version")
-        if version not in SUPPORTED_MANIFEST_VERSIONS:
-            issues.append(
+    version = payload.get("version")
+    if version not in SUPPORTED_MANIFEST_VERSIONS:
+        return ManifestValidationResult(
+            manifests_checked=1,
+            jobs_checked=0,
+            issues=[
                 ManifestValidationIssue(
                     run_id=run_dir.name,
                     job_id="",
                     kind="error",
                     message=f"Unsupported manifest version: {version}",
                 )
+            ],
+        )
+
+    model = RunManifestModel.model_validate(payload)
+    artifacts_root = str(getattr(model, "artifacts_root", ".") or ".")
+    jobs_checked = 0
+
+    for entry in model.jobs:
+        jobs_checked += 1
+        results_path, metadata_path, used_fallback = _resolve_job_artifact_paths(
+            run_dir=run_dir,
+            artifacts_root=artifacts_root,
+            job_id=entry.job_id,
+            results_relpath=entry.results_relpath,
+            metadata_relpath=entry.metadata_relpath,
+        )
+        if used_fallback:
+            issues.append(
+                ManifestValidationIssue(
+                    run_id=model.run_id,
+                    job_id=entry.job_id,
+                    kind="warning",
+                    message="Manifest artifact path missing; fallback to run-relative job directory would be used.",
+                )
             )
-            continue
-        model = RunManifestModel.model_validate(payload)
-        artifacts_root = str(getattr(model, "artifacts_root", ".") or ".")
-
-        for entry in model.jobs:
-            jobs_checked += 1
-            results_path, metadata_path, used_fallback = _resolve_job_artifact_paths(
-                run_dir=run_dir,
-                artifacts_root=artifacts_root,
-                job_id=entry.job_id,
-                results_relpath=entry.results_relpath,
-                metadata_relpath=entry.metadata_relpath,
-            )
-            if used_fallback:
-                issues.append(
-                    ManifestValidationIssue(
-                        run_id=model.run_id,
-                        job_id=entry.job_id,
-                        kind="warning",
-                        message="Manifest artifact path missing; fallback to run-relative job directory would be used.",
-                    )
+        if not results_path.exists():
+            kind = "error" if strict else "warning"
+            issues.append(
+                ManifestValidationIssue(
+                    run_id=model.run_id,
+                    job_id=entry.job_id,
+                    kind=kind,
+                    message=f"Missing results.jsonl at {results_path}",
                 )
-            if not results_path.exists():
+            )
+        else:
+            for message in _quick_validate_results_jsonl(
+                results_path,
+                num_examples=entry.num_examples,
+                rollouts_per_example=entry.rollouts_per_example,
+            ):
                 kind = "error" if strict else "warning"
                 issues.append(
                     ManifestValidationIssue(
                         run_id=model.run_id,
                         job_id=entry.job_id,
                         kind=kind,
-                        message=f"Missing results.jsonl at {results_path}",
+                        message=message,
                     )
                 )
-            if entry.row_count is not None and results_path.exists():
-                row_count = count_jsonl_rows(results_path)
-                if row_count is not None and int(row_count) != int(entry.row_count):
-                    kind = "error" if strict else "warning"
-                    issues.append(
-                        ManifestValidationIssue(
-                            run_id=model.run_id,
-                            job_id=entry.job_id,
-                            kind=kind,
-                            message=f"row_count mismatch: manifest={entry.row_count} actual={row_count}",
-                        )
-                    )
-            # metadata is optional; only flag when declared explicitly in v3.
-            if entry.metadata_relpath and not metadata_path.exists():
-                kind = "error" if strict else "warning"
-                issues.append(
-                    ManifestValidationIssue(
-                        run_id=model.run_id,
-                        job_id=entry.job_id,
-                        kind=kind,
-                        message=f"Missing metadata.json at {metadata_path}",
-                    )
+        if entry.metadata_relpath and not metadata_path.exists():
+            kind = "error" if strict else "warning"
+            issues.append(
+                ManifestValidationIssue(
+                    run_id=model.run_id,
+                    job_id=entry.job_id,
+                    kind=kind,
+                    message=f"Missing metadata.json at {metadata_path}",
                 )
-    return ManifestValidationResult(manifests_checked=manifests_checked, jobs_checked=jobs_checked, issues=issues)
+            )
+
+    return ManifestValidationResult(manifests_checked=1, jobs_checked=jobs_checked, issues=issues)
 
 
 def _resolve_job_artifact_paths(
@@ -153,6 +244,132 @@ def _resolve_job_artifact_paths(
     return results_path, metadata_path, used_fallback
 
 
+def _quick_validate_results_jsonl(
+    path: Path,
+    *,
+    num_examples: int | None,
+    rollouts_per_example: int | None,
+) -> list[str]:
+    first_line = _read_first_nonempty_line(path)
+    last_line = _read_last_nonempty_line(path)
+    if first_line is None or last_line is None:
+        return [f"results.jsonl at {path} is empty"]
+
+    issues: list[str] = []
+    first_payload = _decode_probe_line(first_line, path=path, position="first", issues=issues)
+    last_payload = _decode_probe_line(last_line, path=path, position="last", issues=issues)
+    if first_payload is None or last_payload is None:
+        return issues
+
+    for position, payload in (("first", first_payload), ("last", last_payload)):
+        if "example_id" not in payload:
+            issues.append(f"{position} JSONL row in {path} is missing example_id")
+    _validate_rollout_index(
+        first_payload,
+        path=path,
+        position="first",
+        rollouts_per_example=rollouts_per_example,
+        issues=issues,
+    )
+    _validate_rollout_index(
+        last_payload,
+        path=path,
+        position="last",
+        rollouts_per_example=rollouts_per_example,
+        issues=issues,
+    )
+
+    return issues
+
+
+def _decode_probe_line(
+    raw_line: str,
+    *,
+    path: Path,
+    position: str,
+    issues: list[str],
+) -> Mapping[str, Any] | None:
+    try:
+        payload = json.loads(raw_line)
+    except json.JSONDecodeError as exc:
+        issues.append(f"failed to parse {position} JSONL row in {path}: {exc.msg}")
+        return None
+    if not isinstance(payload, Mapping):
+        issues.append(f"{position} JSONL row in {path} is not a JSON object")
+        return None
+    return payload
+
+
+def _read_first_nonempty_line(path: Path) -> str | None:
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            candidate = line.strip()
+            if candidate:
+                return candidate
+    return None
+
+
+def _read_last_nonempty_line(path: Path) -> str | None:
+    with path.open("rb") as handle:
+        handle.seek(0, os.SEEK_END)
+        file_size = handle.tell()
+        if file_size <= 0:
+            return None
+
+        chunk_size = 8192
+        buffer = b""
+        position = file_size
+        while position > 0:
+            read_size = min(chunk_size, position)
+            position -= read_size
+            handle.seek(position)
+            buffer = handle.read(read_size) + buffer
+            lines = buffer.splitlines()
+            for raw_line in reversed(lines if position == 0 else lines[1:]):  # lines[0] may start mid-row
+                candidate = raw_line.strip()
+                if candidate:
+                    return candidate.decode("utf-8")
+        return None
+
+
+def _validate_rollout_index(
+    payload: Mapping[str, Any],
+    *,
+    path: Path,
+    position: str,
+    rollouts_per_example: int | None,
+    issues: list[str],
+) -> None:
+    rollout_index = _coerce_int(payload.get("rollout_index"))
+    if rollout_index is None:
+        return
+    if rollout_index < 0:
+        issues.append(f"{position} JSONL row in {path} has negative rollout_index={payload.get('rollout_index')!r}")
+        return
+    if rollouts_per_example and rollout_index >= rollouts_per_example:
+        issues.append(
+            f"{position} JSONL row in {path} has out-of-range rollout_index={payload.get('rollout_index')!r}; "
+            f"expected < {rollouts_per_example}"
+        )
+
+
+def _coerce_int(value: Any) -> int | None:
+    if value is None or isinstance(value, bool):
+        return None
+    if isinstance(value, int):
+        return value
+    if isinstance(value, float):
+        if value.is_integer():
+            return int(value)
+        return None
+    if isinstance(value, str):
+        try:
+            return int(value.strip())
+        except ValueError:
+            return None
+    return None
+
+
 def format_validation_issues(issues: Sequence[ManifestValidationIssue]) -> list[str]:
     lines: list[str] = []
     for issue in issues:
diff --git a/medarc_verifiers/cli/hf/__init__.py b/medarc_verifiers/cli/hf/__init__.py
index 11e0a6aa..47009eb4 100644
--- a/medarc_verifiers/cli/hf/__init__.py
+++ b/medarc_verifiers/cli/hf/__init__.py
@@ -3,6 +3,8 @@
 from .sync import (  # noqa: F401
     HFSyncConfig,
     HFSyncSummary,
+    collect_changed_output_files,
+    compute_pending_parquet_uploads,
     download_hf_repo,
     sync_files_to_hub,
     sync_to_hub,
@@ -11,6 +13,8 @@
 __all__ = [
     "HFSyncConfig",
     "HFSyncSummary",
+    "collect_changed_output_files",
+    "compute_pending_parquet_uploads",
     "sync_files_to_hub",
     "sync_to_hub",
     "download_hf_repo",
diff --git a/medarc_verifiers/cli/hf/sync.py b/medarc_verifiers/cli/hf/sync.py
index 9f462f9d..44db7314 100644
--- a/medarc_verifiers/cli/hf/sync.py
+++ b/medarc_verifiers/cli/hf/sync.py
@@ -2,12 +2,15 @@
 
 from __future__ import annotations
 
+import hashlib
 import logging
 import tempfile
 import time
 from dataclasses import dataclass
 from pathlib import Path
-from typing import TYPE_CHECKING, Callable, Sequence
+from typing import TYPE_CHECKING, Any, Callable, Iterable, Sequence
+
+from medarc_verifiers.utils.pathing import resolve_under
 
 if TYPE_CHECKING:
     from medarc_verifiers.cli.process.writer import EnvWriteSummary
@@ -60,6 +63,29 @@ def _is_repo_not_found_error(exc: BaseException) -> bool:
     return False
 
 
+def _status_code_from_exc(exc: BaseException) -> int | None:
+    response = getattr(exc, "response", None)
+    status_code = getattr(response, "status_code", None)
+    if status_code is None:
+        status_code = getattr(exc, "status_code", None)
+    try:
+        return int(status_code) if status_code is not None else None
+    except Exception:
+        return None
+
+
+def _is_transient_hf_error(exc: BaseException) -> bool:
+    status_code = _status_code_from_exc(exc)
+    if status_code == 429 or (status_code is not None and 500 <= status_code < 600):
+        return True
+    try:
+        import httpx  # type: ignore[import-not-found]
+
+        return isinstance(exc, (httpx.TimeoutException, httpx.TransportError))
+    except Exception:
+        return False
+
+
 def _confirm_create_repo(
     *,
     repo_id: str,
@@ -153,6 +179,181 @@ class HFSyncSummary:
     files: Sequence[str]
 
 
+def _local_sha256(path: Path) -> str:
+    digest = hashlib.sha256()
+    with path.open("rb") as handle:
+        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
+
+
+def _repo_tree_entry_path(entry: Any) -> str | None:
+    for attr in ("path", "rfilename"):
+        value = getattr(entry, attr, None)
+        if isinstance(value, str) and value.strip():
+            return Path(value).as_posix()
+    if isinstance(entry, dict):
+        value = entry.get("path") or entry.get("rfilename")
+        if isinstance(value, str) and value.strip():
+            return Path(value).as_posix()
+    return None
+
+
+def _repo_tree_entry_lfs_sha256(entry: Any) -> str | None:
+    lfs = getattr(entry, "lfs", None)
+    if lfs is None and isinstance(entry, dict):
+        lfs = entry.get("lfs")
+    if isinstance(lfs, dict):
+        sha256 = lfs.get("sha256")
+        return str(sha256) if sha256 else None
+    sha256 = getattr(lfs, "sha256", None)
+    return str(sha256) if sha256 else None
+
+
+def _normalize_output_files(output_dir: Path, files: Iterable[str | Path]) -> list[str]:
+    normalized: list[str] = []
+    for path in files:
+        candidate = Path(path)
+        if candidate.is_absolute():
+            try:
+                rel_path = candidate.relative_to(output_dir)
+            except ValueError:
+                continue
+        else:
+            # Accept caller inputs like "runs/processed/foo.parquet" when output_dir is also relative.
+            output_parts = output_dir.parts
+            if output_parts and candidate.parts[: len(output_parts)] == output_parts:
+                try:
+                    rel_path = candidate.relative_to(output_dir)
+                except ValueError:
+                    continue
+            else:
+                rel_path = candidate
+        rel_text = rel_path.as_posix()
+        if rel_text:
+            normalized.append(rel_text)
+    return sorted(set(normalized))
+
+
+def _prepare_upload_file_entries(output_dir: Path, files: Sequence[str | Path]) -> list[tuple[str, Path]]:
+    output_dir = output_dir.resolve()
+    prepared: list[tuple[str, Path]] = []
+    seen: set[str] = set()
+    for path in files:
+        candidate = Path(path)
+        raw_text = candidate.as_posix()
+        if not raw_text:
+            continue
+        if candidate.is_absolute():
+            try:
+                rel_path = candidate.resolve().relative_to(output_dir).as_posix()
+            except ValueError as exc:
+                raise ValueError(f"Upload file path must be under output_dir: {candidate}") from exc
+        else:
+            resolved = resolve_under(output_dir, raw_text)
+            if resolved is None:
+                raise ValueError(f"Upload file path must be relative to output_dir without traversal: {raw_text!r}")
+            try:
+                rel_path = resolved.resolve().relative_to(output_dir).as_posix()
+            except ValueError as exc:
+                raise ValueError(f"Upload file path resolves outside output_dir: {raw_text!r}") from exc
+        local_path = (output_dir / rel_path).resolve()
+        try:
+            local_path.relative_to(output_dir)
+        except ValueError as exc:
+            raise ValueError(f"Upload file path resolves outside output_dir: {raw_text!r}") from exc
+        if rel_path in seen:
+            continue
+        prepared.append((rel_path, local_path))
+        seen.add(rel_path)
+    return prepared
+
+
+def collect_changed_output_files(
+    env_summaries: Sequence[EnvWriteSummary],
+    *,
+    output_dir: Path,
+    metadata_paths: Sequence[Path] | None = None,
+) -> list[str]:
+    changed_paths = {summary.output_path for summary in env_summaries if summary.changed}
+    if metadata_paths:
+        for path in metadata_paths:
+            candidate = Path(path)
+            if not candidate.is_absolute():
+                output_parts = output_dir.parts
+                if output_parts and candidate.parts[: len(output_parts)] != output_parts:
+                    candidate = output_dir / candidate
+            changed_paths.add(candidate)
+    return _normalize_output_files(output_dir, changed_paths)
+
+
+def _collect_changed_output_files(
+    env_summaries: Sequence[EnvWriteSummary],
+    *,
+    output_dir: Path,
+    metadata_paths: Sequence[Path] | None = None,
+) -> list[str]:
+    return collect_changed_output_files(env_summaries, output_dir=output_dir, metadata_paths=metadata_paths)
+
+
+def compute_pending_parquet_uploads(
+    output_dir: Path,
+    repo_id: str,
+    branch: str | None,
+    token: str | None,
+) -> set[str]:
+    """Return output_dir-relative parquet paths missing from the repo or differing from the remote lfs.sha256."""
+    output_dir = Path(output_dir)
+    local_parquets = sorted(path for path in output_dir.rglob("*.parquet") if path.is_file())
+    if not local_parquets:
+        return set()
+
+    try:
+        from huggingface_hub import HfApi  # type: ignore[import-not-found]
+    except Exception as exc:  # noqa: BLE001
+        raise ImportError("huggingface_hub is required for HF upload recovery.") from exc
+
+    api = HfApi(token=token)
+    list_kwargs = {
+        "repo_id": repo_id,
+        "repo_type": "dataset",
+        "revision": branch,
+        "recursive": True,
+        "expand": True,
+    }
+    try:
+        try:
+            tree_entries = list(api.list_repo_tree(**list_kwargs))
+        except TypeError as exc:
+            if "expand" not in str(exc):
+                raise
+            list_kwargs.pop("expand", None)
+            tree_entries = list(api.list_repo_tree(**list_kwargs))
+    except Exception as exc:  # noqa: BLE001
+        if _is_repo_not_found_error(exc):
+            tree_entries = []
+        else:
+            raise
+
+    remote_parquets: dict[str, str | None] = {}
+    for entry in tree_entries:
+        rel_path = _repo_tree_entry_path(entry)
+        if not rel_path or not rel_path.endswith(".parquet"):
+            continue
+        remote_parquets[rel_path] = _repo_tree_entry_lfs_sha256(entry)
+
+    pending: set[str] = set()
+    for parquet_path in local_parquets:
+        rel_path = parquet_path.relative_to(output_dir).as_posix()
+        if rel_path not in remote_parquets:
+            pending.add(rel_path)
+            continue
+        remote_sha256 = remote_parquets[rel_path]
+        if remote_sha256 is None or remote_sha256 != _local_sha256(parquet_path):
+            pending.add(rel_path)
+    return pending
+
+
 def sync_files_to_hub(
     *,
     repo_id: str,
@@ -166,25 +367,27 @@ def sync_files_to_hub(
     request_timeout_s: float | None = None,
     retries: int = 3,
     max_files_per_commit: int | None = None,
+    path_in_repo_prefix: str | None = None,
     is_tty: bool = False,
     assume_yes: bool = False,
     prompt_func: Callable[[str], str] | None = None,
-) -> None:
-    """Upload explicit file paths from output_dir to a HF dataset repo."""
+) -> bool:
+    """Upload explicit file paths from output_dir to a HF dataset repo.
+
+    Returns False only when upload is skipped because repo creation was declined.
+    """
     if not repo_id:
         logger.debug("HF sync skipped: no repo_id provided.")
-        return
-    file_list = []
-    for path in files:
-        rel_path = Path(path).as_posix() if not isinstance(path, str) else Path(path).as_posix()
-        if rel_path:
-            file_list.append(rel_path)
+        return True
+    output_dir = Path(output_dir)
+    prepared_files = _prepare_upload_file_entries(output_dir, files)
+    file_list = [rel_path for rel_path, _ in prepared_files]
     if not file_list:
         logger.debug("HF sync skipped: no files provided.")
-        return
+        return True
     if dry_run:
         logger.debug("HF sync dry-run; skipping push.")
-        return
+        return True
 
     try:
         from huggingface_hub import CommitOperationAdd, HfApi  # type: ignore[import-not-found]
@@ -195,6 +398,9 @@ def sync_files_to_hub(
         _configure_hf_http_timeout(float(request_timeout_s))
 
     api = HfApi(token=token)
+    repo_prefix = _normalize_repo_path_prefix(path_in_repo_prefix)
+
+    file_map = dict(prepared_files)
 
     if max_files_per_commit is None or max_files_per_commit <= 0:
         batches = [file_list]
@@ -203,11 +409,12 @@ def sync_files_to_hub(
             file_list[index : index + max_files_per_commit] for index in range(0, len(file_list), max_files_per_commit)
         ]
 
-    output_dir = Path(output_dir)
-
     for batch_index, batch_files in enumerate(batches, start=1):
         operations = [
-            CommitOperationAdd(path_in_repo=rel_path, path_or_fileobj=str(output_dir / rel_path))
+            CommitOperationAdd(
+                path_in_repo=_join_repo_path(repo_prefix, rel_path),
+                path_or_fileobj=str(file_map[rel_path]),
+            )
             for rel_path in batch_files
         ]
         commit_message = message
@@ -234,9 +441,11 @@ def sync_files_to_hub(
                         prompt_func=prompt_func,
                     )
                     if not should_create:
-                        raise RuntimeError(
-                            f"HF dataset repo '{repo_id}' not found. Create it on the Hub or re-run with --yes to allow creation."
-                        ) from exc
+                        logger.warning(
+                            "HF dataset repo '%s' not found; skipping upload because repo creation was declined.",
+                            repo_id,
+                        )
+                        return False
                     api.create_repo(
                         repo_id=repo_id,
                         repo_type="dataset",
@@ -245,13 +454,7 @@ def sync_files_to_hub(
                     )
                     # Retry the commit immediately after repo creation.
                     continue
-                try:
-                    import httpx  # type: ignore[import-not-found]
-
-                    is_retryable = isinstance(exc, (httpx.TimeoutException, httpx.TransportError))
-                except Exception:
-                    is_retryable = False
-                if not is_retryable or attempt >= int(retries):
+                if not _is_transient_hf_error(exc) or attempt >= int(retries):
                     raise
                 delay = _sleep_backoff_seconds(attempt)
                 logger.warning(
@@ -262,6 +465,27 @@ def sync_files_to_hub(
                     delay,
                 )
                 time.sleep(delay)
+    return True
+
+
+def _normalize_repo_path_prefix(value: str | None) -> str | None:
+    if value is None:
+        return None
+    raw = str(value).strip().replace("\\", "/").strip("/")
+    if not raw:
+        return None
+    candidate = resolve_under(Path("."), raw)
+    if candidate is None:
+        raise ValueError(f"Invalid path_in_repo_prefix: {value!r}")
+    normalized = "/".join(part for part in candidate.parts if part not in (".", "/"))  # keep leading-dot names
+    return normalized or None
+
+
+def _join_repo_path(prefix: str | None, rel_path: str) -> str:
+    rel = rel_path.strip().replace("\\", "/").lstrip("/")
+    if not prefix:
+        return rel
+    return f"{prefix}/{rel}" if rel else prefix
 
 
 def sync_to_hub(
@@ -270,6 +494,7 @@ def sync_to_hub(
     *,
     output_dir: Path,
     metadata_paths: Sequence[Path] | None = None,
+    files: Sequence[str | Path] | None = None,
     is_tty: bool = False,
     assume_yes: bool = False,
     prompt_func: Callable[[str], str] | None = None,
@@ -278,37 +503,27 @@ def sync_to_hub(
     if not config.repo_id:
         logger.debug("HF sync skipped: no repo_id provided.")
         return None
-    if not env_summaries:
-        logger.debug("HF sync skipped: no environment summaries available.")
-        return None
-    if all(summary.dry_run for summary in env_summaries):
-        logger.debug("HF sync skipped: only dry-run summaries available.")
+    if config.dry_run:
+        logger.debug("HF sync dry-run; skipping summary generation and upload.")
         return None
 
+    output_dir = Path(output_dir)
     changed = [summary for summary in env_summaries if summary.changed]
-    if not changed:
-        logger.debug("HF sync skipped: no changed outputs.")
-        return None
+    if files is None:
+        if not env_summaries:
+            logger.debug("HF sync skipped: no environment summaries available.")
+            return None
+        if all(summary.dry_run for summary in env_summaries):
+            logger.debug("HF sync skipped: only dry-run summaries available.")
+            return None
+        files = collect_changed_output_files(env_summaries, output_dir=output_dir, metadata_paths=metadata_paths)
+    else:
+        files = _normalize_output_files(output_dir, files)
 
-    output_dir = Path(output_dir)
-    changed_paths = {summary.output_path for summary in changed}
-    if metadata_paths:
-        for path in metadata_paths:
-            candidate = Path(path)
-            if not candidate.is_absolute():
-                output_parts = output_dir.parts
-                if output_parts and candidate.parts[: len(output_parts)] != output_parts:
-                    candidate = output_dir / candidate
-            changed_paths.add(candidate)
+    if not files:
+        logger.debug("HF sync skipped: no files selected for upload.")
+        return None
 
-    files = []
-    for path in changed_paths:
-        try:
-            rel_path = path.relative_to(output_dir)
-        except ValueError:
-            continue
-        files.append(rel_path.as_posix())
-    files = sorted(set(files))
     summary = HFSyncSummary(
         repo_id=config.repo_id,
         strategy="file",
@@ -318,7 +533,7 @@ def sync_to_hub(
     )
 
     message = f"Update {summary.total_files} file(s) from medarc-eval process"
-    sync_files_to_hub(
+    uploaded = sync_files_to_hub(
         repo_id=config.repo_id,
         output_dir=output_dir,
         files=files,
@@ -334,6 +549,8 @@ def sync_to_hub(
         assume_yes=assume_yes,
         prompt_func=prompt_func,
     )
+    if not uploaded:
+        return None
     return summary
 
 
@@ -383,6 +600,8 @@ def download_hf_repo(
 __all__ = [
     "HFSyncSummary",
     "HFSyncConfig",
+    "collect_changed_output_files",
+    "compute_pending_parquet_uploads",
     "sync_files_to_hub",
     "sync_to_hub",
 ]
diff --git a/medarc_verifiers/cli/main.py b/medarc_verifiers/cli/main.py
index 72d0a20e..97ca6e50 100644
--- a/medarc_verifiers/cli/main.py
+++ b/medarc_verifiers/cli/main.py
@@ -4,6 +4,7 @@
 
 import argparse
 import logging
+import os
 import sys
 from pathlib import Path
 from textwrap import dedent
@@ -25,7 +26,6 @@
     DEFAULT_ENV_DIR,
     DEFAULT_PROCESSED_DIR,
     DEFAULT_RUNS_RAW_DIR,
-    DEFAULT_WINRATE_DIR,
     PROCESS_COMMAND,
     WINRATE_COMMAND,
 )
@@ -33,11 +33,10 @@
 from medarc_verifiers.cli._job_executor import ExecutorSettings, JobExecutionResult, execute_jobs
 from medarc_verifiers.cli._manifest import MANIFEST_FILENAME, ManifestJobEntry, RunManifest, compute_snapshot_checksum
 from medarc_verifiers.cli._manifest_planner import ManifestPlanner
-from medarc_verifiers.cli._manifest_tools import format_validation_issues, validate_manifests_in_runs
 from medarc_verifiers.cli._schemas import EnvironmentConfigSchema, EnvironmentExportConfig
 from medarc_verifiers.cli._single_run import run_single_mode
 from medarc_verifiers.cli.hf import HFSyncConfig, sync_files_to_hub
-from medarc_verifiers.cli.process import ProcessOptions, ProcessResult, run_process
+from medarc_verifiers.cli.process import PROCESS_DEFAULT_STATUS_FILTER, ProcessOptions, ProcessResult, run_process
 from medarc_verifiers.cli.utils.config_io import load_mapping_file
 from medarc_verifiers.cli.utils.overrides import build_cli_override
 from medarc_verifiers.cli.utils.shared import (
@@ -47,6 +46,7 @@
     slugify,
     validate_simple_name,
 )
+from medarc_verifiers.utils.pathing import resolve_under
 from medarc_verifiers.cli.winrate import (
     WinrateConfig,
     _resolve_source,
@@ -287,29 +287,33 @@ def build_process_parser() -> argparse.ArgumentParser:
     parser.add_argument("--processed-at", default=None, help="Override processed_at timestamp (ISO8601).")
     parser.add_argument("--dry-run", action="store_true", default=None, help="Plan processing without writing outputs.")
     parser.add_argument(
-        "--validate-manifest",
-        action=argparse.BooleanOptionalAction,
+        "--replace-model",
+        action="append",
         default=None,
-        help="Validate run manifests before processing (default: enabled).",
+        help="Rebuild existing processed outputs for these model ids (repeatable; comma-separated values allowed).",
     )
     parser.add_argument(
-        "--strict-manifest",
-        action="store_true",
+        "--replace-env",
+        action="append",
         default=None,
-        help="Treat manifest validation problems as errors.",
+        help="Rebuild existing processed outputs for these env ids (repeatable; comma-separated values allowed).",
     )
     parser.add_argument(
-        "--process-incomplete",
-        dest="process_incomplete",
-        action="store_true",
+        "--max-results-missing-pct",
+        type=float,
         default=None,
-        help="Include runs where run_manifest.json summary has completed < total.",
+        help=(
+            "Fail if a selected job is missing more than this percentage of its expected results.jsonl rows, "
+            "as derived from the manifest job fields (row_count, num_examples, rollouts_per_example). "
+            "The check is applied per job record on the latest selected run only; it does not use "
+            "manifest summary.completed/summary.total and does not fall back to older runs (default: 2.5)."
+        ),
     )
     parser.add_argument(
         "--winrate",
         type=Path,
         default=None,
-        help="Run winrate after processing using the provided winrate config file.",
+        help="Run winrate after processing using the provided config file. If omitted, an embedded winrate section in --config is used.",
     )
     parser.add_argument(
         "--max-workers",
@@ -321,7 +325,7 @@ def build_process_parser() -> argparse.ArgumentParser:
     parser.add_argument("--hf-repo", default=None, help="Hugging Face repo id for dataset sync.")
     parser.add_argument(
         "--hf-pull-policy",
-        choices=("prompt", "pull", "clean"),
+        choices=("prompt", "pull", "clean", "continue-upload"),
         default=None,
         help="Baseline policy when output dir is non-empty in HF mode.",
     )
@@ -376,7 +380,7 @@ def build_winrate_parser() -> argparse.ArgumentParser:
         "--output-dir",
         type=Path,
         default=None,
-        help=f"Directory to store winrate outputs (default: {DEFAULT_WINRATE_DIR}).",
+        help="Directory to store winrate outputs (default: <processed-dir>/winrate).",
     )
     parser.add_argument(
         "--output",
@@ -465,7 +469,7 @@ def build_winrate_parser() -> argparse.ArgumentParser:
             "per-model uses the legacy behavior where each model may be averaged over a different dataset set."
         ),
     )
-    parser.add_argument("--hf-processed-repo", help="Hugging Face repo id for processed dataset download.")
+    parser.add_argument("--hf-repo", help="Hugging Face repo id used for processed download and winrate upload.")
     parser.add_argument(
         "--hf-processed-pull",
         action="store_true",
@@ -474,7 +478,11 @@ def build_winrate_parser() -> argparse.ArgumentParser:
     )
     parser.add_argument("--hf-branch", help="Target HF branch or revision for processed download.")
     parser.add_argument("--hf-token", help="Auth token for HF operations.")
-    parser.add_argument("--hf-winrate-repo", help="Hugging Face repo id for winrate artifact upload.")
+    parser.add_argument(
+        "--hf-winrate-dir",
+        default=None,
+        help="Path under the HF repo where winrate artifacts are uploaded (default: winrate).",
+    )
     parser.add_argument(
         "--hf-private",
         action=argparse.BooleanOptionalAction,
@@ -561,33 +569,61 @@ def _run_batch_mode(argv: Sequence[str]) -> int:
 
 
 def _run_process_mode(argv: Sequence[str]) -> int:
+    parser, args = _resolve_process_args(argv)
+    winrate_args = _resolve_embedded_winrate(args, parser=parser)
+
+    try:
+        env_export_map = _load_env_export_map(args.env_config_root)
+    except Exception as exc:  # noqa: BLE001
+        logger.warning("Failed to load environment export configs: %s", exc)
+        env_export_map = {}
+
+    options = _build_process_options(args)
+
+    try:
+        result = run_process(options, env_export_map=env_export_map)
+    except Exception as exc:  # noqa: BLE001
+        logger.exception("Process pipeline failed: %s", exc)
+        return 1
+
+    _log_process_result(result)
+    return _run_process_post_steps(args, parser=parser, options=options, winrate_args=winrate_args)
+
+
+def _resolve_process_args(argv: Sequence[str]) -> tuple[argparse.ArgumentParser, argparse.Namespace]:
     parser = build_process_parser()
     args = parser.parse_args(argv)
 
     if args.config:
         _load_and_apply_config(args, args.config, mode="process", parser=parser)
     _finalize_config_args(args, mode="process")
+    _validate_process_args(args, argv=argv, parser=parser)
+    return parser, args
+
+
+def _validate_process_args(
+    args: argparse.Namespace,
+    *,
+    argv: Sequence[str],
+    parser: argparse.ArgumentParser,
+) -> None:
+    for flag, attr in (("--replace-model", "replace_model"), ("--replace-env", "replace_env")):
+        if _option_was_provided(argv, flag) and not getattr(args, attr, None):
+            parser.error(f"{flag} requires at least one non-empty value.")
     try:
         if args.exclude_dataset:
             normalize_dataset_ids(args.exclude_dataset, label="process exclude dataset")
         if args.exclude_model:
             normalize_model_ids(args.exclude_model, label="process exclude model")
+        if args.max_results_missing_pct is not None:
+            value = float(args.max_results_missing_pct)
+            if value < 0:
+                parser.error("--max-results-missing-pct must be non-negative.")
     except ValueError as exc:
         parser.error(str(exc))
-    winrate_args: argparse.Namespace | None = None
-    if args.winrate:
-        winrate_path = Path(args.winrate).expanduser()
-        if not winrate_path.exists():
-            parser.error(f"Winrate config path '{winrate_path}' does not exist.")
-        args.winrate = winrate_path
-        winrate_args = _build_winrate_args_from_config(winrate_path, parser=parser)
 
-    try:
-        env_export_map = _load_env_export_map(args.env_config_root)
-    except Exception as exc:  # noqa: BLE001
-        logger.warning("Failed to load environment export configs: %s", exc)
-        env_export_map = {}
 
+def _build_process_options(args: argparse.Namespace) -> ProcessOptions:
     hf_config = HFSyncConfig.from_cli(
         repo=args.hf_repo,
         branch=args.hf_branch,
@@ -598,16 +634,18 @@ def _run_process_mode(argv: Sequence[str]) -> int:
         retries=args.hf_retries,
         max_files_per_commit=args.hf_max_files_per_commit,
     )
-
+    status_values = list(args.status or [])
+    status_filter = tuple(status_values) if status_values else PROCESS_DEFAULT_STATUS_FILTER
+    max_results_missing_pct = float(args.max_results_missing_pct) if args.max_results_missing_pct is not None else 2.5
     processed_with_args = {
-        "status": args.status or [],
+        "status": list(status_filter),
+        "max_results_missing_pct": max_results_missing_pct,
         "exclude_datasets": args.exclude_dataset or [],
         "exclude_models": args.exclude_model or [],
+        "replace_models": args.replace_model or [],
+        "replace_envs": args.replace_env or [],
         "dry_run": bool(args.dry_run),
         "clean": bool(args.clean),
-        "validate_manifest": bool(args.validate_manifest),
-        "strict_manifest": bool(args.strict_manifest),
-        "only_complete_runs": not bool(args.process_incomplete),
         "hf_repo": args.hf_repo,
         "hf_pull_policy": args.hf_pull_policy,
         "hf_request_timeout": args.hf_request_timeout,
@@ -615,16 +653,17 @@ def _run_process_mode(argv: Sequence[str]) -> int:
         "hf_max_files_per_commit": args.hf_max_files_per_commit,
         "max_workers": args.max_workers,
     }
-
-    options = ProcessOptions(
+    return ProcessOptions(
         runs_dir=args.runs_dir,
         output_dir=args.output_dir,
         exclude_datasets=tuple(args.exclude_dataset or ()),
         exclude_models=tuple(args.exclude_model or ()),
+        replace_models=tuple(args.replace_model or ()),
+        replace_envs=tuple(args.replace_env or ()),
         processed_at=args.processed_at,
         processed_with_args=processed_with_args,
-        status_filter=args.status or (),
-        only_complete_runs=not bool(args.process_incomplete),
+        status_filter=status_filter,
+        max_results_missing_pct=max_results_missing_pct,
         dry_run=bool(args.dry_run),
         clean=bool(args.clean),
         assume_yes=bool(args.yes),
@@ -633,80 +672,94 @@ def _run_process_mode(argv: Sequence[str]) -> int:
         max_workers=args.max_workers,
     )
 
-    if args.validate_manifest:
-        validation = validate_manifests_in_runs(options.runs_dir, strict=bool(args.strict_manifest))
-        for line in format_validation_issues(validation.issues):
-            if line.startswith("[ERROR]"):
-                logger.error("%s", line)
-            else:
-                logger.warning("%s", line)
-        logger.info(
-            "Manifest preflight: checked %d manifest(s), %d job(s), %d issue(s).",
-            validation.manifests_checked,
-            validation.jobs_checked,
-            len(validation.issues),
-        )
-        if validation.has_errors:
-            logger.error("Manifest validation failed in strict mode; aborting process.")
-            return 1
 
+def _resolve_embedded_winrate(
+    args: argparse.Namespace,
+    *,
+    parser: argparse.ArgumentParser,
+) -> argparse.Namespace | None:
+    embedded_winrate = False
+    if args.config and args.winrate is None:
+        try:
+            embedded_winrate = _config_has_embedded_winrate(Path(args.config).expanduser())
+        except (FileNotFoundError, ValueError) as exc:
+            parser.error(str(exc))
+
+    if args.winrate:
+        winrate_path = Path(args.winrate).expanduser()
+        if not winrate_path.exists():
+            parser.error(f"Winrate config path '{winrate_path}' does not exist.")
+        args.winrate = winrate_path
+        return _build_winrate_args_from_config(winrate_path, parser=parser)
+
+    if embedded_winrate:
+        args.winrate = Path(args.config).expanduser()
+        return _build_winrate_args_from_config(Path(args.config).expanduser(), parser=parser)
+    return None
+
+
+def _run_process_post_steps(
+    args: argparse.Namespace,
+    *,
+    parser: argparse.ArgumentParser,
+    options: ProcessOptions,
+    winrate_args: argparse.Namespace | None,
+) -> int:
+    if not args.winrate:
+        return 0
+    if options.dry_run:
+        logger.info("Skipping winrate post-step for dry-run process.")
+        return 0
+
+    if winrate_args is None:
+        winrate_args = _build_winrate_args_from_config(Path(args.winrate), parser=parser)
+    winrate_args.processed_dir = options.output_dir
+    if not getattr(winrate_args, "_output_dir_explicit", False):
+        winrate_args.output_dir = _default_winrate_output_dir(options.output_dir)
+    winrate_args.hf_repo = None
+    winrate_args.hf_processed_pull = False
+
+    winrate_cfg = WinrateConfig(
+        missing_policy=winrate_args.missing_policy,
+        epsilon=winrate_args.epsilon,
+        min_common=winrate_args.min_common,
+        weight_policy=winrate_args.weight_policy,
+        weight_cap=winrate_args.weight_cap,
+        dataset_coverage=winrate_args.dataset_coverage,
+        include_models=tuple(winrate_args.include_model or ()),
+        exclude_models=tuple(winrate_args.exclude_model or ()),
+        exclude_datasets=tuple(winrate_args.exclude_dataset or ()),
+        partial_datasets=winrate_args.partial_datasets,
+    )
     try:
-        result = run_process(options, env_export_map=env_export_map)
+        winrate_result = run_winrate(
+            processed_dir=options.output_dir,
+            output_dir=winrate_args.output_dir,
+            output_path=winrate_args.output,
+            output_name=winrate_args.output_name,
+            config=winrate_cfg,
+            processed_at=winrate_args.processed_at,
+            hf_config=None,
+            hf_processed_pull=False,
+        )
     except Exception as exc:  # noqa: BLE001
-        logger.exception("Process pipeline failed: %s", exc)
+        logger.exception("Win rate computation failed: %s", exc)
         return 1
 
-    _log_process_result(result)
+    logger.info("Computed win rates for %d dataset(s): %s", len(winrate_result.datasets), winrate_result.output_path)
+    print_winrate_summary_markdown(winrate_result.result)
 
-    if args.winrate:
-        if options.dry_run:
-            logger.info("Skipping winrate post-step for dry-run process.")
-            return 0
-        if winrate_args is None:
-            winrate_args = _build_winrate_args_from_config(Path(args.winrate), parser=parser)
-        winrate_args.processed_dir = options.output_dir
-        winrate_args.hf_processed_repo = None
-        winrate_args.hf_processed_pull = False
-        winrate_cfg = WinrateConfig(
-            missing_policy=winrate_args.missing_policy,
-            epsilon=winrate_args.epsilon,
-            min_common=winrate_args.min_common,
-            weight_policy=winrate_args.weight_policy,
-            weight_cap=winrate_args.weight_cap,
-            dataset_coverage=winrate_args.dataset_coverage,
-            include_models=tuple(winrate_args.include_model or ()),
-            exclude_models=tuple(winrate_args.exclude_model or ()),
-            exclude_datasets=tuple(winrate_args.exclude_dataset or ()),
-            partial_datasets=winrate_args.partial_datasets,
-        )
-        try:
-            winrate_result = run_winrate(
-                processed_dir=options.output_dir,
-                output_dir=winrate_args.output_dir,
-                output_path=winrate_args.output,
-                output_name=winrate_args.output_name,
-                config=winrate_cfg,
-                processed_at=winrate_args.processed_at,
-                hf_config=None,
-                hf_processed_pull=False,
-            )
-        except Exception as exc:  # noqa: BLE001
-            logger.exception("Win rate computation failed: %s", exc)
-            return 1
-        logger.info(
-            "Computed win rates for %d dataset(s): %s", len(winrate_result.datasets), winrate_result.output_path
+    if options.hf_config and options.hf_config.repo_id:
+        _upload_winrate_outputs(
+            output_dir=winrate_args.output_dir,
+            output_paths=winrate_result.output_paths,
+            repo_id=options.hf_config.repo_id,
+            token=options.hf_config.token,
+            branch=options.hf_config.branch,
+            private=bool(options.hf_config.private),
+            winrate_dir=winrate_args.hf_winrate_dir,
+            assume_yes=bool(args.yes),
         )
-        print_winrate_summary_markdown(winrate_result.result)
-        if winrate_args.hf_winrate_repo:
-            _upload_winrate_outputs(
-                output_dir=winrate_args.output_dir,
-                output_paths=winrate_result.output_paths,
-                repo_id=winrate_args.hf_winrate_repo,
-                token=winrate_args.hf_token,
-                private=bool(winrate_args.hf_private),
-                assume_yes=bool(args.yes),
-            )
-
     return 0
 
 
@@ -744,18 +797,199 @@ def _set_if_unset(args: argparse.Namespace, attr: str, value: Any) -> None:
         setattr(args, attr, value)
 
 
+def _resolve_config_string_value(key: str, value: Any) -> str:
+    resolved = str(value)
+    if key != "hf_token":
+        return resolved
+
+    trimmed = resolved.strip()
+    env_var: str | None = None
+    if trimmed.startswith("${") and trimmed.endswith("}") and len(trimmed) > 3:
+        env_var = trimmed[2:-1].strip()
+    elif trimmed.startswith("$") and len(trimmed) > 1:
+        env_var = trimmed[1:].strip()
+
+    if not env_var:
+        return resolved
+
+    env_value = os.getenv(env_var)
+    if env_value is None:
+        raise ValueError(f"Config field 'hf.token' references unset environment variable '{env_var}'.")
+    return env_value
+
+
 def _load_config_payload(path: Path, *, mode: Literal["process", "winrate"]) -> dict[str, Any]:
     label = "Process config" if mode == "process" else "Winrate config"
-    return dict(load_mapping_file(path, label=label))
+    raw_payload = dict(load_mapping_file(path, label=label))
+    if mode == "process":
+        _reject_removed_process_config_keys(raw_payload)
+    return _expand_embedded_pipeline_config(raw_payload, mode=mode)
+
+
+def _reject_removed_process_config_keys(payload: Mapping[str, Any]) -> None:
+    if "max_run_missing_pct" in payload:
+        raise ValueError("Process config field 'max_run_missing_pct' was removed; use 'max_results_missing_pct'.")
+    process_section = payload.get("process")
+    if isinstance(process_section, Mapping) and "max_run_missing_pct" in process_section:
+        raise ValueError(
+            "Process config field 'process.max_run_missing_pct' was removed; use 'process.max_results_missing_pct'."
+        )
+
+
+def _expand_embedded_pipeline_config(payload: dict[str, Any], *, mode: Literal["process", "winrate"]) -> dict[str, Any]:
+    expanded = dict(payload)
+    process_section = payload.get("process")
+    if isinstance(process_section, Mapping):
+        _merge_process_section(expanded, process_section, mode=mode)
+
+    process_output_dir = _resolve_processed_dir_from_payload(expanded, mode=mode)
+
+    winrate_section = payload.get("winrate")
+    if isinstance(winrate_section, Mapping):
+        if mode == "process":
+            expanded.pop("winrate", None)
+        if mode == "winrate":
+            _merge_winrate_section(expanded, winrate_section, process_output_dir=process_output_dir)
+    elif isinstance(winrate_section, bool) and mode == "process":
+        expanded.pop("winrate", None)
+
+    if mode == "winrate" and "processed_dir" not in expanded and process_output_dir is not None:
+        expanded["processed_dir"] = process_output_dir
+
+    return expanded
+
+
+def _merge_process_section(
+    expanded: dict[str, Any],
+    process_section: Mapping[str, Any],
+    *,
+    mode: Literal["process", "winrate"],
+) -> None:
+    resolved = None
+    if "dir" in process_section:
+        resolved = _resolve_process_dir_value(process_section["dir"], runs_dir=expanded.get("runs_dir"))
+        if mode == "process" and "output_dir" not in expanded and resolved is not None:
+            expanded["output_dir"] = resolved
+        if mode == "winrate" and "processed_dir" not in expanded and resolved is not None:
+            expanded["processed_dir"] = resolved
+    if mode == "winrate" and "processed_dir" not in expanded and "output_dir" in process_section:
+        expanded["processed_dir"] = process_section["output_dir"]
+    key_map = {"runs_dir": "runs_dir"}
+    if mode == "process":
+        key_map.update(
+            {
+                "output_dir": "output_dir",
+                "env_config_root": "env_config_root",
+                "processed_at": "processed_at",
+                "status": "status",
+                "exclude_datasets": "exclude_datasets",
+                "exclude_models": "exclude_models",
+                "replace_models": "replace_models",
+                "replace_envs": "replace_envs",
+                "dry_run": "dry_run",
+                "clean": "clean",
+                "yes": "yes",
+                "max_workers": "max_workers",
+                "max_results_missing_pct": "max_results_missing_pct",
+            }
+        )
+    for key, target in key_map.items():
+        if key in process_section and target not in expanded:
+            expanded[target] = process_section[key]
+
+
+def _merge_winrate_section(
+    expanded: dict[str, Any],
+    winrate_section: Mapping[str, Any],
+    *,
+    process_output_dir: Path | None,
+) -> None:
+    if "dir" in winrate_section and "output_dir" not in expanded:
+        resolved = _resolve_winrate_dir_value(winrate_section["dir"], process_output_dir=process_output_dir)
+        if resolved is not None:
+            expanded["output_dir"] = resolved
+    key_map = {
+        "processed_dir": "processed_dir",
+        "output_dir": "output_dir",
+        "output_name": "output_name",
+        "processed_at": "processed_at",
+        "missing_policy": "missing_policy",
+        "epsilon": "epsilon",
+        "min_common": "min_common",
+        "weight_policy": "weight_policy",
+        "weight_cap": "weight_cap",
+        "dataset_coverage": "dataset_coverage",
+        "include_model": "include_models",
+        "include_models": "include_models",
+        "exclude_model": "exclude_models",
+        "exclude_models": "exclude_models",
+        "exclude_dataset": "exclude_datasets",
+        "exclude_datasets": "exclude_datasets",
+        "partial_datasets": "partial_datasets",
+        "hf_processed_pull": "hf_processed_pull",
+        "hf_winrate_dir": "hf_winrate_dir",
+    }
+    for key, target in key_map.items():
+        if key in winrate_section and target not in expanded:
+            expanded[target] = winrate_section[key]
+
+
+def _resolve_processed_dir_from_payload(
+    payload: Mapping[str, Any], *, mode: Literal["process", "winrate"]
+) -> Path | None:
+    if "processed_dir" in payload and payload["processed_dir"] is not None:
+        return Path(str(payload["processed_dir"]))
+    if mode == "process" and "output_dir" in payload and payload["output_dir"] is not None:
+        return Path(str(payload["output_dir"]))
+    process_section = payload.get("process")
+    if isinstance(process_section, Mapping) and "dir" in process_section:
+        return _resolve_process_dir_value(process_section["dir"], runs_dir=payload.get("runs_dir"))
+    return None
+
+
+def _resolve_process_dir_value(value: Any, *, runs_dir: Any | None) -> Path | None:
+    raw = str(value).strip()
+    if not raw:
+        return None
+    candidate = Path(raw)
+    if candidate.is_absolute():
+        return candidate
+    runs_base = Path(str(runs_dir)).parent if runs_dir is not None else DEFAULT_RUNS_RAW_DIR.parent
+    return runs_base / candidate
+
+
+def _resolve_winrate_dir_value(value: Any, *, process_output_dir: Path | None) -> Path | None:
+    raw = str(value).strip()
+    if not raw:
+        return None
+    candidate = Path(raw)
+    if candidate.is_absolute():
+        return candidate
+    base = process_output_dir if process_output_dir is not None else DEFAULT_PROCESSED_DIR
+    return base / candidate
+
+
+def _config_has_embedded_winrate(path: Path) -> bool:
+    payload = dict(load_mapping_file(path, label="Process config"))
+    winrate_payload = payload.get("winrate")
+    if isinstance(winrate_payload, Mapping):
+        return bool(winrate_payload.get("enabled", True))
+    return bool(winrate_payload) if isinstance(winrate_payload, bool) else False
 
 
 def _normalize_mode_payload(payload: dict[str, Any], *, mode: Literal["process", "winrate"]) -> None:
+    if mode == "winrate":
+        if "hf_processed_repo" in payload and "hf_repo" not in payload:
+            payload["hf_repo"] = payload["hf_processed_repo"]
+        if "hf_winrate_repo" in payload:
+            raise ValueError("Winrate config field 'hf_winrate_repo' was removed; use 'hf.repo' and 'hf.winrate_dir'.")
+
     hf_payload = payload.get("hf")
     if isinstance(hf_payload, Mapping):
         for key, value in hf_payload.items():
             if mode == "winrate":
                 if key == "repo":
-                    payload.setdefault("hf_processed_repo", value)
+                    payload.setdefault("hf_repo", value)
                     continue
                 if key == "branch":
                     payload.setdefault("hf_branch", value)
@@ -766,6 +1000,10 @@ def _normalize_mode_payload(payload: dict[str, Any], *, mode: Literal["process",
                 if key == "private":
                     payload.setdefault("hf_private", value)
                     continue
+                if key == "winrate_repo":
+                    raise ValueError(
+                        "Winrate config field 'hf.winrate_repo' was removed; use 'hf.repo' and 'hf.winrate_dir'."
+                    )
             payload.setdefault(f"hf_{key}", value)
 
     if "exclude_datasets" not in payload and "exclude_dataset" in payload:
@@ -783,9 +1021,9 @@ def _load_and_apply_config(
 ) -> None:
     try:
         payload = _load_config_payload(path, mode=mode)
+        _normalize_mode_payload(payload, mode=mode)
     except (FileNotFoundError, ValueError) as exc:
         parser.error(str(exc))
-    _normalize_mode_payload(payload, mode=mode)
 
     path_fields = {
         "process": {
@@ -811,8 +1049,8 @@ def _load_and_apply_config(
             "weight_policy": "weight_policy",
             "partial_datasets": "partial_datasets",
             "dataset_coverage": "dataset_coverage",
-            "hf_processed_repo": "hf_processed_repo",
-            "hf_winrate_repo": "hf_winrate_repo",
+            "hf_repo": "hf_repo",
+            "hf_winrate_dir": "hf_winrate_dir",
             "hf_branch": "hf_branch",
             "hf_token": "hf_token",
         },
@@ -822,9 +1060,6 @@ def _load_and_apply_config(
             "dry_run": "dry_run",
             "clean": "clean",
             "yes": "yes",
-            "process_incomplete": "process_incomplete",
-            "validate_manifest": "validate_manifest",
-            "strict_manifest": "strict_manifest",
             "hf_private": "hf_private",
         },
         "winrate": {"hf_processed_pull": "hf_processed_pull", "hf_private": "hf_private"},
@@ -838,11 +1073,20 @@ def _load_and_apply_config(
         "winrate": {"min_common": "min_common", "weight_cap": "weight_cap"},
     }[mode]
     float_fields = {
-        "process": {"hf_request_timeout": "hf_request_timeout"},
+        "process": {
+            "hf_request_timeout": "hf_request_timeout",
+            "max_results_missing_pct": "max_results_missing_pct",
+        },
         "winrate": {"epsilon": "epsilon"},
     }[mode]
     repeatable_fields = {
-        "process": {"status": "status", "exclude_datasets": "exclude_dataset", "exclude_models": "exclude_model"},
+        "process": {
+            "status": "status",
+            "exclude_datasets": "exclude_dataset",
+            "exclude_models": "exclude_model",
+            "replace_models": "replace_model",
+            "replace_envs": "replace_env",
+        },
         "winrate": {
             "include_models": "include_model",
             "exclude_models": "exclude_model",
@@ -855,7 +1099,11 @@ def _load_and_apply_config(
             _set_if_unset(args, attr, Path(str(payload[key])))
     for key, attr in string_fields.items():
         if key in payload and _is_unset(args, attr):
-            _set_if_unset(args, attr, str(payload[key]))
+            try:
+                resolved = _resolve_config_string_value(key, payload[key])
+            except ValueError as exc:
+                parser.error(str(exc))
+            _set_if_unset(args, attr, resolved)
     for key, attr in boolean_fields.items():
         if key in payload and _is_unset(args, attr):
             _set_if_unset(args, attr, bool(payload[key]))
@@ -891,14 +1139,15 @@ def _build_winrate_args_from_config(path: Path, *, parser: argparse.ArgumentPars
         exclude_model=None,
         exclude_dataset=None,
         partial_datasets=None,
-        hf_processed_repo=None,
+        hf_repo=None,
         hf_processed_pull=None,
-        hf_winrate_repo=None,
+        hf_winrate_dir=None,
         hf_branch=None,
         hf_token=None,
         hf_private=None,
     )
     _load_and_apply_config(args, path, mode="winrate", parser=parser)
+    args._output_dir_explicit = args.output_dir is not None
     _finalize_config_args(args, mode="winrate")
     return args
 
@@ -915,15 +1164,14 @@ def _finalize_config_args(args: argparse.Namespace, *, mode: Literal["process",
             "dry_run": False,
             "clean": False,
             "yes": False,
-            "process_incomplete": False,
-            "validate_manifest": True,
-            "strict_manifest": False,
+            "max_results_missing_pct": 2.5,
             "exclude_dataset": [],
             "exclude_model": [],
+            "replace_model": [],
+            "replace_env": [],
         },
         "winrate": {
             "processed_dir": DEFAULT_PROCESSED_DIR,
-            "output_dir": DEFAULT_WINRATE_DIR,
             "missing_policy": "neg-inf",
             "epsilon": 1e-9,
             "min_common": 0,
@@ -935,6 +1183,7 @@ def _finalize_config_args(args: argparse.Namespace, *, mode: Literal["process",
             "exclude_dataset": [],
             "partial_datasets": "strict",
             "hf_processed_pull": False,
+            "hf_winrate_dir": "winrate",
             "hf_private": False,
             "yes": False,
         },
@@ -942,11 +1191,21 @@ def _finalize_config_args(args: argparse.Namespace, *, mode: Literal["process",
     for attr, default in defaults.items():
         if getattr(args, attr, None) is None:
             setattr(args, attr, default)
+    if mode == "winrate" and getattr(args, "output_dir", None) is None:
+        args.output_dir = _default_winrate_output_dir(Path(args.processed_dir))
 
     if hasattr(args, "exclude_dataset"):
         args.exclude_dataset = _parse_repeatable_csv(args.exclude_dataset)
     if mode == "process" and hasattr(args, "exclude_model"):
         args.exclude_model = _parse_repeatable_csv(args.exclude_model)
+    if mode == "process" and hasattr(args, "replace_model"):
+        args.replace_model = _parse_repeatable_csv(args.replace_model)
+    if mode == "process" and hasattr(args, "replace_env"):
+        args.replace_env = _parse_repeatable_csv(args.replace_env)
+
+
+def _default_winrate_output_dir(processed_dir: Path) -> Path:
+    return Path(processed_dir) / "winrate"
 
 
 def _upload_winrate_outputs(
@@ -955,11 +1214,19 @@ def _upload_winrate_outputs(
     output_paths: Sequence[Path],
     repo_id: str,
     token: str | None,
+    branch: str | None,
     private: bool,
+    winrate_dir: str | None,
     assume_yes: bool = False,
 ) -> None:
     if not output_paths:
         return
+    raw_dir = "winrate" if winrate_dir is None else str(winrate_dir).strip()
+    if not raw_dir:
+        raw_dir = "winrate"
+    if resolve_under(Path("."), raw_dir) is None:
+        logger.error("Invalid winrate_dir '%s'; skipping upload.", winrate_dir)
+        return
     output_dir = Path(output_dir)
     files: list[str] = []
     for path in output_paths:
@@ -981,6 +1248,8 @@ def _upload_winrate_outputs(
         token=token,
         private=private,
         message=message,
+        branch=branch,
+        path_in_repo_prefix=raw_dir,
         is_tty=sys.stdin.isatty(),
         assume_yes=assume_yes,
         prompt_func=input,
@@ -993,20 +1262,21 @@ def _run_winrate_mode(argv: Sequence[str]) -> int:
 
     if args.config:
         _load_and_apply_config(args, args.config, mode="winrate", parser=parser)
+    args._output_dir_explicit = args.output_dir is not None
     _finalize_config_args(args, mode="winrate")
 
     hf_config = HFSyncConfig.from_cli(
-        repo=args.hf_processed_repo,
+        repo=args.hf_repo,
         branch=args.hf_branch,
         token=args.hf_token,
-        private=False,
+        private=bool(args.hf_private),
         dry_run=False,
     )
 
     if args.list_models:
         source_dir, datasets, source_desc = _resolve_source(
             args.processed_dir,
-            hf_config=hf_config if args.hf_processed_repo else None,
+            hf_config=hf_config if args.hf_repo else None,
             hf_processed_pull=bool(args.hf_processed_pull),
         )
         if args.exclude_dataset:
@@ -1054,13 +1324,15 @@ def _run_winrate_mode(argv: Sequence[str]) -> int:
 
     logger.info("Computed win rates for %d dataset(s): %s", len(winrate_result.datasets), winrate_result.output_path)
     print_winrate_summary_markdown(winrate_result.result)
-    if args.hf_winrate_repo:
+    if args.hf_repo:
         _upload_winrate_outputs(
             output_dir=args.output_dir,
             output_paths=winrate_result.output_paths,
-            repo_id=args.hf_winrate_repo,
+            repo_id=args.hf_repo,
             token=args.hf_token,
+            branch=args.hf_branch,
             private=bool(args.hf_private),
+            winrate_dir=args.hf_winrate_dir,
             assume_yes=bool(args.yes),
         )
     return 0
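
Review note: the config helpers added in this file resolve `hf.token` from the environment and anchor a relative `process.dir` at the parent of `runs_dir`. The sketch below mirrors that logic in isolation so the behavior is easy to check. The function names and `ASSUMED_RUNS_RAW_DIR` are hypothetical stand-ins (the real `DEFAULT_RUNS_RAW_DIR` is defined outside this diff, and `runs/raw` is assumed from the docs default), so treat this as an illustration rather than the module's API.

```python
from __future__ import annotations

import os
from pathlib import Path

# Assumption: stand-in for DEFAULT_RUNS_RAW_DIR, which is not shown in this diff.
ASSUMED_RUNS_RAW_DIR = Path("runs/raw")


def resolve_token(value: str) -> str:
    """Expand "$VAR" or "${VAR}" from the environment; return other values unchanged."""
    trimmed = value.strip()
    env_var = None
    if trimmed.startswith("${") and trimmed.endswith("}") and len(trimmed) > 3:
        env_var = trimmed[2:-1].strip()
    elif trimmed.startswith("$") and len(trimmed) > 1:
        env_var = trimmed[1:].strip()
    if not env_var:
        # Not an environment reference: treat the value as a literal token.
        return value
    env_value = os.getenv(env_var)
    if env_value is None:
        raise ValueError(f"unset environment variable '{env_var}'")
    return env_value


def resolve_process_dir(value: str, runs_dir: str | None = None) -> Path | None:
    """Resolve a relative process dir against the parent of the runs directory."""
    raw = value.strip()
    if not raw:
        return None
    candidate = Path(raw)
    if candidate.is_absolute():
        return candidate
    base = Path(runs_dir).parent if runs_dir is not None else ASSUMED_RUNS_RAW_DIR.parent
    return base / candidate
```

Under these assumptions, `process.dir: processed` with `runs_dir: data/raw` lands in `data/processed`, and a token written as `${HF_TOKEN}` fails fast when the variable is unset instead of being passed along as a literal string.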
diff --git a/medarc_verifiers/cli/process/__init__.py b/medarc_verifiers/cli/process/__init__.py
index 6c20133e..35cb601d 100644
--- a/medarc_verifiers/cli/process/__init__.py
+++ b/medarc_verifiers/cli/process/__init__.py
@@ -1,5 +1,5 @@
 """Process command pipeline for exporting MedARC runs."""
 
-from .pipeline import ProcessOptions, ProcessResult, run_process
+from .pipeline import PROCESS_DEFAULT_STATUS_FILTER, ProcessOptions, ProcessResult, run_process
 
-__all__ = ["ProcessOptions", "ProcessResult", "run_process"]
+__all__ = ["PROCESS_DEFAULT_STATUS_FILTER", "ProcessOptions", "ProcessResult", "run_process"]
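
Review note: `PROCESS_DEFAULT_STATUS_FILTER` is re-exported here because `_build_process_options` falls back to it when no `--status` value is given. The sketch below shows that fallback together with the `max_results_missing_pct` default of 2.5 that appears literally in the diff. The value `("completed",)` is an assumption based on the docs change ("default: `completed`"); the actual constant lives in `pipeline.py`, which this chunk does not show. The function names are hypothetical.

```python
from __future__ import annotations

from typing import Sequence

# Assumption: the real PROCESS_DEFAULT_STATUS_FILTER is defined in pipeline.py.
ASSUMED_DEFAULT_STATUS_FILTER: tuple[str, ...] = ("completed",)
# This literal matches the fallback used in _build_process_options.
DEFAULT_MAX_RESULTS_MISSING_PCT = 2.5


def resolve_status_filter(cli_status: Sequence[str] | None) -> tuple[str, ...]:
    """Use explicit statuses when given, otherwise the default filter."""
    values = list(cli_status or [])
    return tuple(values) if values else ASSUMED_DEFAULT_STATUS_FILTER


def resolve_missing_pct(cli_value: float | str | None) -> float:
    """Coerce an explicit threshold to float, otherwise use the default."""
    return float(cli_value) if cli_value is not None else DEFAULT_MAX_RESULTS_MISSING_PCT
```

Note that an explicit threshold of `0` is preserved, because the check is `is not None` rather than truthiness.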
diff --git a/medarc_verifiers/cli/process/aggregate.py b/medarc_verifiers/cli/process/aggregate.py
index b00d5ff2..f6a25966 100644
--- a/medarc_verifiers/cli/process/aggregate.py
+++ b/medarc_verifiers/cli/process/aggregate.py
@@ -6,7 +6,8 @@
 from dataclasses import dataclass
 from typing import Any, Iterable, Mapping
 
-from medarc_verifiers.cli.process.rollout import derive_base_env_id
+from medarc_verifiers.cli.process.metadata import RunIdentity
+from medarc_verifiers.cli.process.rollout import extract_rollout_index
 
 logger = logging.getLogger(__name__)
 
@@ -25,9 +26,17 @@ class AggregatedEnvRows:
 
 def aggregate_rows_by_env(
     rows: Iterable[Mapping[str, Any]],
+    *,
+    identities: Iterable[RunIdentity] | None = None,
 ) -> list[AggregatedEnvRows]:
     """Group enriched rows by (model_id, base_env_id), capturing unioned schemas."""
     groups: dict[tuple[str, str], dict[str, Any]] = {}
+    identity_list = list(identities or ())
+    fake_rollout_groups = {
+        (identity.model_id, identity.output_env_id)
+        for identity in identity_list
+        if identity.rollout_index is not None
+    }
 
     for row in rows:
         base_env_id = str(row.get("base_env_id") or row.get("env_id") or "")
@@ -68,7 +77,15 @@ def aggregate_rows_by_env(
         # processing "fake rollouts" that are created by running separate jobs with rollout suffixes
         # (e.g., env-a-rollout7) and then combining them under a shared base_env_id.
         normalized_rows: list[Mapping[str, Any]] = list(group["rows"])  # shallow copy
-        if _group_uses_rollout_suffixes(normalized_rows, base_env_id=group["base_env_id"] or key[1]):
+        if key in fake_rollout_groups:
+            _ensure_rollout_index_from_identities(
+                normalized_rows,
+                identities=identity_list,
+                model_id=group["model_id"],
+                base_env_id=group["base_env_id"] or key[1],
+            )
+            _normalize_rollout_indices(normalized_rows)
+        elif _group_uses_rollout_suffixes(normalized_rows, base_env_id=group["base_env_id"] or key[1]):
             _ensure_rollout_index_from_suffix(normalized_rows, base_env_id=group["base_env_id"] or key[1])
             _normalize_rollout_indices(normalized_rows)
         candidate_env_id = group["env_id"] or group["base_env_id"] or ""
@@ -85,13 +102,47 @@ def aggregate_rows_by_env(
     return aggregated
 
 
+def _ensure_rollout_index_from_identities(
+    rows: list[Mapping[str, Any]],
+    *,
+    identities: list[RunIdentity],
+    model_id: str,
+    base_env_id: str,
+) -> None:
+    rollout_by_manifest_env: dict[str, int] = {}
+    for identity in identities:
+        if identity.model_id != model_id or identity.output_env_id != base_env_id:
+            continue
+        if identity.rollout_index is None:
+            continue
+        rollout_by_manifest_env[identity.manifest_env_id] = identity.rollout_index
+
+    if not rollout_by_manifest_env:
+        return
+
+    for row in rows:
+        value = row.get("rollout_index")
+        if _coerce_rollout_index(value) is not None:
+            continue
+        manifest_env_id = row.get("manifest_env_id")
+        if not isinstance(manifest_env_id, str):
+            continue
+        resolved = rollout_by_manifest_env.get(manifest_env_id)
+        if resolved is None:
+            continue
+        try:
+            row["rollout_index"] = resolved
+        except TypeError:
+            continue
+
+
 def _group_uses_rollout_suffixes(rows: list[Mapping[str, Any]], *, base_env_id: str) -> bool:
     for row in rows:
         manifest_env_id = row.get("manifest_env_id")
         if not isinstance(manifest_env_id, str) or not manifest_env_id:
             continue
-        derived_base, _ = derive_base_env_id(manifest_env_id)
-        if derived_base and derived_base == base_env_id and manifest_env_id != derived_base:
+        row_base_env_id = str(row.get("base_env_id") or base_env_id or "")
+        if row_base_env_id and manifest_env_id != row_base_env_id:
             return True
     return False
 
@@ -104,8 +155,11 @@ def _ensure_rollout_index_from_suffix(rows: list[Mapping[str, Any]], *, base_env
         manifest_env_id = row.get("manifest_env_id")
         if not isinstance(manifest_env_id, str) or not manifest_env_id:
             continue
-        derived_base, derived_index = derive_base_env_id(manifest_env_id)
-        if not derived_base or derived_base != base_env_id:
+        row_base_env_id = str(row.get("base_env_id") or base_env_id or "")
+        if not row_base_env_id or manifest_env_id == row_base_env_id:
+            continue
+        derived_index = extract_rollout_index(manifest_env_id)
+        if derived_index <= 0:
             continue
         try:
             row["rollout_index"] = derived_index
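
Review note: the new identity-based path in `aggregate_rows_by_env` fills a missing `rollout_index` on each row by looking up the row's `manifest_env_id` among the identities that share its model and output env. The sketch below reproduces that lookup in a simplified form. `Identity` is a hypothetical stand-in for `RunIdentity` that carries only the four fields the diff reads, and the "already set" check uses a plain `is not None` test instead of `_coerce_rollout_index`, whose body is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Identity:
    # Hypothetical stand-in for RunIdentity; only the fields used by the diff.
    model_id: str
    output_env_id: str
    manifest_env_id: str
    rollout_index: int | None


def fill_rollout_index(
    rows: list[dict[str, Any]],
    identities: list[Identity],
    *,
    model_id: str,
    base_env_id: str,
) -> None:
    """Set rollout_index in place on rows that lack one, using matching identities."""
    lookup = {
        identity.manifest_env_id: identity.rollout_index
        for identity in identities
        if identity.model_id == model_id
        and identity.output_env_id == base_env_id
        and identity.rollout_index is not None
    }
    for row in rows:
        if row.get("rollout_index") is not None:
            continue  # simplified: keep any index that is already present
        manifest_env_id = row.get("manifest_env_id")
        if not isinstance(manifest_env_id, str):
            continue
        resolved = lookup.get(manifest_env_id)
        if resolved is not None:
            row["rollout_index"] = resolved
```

Rows whose `manifest_env_id` has no matching identity are left untouched, which is why the diff still keeps the suffix-based fallback in the `elif` branch.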
diff --git a/medarc_verifiers/cli/process/discovery.py b/medarc_verifiers/cli/process/discovery.py
index fc583f10..7aba00f8 100644
--- a/medarc_verifiers/cli/process/discovery.py
+++ b/medarc_verifiers/cli/process/discovery.py
@@ -20,7 +20,6 @@
 logger = logging.getLogger(__name__)
 
 DEFAULT_STATUS = "unknown"
-_COMPLETED_STATUSES = {"completed", "succeeded", "success"}
 
 
 @dataclass(frozen=True, slots=True)
@@ -66,8 +65,10 @@ class RunRecord:
     reason: str | None
     started_at: str | None
     ended_at: str | None
+    avg_reward: float | None
     num_examples: int | None
     rollouts_per_example: int | None
+    row_count: int | None
     env_args: Mapping[str, Any]
     sampling_args: Mapping[str, Any]
     env_config: Mapping[str, Any] | None
@@ -78,17 +79,15 @@ def discover_run_records(
     runs_dir: Path | str,
     *,
     filter_status: Sequence[str] | None = None,
-    only_complete_runs: bool = False,
 ) -> list[RunRecord]:
     """Return all discovered run records within the provided runs directory."""
-    return list(iter_run_records(runs_dir, filter_status=filter_status, only_complete_runs=only_complete_runs))
+    return list(iter_run_records(runs_dir, filter_status=filter_status))
 
 
 def iter_run_records(
     runs_dir: Path | str,
     *,
     filter_status: Sequence[str] | None = None,
-    only_complete_runs: bool = False,
 ) -> Iterator[RunRecord]:
     """Yield run records for each job entry found under the runs directory."""
     runs_path = Path(runs_dir)
@@ -108,13 +107,6 @@ def iter_run_records(
         manifest_info, job_entries = _load_manifest(run_dir)
         if manifest_info is None:
             continue
-        if (
-            only_complete_runs
-            and manifest_info.summary_total_known
-            and manifest_info.summary_completed != manifest_info.summary_total
-        ):
-            # Skip entire run if not fully completed
-            continue
         summary_map = _load_run_summary(run_dir)
         for job_entry in job_entries:
             summary_entry = summary_map.get(job_entry.job_id or "")
@@ -194,8 +186,10 @@ def _build_run_record(
         reason=reason or job_entry.reason,
         started_at=job_entry.started_at,
         ended_at=job_entry.ended_at,
+        avg_reward=job_entry.avg_reward,
         num_examples=job_entry.num_examples,
         rollouts_per_example=job_entry.rollouts_per_example,
+        row_count=job_entry.row_count,
         env_args=env_args,
         sampling_args=sampling_args,
         env_config=env_config,
diff --git a/medarc_verifiers/cli/process/env_index.py b/medarc_verifiers/cli/process/env_index.py
index 89c85c37..86fecd50 100644
--- a/medarc_verifiers/cli/process/env_index.py
+++ b/medarc_verifiers/cli/process/env_index.py
@@ -54,21 +54,9 @@ def read_env_index_inventory(processed_dir: Path) -> EnvIndexInventory:
     """Read env_index.json and return a dataset inventory."""
     index_path = processed_dir / "env_index.json"
     payload = load_env_index(index_path)
-    version = payload.get("version") if isinstance(payload, Mapping) else None
-    if version == 2:
+    if isinstance(payload, Mapping) and int(payload.get("version") or 0) == 2:
         return _inventory_from_v2(payload, processed_dir)
-    return EnvIndexInventory(env_paths={}, version=int(version or 1))
-
-
-def read_env_index_runs(processed_dir: Path) -> tuple[int, dict[str, Mapping[str, Any]]]:
-    """Return env_index version and run metadata map."""
-    index_path = processed_dir / "env_index.json"
-    payload = load_env_index(index_path)
-    version = int(payload.get("version") or 1) if isinstance(payload, Mapping) else 1
-    runs = payload.get("runs") if isinstance(payload, Mapping) else None
-    if version != 2 or not isinstance(runs, Mapping):
-        return version, {}
-    return version, {str(k): v for k, v in runs.items() if isinstance(v, Mapping)}
+    return EnvIndexInventory(env_paths={}, version=0)
 
 
 def read_env_index_files(processed_dir: Path) -> dict[str, Mapping[str, Any]]:
@@ -118,7 +106,6 @@ def read_env_index_models(processed_dir: Path) -> set[str]:
 __all__ = [
     "EnvIndexInventory",
     "read_env_index_inventory",
-    "read_env_index_runs",
     "read_env_index_files",
     "read_env_index_models",
 ]
diff --git a/medarc_verifiers/cli/process/metadata.py b/medarc_verifiers/cli/process/metadata.py
index 6bfae643..118e63f8 100644
--- a/medarc_verifiers/cli/process/metadata.py
+++ b/medarc_verifiers/cli/process/metadata.py
@@ -4,6 +4,7 @@
 
 import json
 import logging
+import math
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Mapping, MutableMapping
@@ -21,6 +22,7 @@ class _MetadataPayload(BaseModel):
 
     env_id: str | None = None
     model: str | None = None
+    avg_reward: float | None = None
     version_info: dict[str, str | None] | None = None
     env_args: dict[str, Any] = Field(default_factory=dict)
     num_examples: int | None = None
@@ -32,6 +34,7 @@ class _MetadataPayload(BaseModel):
 class NormalizedMetadata:
     """Normalized view of metadata.json merged with manifest discovery data."""
 
+    identity: "RunIdentity"
     record: RunRecord
     metadata_path: Path | None
     raw_metadata: Mapping[str, Any]
@@ -47,13 +50,111 @@ class NormalizedMetadata:
     rollouts_per_example: int | None
 
 
+@dataclass(frozen=True, slots=True)
+class RunIdentity:
+    """Canonical identity for selecting and exporting a discovered run record."""
+
+    model_id: str
+    manifest_env_id: str
+    base_env_id: str
+    rollout_index: int | None
+    job_run_id: str
+    output_env_id: str
+
+
+@dataclass(frozen=True, slots=True)
+class ResolvedRunIdentity:
+    """Selection-time identity that tolerates missing model ids."""
+
+    model_id: str | None
+    manifest_env_id: str
+    base_env_id: str
+    rollout_index: int | None
+    job_run_id: str
+    output_env_id: str
+
+
+@dataclass(frozen=True, slots=True)
+class _ResolvedMetadataContext:
+    raw_metadata: Mapping[str, Any]
+    manifest_env_id: str
+    metadata_env_id: str | None
+    base_env_id: str
+    rollout_index: int
+    model_id: str | None
+    metadata_model: str | None
+    env_args: Mapping[str, Any]
+    sampling_args: Mapping[str, Any]
+    num_examples: int | None
+    rollouts_per_example: int | None
+
+
+def resolve_run_identity(
+    record: RunRecord,
+    *,
+    combine_rollouts: bool = True,
+) -> ResolvedRunIdentity:
+    """Resolve a run identity for selection without requiring model_id."""
+    context = _resolve_metadata_context(record, combine_rollouts=combine_rollouts)
+    resolved_rollout_index = (
+        context.rollout_index if context.rollout_index != 0 or context.manifest_env_id != context.base_env_id else None
+    )
+    return ResolvedRunIdentity(
+        model_id=context.model_id,
+        manifest_env_id=context.manifest_env_id,
+        base_env_id=context.base_env_id,
+        rollout_index=resolved_rollout_index,
+        job_run_id=record.manifest.job_run_id,
+        output_env_id=context.base_env_id or context.manifest_env_id or record.job_id,
+    )
+
+
 def load_normalized_metadata(
     record: RunRecord,
     *,
     combine_rollouts: bool = True,
 ) -> NormalizedMetadata:
     """Merge manifest fields with metadata.json (when present)."""
+    context = _resolve_metadata_context(record, combine_rollouts=combine_rollouts)
+    if not context.model_id:
+        raise RuntimeError(format_missing_model_id_error(record))
+    resolved_rollout_index = (
+        context.rollout_index if context.rollout_index != 0 or context.manifest_env_id != context.base_env_id else None
+    )
+    identity = RunIdentity(
+        model_id=context.model_id,
+        manifest_env_id=context.manifest_env_id,
+        base_env_id=context.base_env_id,
+        rollout_index=resolved_rollout_index,
+        job_run_id=record.manifest.job_run_id,
+        output_env_id=context.base_env_id or context.manifest_env_id or record.job_id,
+    )
+
+    return NormalizedMetadata(
+        identity=identity,
+        record=record,
+        metadata_path=record.metadata_path if record.has_metadata else None,
+        raw_metadata=context.raw_metadata,
+        manifest_env_id=context.manifest_env_id,
+        metadata_env_id=context.metadata_env_id,
+        base_env_id=context.base_env_id,
+        rollout_index=identity.rollout_index or 0,
+        model_id=identity.model_id,
+        metadata_model=context.metadata_model,
+        env_args=context.env_args,
+        sampling_args=context.sampling_args,
+        num_examples=context.num_examples,
+        rollouts_per_example=context.rollouts_per_example,
+    )
+
+
+def _resolve_metadata_context(
+    record: RunRecord,
+    *,
+    combine_rollouts: bool,
+) -> _ResolvedMetadataContext:
     metadata_payload, raw_metadata = _load_metadata(record)
+    _warn_manifest_metadata_result_mismatch(record, metadata_payload)
     metadata_env_id = metadata_payload.env_id if metadata_payload else None
     metadata_model = metadata_payload.model if metadata_payload else None
     env_args = _merge_mappings(
@@ -64,7 +165,6 @@ def load_normalized_metadata(
         primary=record.sampling_args,
         fallback=metadata_payload.sampling_args if metadata_payload else None,
     )
-
     manifest_env_id = (
         _extract_env_config_id(record.env_config) or record.manifest_env_id or metadata_env_id or record.job_id
     )
@@ -72,34 +172,36 @@ def load_normalized_metadata(
         manifest_env_id,
         combine_rollouts=combine_rollouts,
     )
-    # If we didn't capture a rollout index from the manifest env id,
-    # try to derive it from the results directory name (common when
-    # manifests keep base env id, but the on-disk folder encodes the rollout).
     if rollout_index == 0 and record.results_dir_name:
         alt_index = extract_rollout_index(record.results_dir_name)
         if alt_index:
             rollout_index = alt_index
-
-    model_id = record.model_id or metadata_model
-    num_examples = record.num_examples or (metadata_payload.num_examples if metadata_payload else None)
-    rollouts_per_example = record.rollouts_per_example or (
-        metadata_payload.rollouts_per_example if metadata_payload else None
-    )
-
-    return NormalizedMetadata(
-        record=record,
-        metadata_path=record.metadata_path if record.has_metadata else None,
+    return _ResolvedMetadataContext(
         raw_metadata=raw_metadata,
         manifest_env_id=manifest_env_id,
         metadata_env_id=metadata_env_id,
         base_env_id=base_env_id,
         rollout_index=rollout_index,
-        model_id=model_id,
+        model_id=record.model_id or metadata_model,
         metadata_model=metadata_model,
         env_args=env_args,
         sampling_args=sampling_args,
-        num_examples=num_examples,
-        rollouts_per_example=rollouts_per_example,
+        num_examples=_prefer_manifest_value(
+            record.num_examples,
+            metadata_payload.num_examples if metadata_payload else None,
+        ),
+        rollouts_per_example=_prefer_manifest_value(
+            record.rollouts_per_example,
+            metadata_payload.rollouts_per_example if metadata_payload else None,
+        ),
+    )
+
+
+def format_missing_model_id_error(record: RunRecord) -> str:
+    return (
+        "Missing model_id for run "
+        f"(job_run_id={record.manifest.job_run_id}, job_id={record.job_id}, "
+        f"results_dir={record.results_dir}, manifest={record.manifest.manifest_path})"
     )
 
 
@@ -153,6 +255,50 @@ def _merge_mappings(
     return result
 
 
+def _prefer_manifest_value(primary: int | None, fallback: int | None) -> int | None:
+    if primary is not None:
+        return primary
+    return fallback
+
+
+def _warn_manifest_metadata_result_mismatch(record: RunRecord, metadata_payload: _MetadataPayload | None) -> None:
+    if metadata_payload is None:
+        return
+
+    mismatches: list[str] = []
+    if _has_float_mismatch(record.avg_reward, metadata_payload.avg_reward):
+        mismatches.append(
+            f"avg_reward manifest={record.avg_reward!r} metadata={metadata_payload.avg_reward!r}"
+        )
+    if _has_int_mismatch(record.num_examples, metadata_payload.num_examples):
+        mismatches.append(
+            f"num_examples manifest={record.num_examples!r} metadata={metadata_payload.num_examples!r}"
+        )
+    if not mismatches:
+        return
+
+    logger.warning(
+        "Manifest/metadata result mismatch for process input "
+        "(job_run_id=%s, job_id=%s, metadata=%s): %s",
+        record.manifest.job_run_id,
+        record.job_id,
+        record.metadata_path,
+        "; ".join(mismatches),
+    )
+
+
+def _has_float_mismatch(left: float | None, right: float | None) -> bool:
+    if left is None or right is None:
+        return False
+    return not math.isclose(left, right, rel_tol=1e-9, abs_tol=1e-9)
+
+
+def _has_int_mismatch(left: int | None, right: int | None) -> bool:
+    if left is None or right is None:
+        return False
+    return left != right
+
+
 def _extract_env_config_id(env_config: Mapping[str, Any] | None) -> str | None:
     if not env_config:
         return None
@@ -164,4 +310,11 @@ def _extract_env_config_id(env_config: Mapping[str, Any] | None) -> str | None:
     return None
 
 
-__all__ = ["NormalizedMetadata", "load_normalized_metadata"]
+__all__ = [
+    "NormalizedMetadata",
+    "ResolvedRunIdentity",
+    "RunIdentity",
+    "format_missing_model_id_error",
+    "load_normalized_metadata",
+    "resolve_run_identity",
+]
diff --git a/medarc_verifiers/cli/process/pipeline.py b/medarc_verifiers/cli/process/pipeline.py
index 36609ae5..b23fc16d 100644
--- a/medarc_verifiers/cli/process/pipeline.py
+++ b/medarc_verifiers/cli/process/pipeline.py
@@ -1,31 +1,27 @@
-"""Top-level pipeline wiring discovery, row loading, aggregation, and writing."""
+"""Top-level pipeline wiring discovery, selection, row loading, aggregation, and writing."""
 
 from __future__ import annotations
 
+import json
 import logging
 import sys
 from concurrent.futures import ProcessPoolExecutor, as_completed
 from dataclasses import dataclass, field
 from datetime import UTC, datetime
 from pathlib import Path
-from typing import Any, Callable, Iterable, Mapping, Sequence
+from typing import Any, Iterable, Mapping, Sequence
+
+import pyarrow.parquet as pq
 
-from medarc_verifiers.cli._schemas import EnvironmentExportConfig
 from medarc_verifiers.cli import hf as hf_sync
-from medarc_verifiers.cli.process import (
-    aggregate,
-    discovery,
-    env_index,
-    metadata,
-    rows,
-    rollout,
-    writer,
-    workspace,
-)
-from medarc_verifiers.cli.process.aggregate import AggregatedEnvRows
+from medarc_verifiers.cli._schemas import EnvironmentExportConfig
 from medarc_verifiers.cli.hf import HFSyncConfig, HFSyncSummary
-from medarc_verifiers.cli.process.writer import EnvWriteSummary, WriterConfig
+from medarc_verifiers.cli.process import aggregate, discovery, env_index, metadata, rollout, rows, workspace, writer
+from medarc_verifiers.cli.process.aggregate import AggregatedEnvRows
+from medarc_verifiers.cli.process.metadata import RunIdentity
+from medarc_verifiers.cli.process.writer import EXPORTER_METADATA_KEY, EnvWriteSummary, WriterConfig
 from medarc_verifiers.cli.utils.shared import (
+    count_jsonl_rows,
     dataset_is_excluded,
     model_is_excluded,
     normalize_dataset_ids,
@@ -33,6 +29,7 @@
 )
 
 logger = logging.getLogger(__name__)
+PROCESS_DEFAULT_STATUS_FILTER: tuple[str, ...] = ("completed",)
 
 
 @dataclass(slots=True)
@@ -41,12 +38,14 @@ class ProcessOptions:
 
     runs_dir: Path
     output_dir: Path
-    only_complete_runs: bool = True
+    max_results_missing_pct: float = 2.5
     exclude_datasets: Sequence[str] = field(default_factory=tuple)
     exclude_models: Sequence[str] = field(default_factory=tuple)
+    replace_models: Sequence[str] = field(default_factory=tuple)
+    replace_envs: Sequence[str] = field(default_factory=tuple)
     processed_at: str | None = None
     processed_with_args: Mapping[str, Any] = field(default_factory=dict)
-    status_filter: Sequence[str] = field(default_factory=tuple)
+    status_filter: Sequence[str] = field(default_factory=lambda: PROCESS_DEFAULT_STATUS_FILTER)
     dry_run: bool = False
     clean: bool = False
     assume_yes: bool = False
@@ -57,12 +56,15 @@ class ProcessOptions:
     def __post_init__(self) -> None:
         self.runs_dir = Path(self.runs_dir)
         self.output_dir = Path(self.output_dir)
+        self.max_results_missing_pct = float(self.max_results_missing_pct)
         self.max_workers = max(1, int(self.max_workers))
         if not self.processed_at:
             self.processed_at = datetime.now(UTC).replace(microsecond=0).isoformat().replace("+00:00", "Z")
         self.status_filter = tuple(str(status) for status in self.status_filter)
         self.exclude_datasets = tuple(str(value) for value in self.exclude_datasets if str(value).strip())
         self.exclude_models = tuple(str(value) for value in self.exclude_models if str(value).strip())
+        self.replace_models = tuple(str(value) for value in self.replace_models if str(value).strip())
+        self.replace_envs = tuple(str(value) for value in self.replace_envs if str(value).strip())
 
 
 @dataclass(slots=True)
@@ -76,8 +78,8 @@ class ProcessResult:
     hf_summary: HFSyncSummary | None
 
 
-@dataclass(slots=True)
-class _RecordWork:
+@dataclass(frozen=True, slots=True)
+class PlannedRecord:
     """Per-record settings for row loading."""
 
     normalized: metadata.NormalizedMetadata
@@ -86,26 +88,42 @@ class _RecordWork:
     answer_column: str | None
 
 
-@dataclass(slots=True)
-class _NormalizedRecord:
+@dataclass(frozen=True, slots=True)
+class PlannedWorkItem:
+    """A single selected (model, env) output to process."""
+
+    identity: RunIdentity
+    records: list[PlannedRecord]
+
+
+@dataclass(frozen=True, slots=True)
+class SelectionRecord:
+    """Selection-time record settings before full normalization."""
+
     record: discovery.RunRecord
-    normalized: metadata.NormalizedMetadata
+    identity: metadata.ResolvedRunIdentity
+    combine_rollouts: bool
     extra_columns: Sequence[str]
     drop_columns: Sequence[str]
     answer_column: str | None
-    model_key: str
-    env_key: str
-    job_run_id: str
-    run_timestamp: str
 
 
-@dataclass(slots=True)
-class _EnvGroupSelection:
-    model_key: str
-    env_key: str
-    job_run_id: str
-    run_timestamp: str
-    records: list[_NormalizedRecord]
+@dataclass(frozen=True, slots=True)
+class SelectionWorkItem:
+    """A selected work item before metadata normalization."""
+
+    identity: metadata.ResolvedRunIdentity
+    records: list[SelectionRecord]
+
+
+@dataclass(frozen=True, slots=True)
+class SelectionResult:
+    """Complete output of the selection phase."""
+
+    work_items: list[PlannedWorkItem]
+    skipped_by_delta: int
+    skipped_by_exclusion: int
+    total_discovered: int
 
 
 def run_process(
@@ -117,101 +135,54 @@ def run_process(
     env_export_map = env_export_map or {}
 
     def _run_pipeline() -> ProcessResult:
-        if not options.dry_run and options.clean:
-            _confirm_clean_process(
-                options.output_dir,
-                assume_yes=options.assume_yes,
-                is_tty=sys.stdin.isatty(),
-                prompt_func=input,
-            )
-            workspace.clear_output_dir(options.output_dir)
-        if not options.dry_run and options.hf_config and options.hf_config.repo_id and not options.clean:
-            workspace.prepare_hf_baseline(
+        baseline_result: workspace.BaselineResult | None = None
+        if not options.dry_run:
+            preparation = workspace.prepare_output_workspace(
                 output_dir=options.output_dir,
                 hf_config=options.hf_config,
                 pull_policy=options.hf_pull_policy,
+                clean=options.clean,
+                assume_yes=options.assume_yes,
                 is_tty=sys.stdin.isatty(),
                 prompt_func=input,
             )
+            if preparation is not None:
+                baseline_result = preparation.baseline_result
 
-        index_version, index_runs = env_index.read_env_index_runs(options.output_dir)
-        index_files = env_index.read_env_index_files(options.output_dir)
-        if options.clean:
-            index_version = 0
-            index_runs = {}
-            index_files = {}
-
+        index_files = {} if options.clean else env_index.read_env_index_files(options.output_dir)
         discovered = discovery.discover_run_records(
             options.runs_dir,
             filter_status=options.status_filter or None,
-            only_complete_runs=False,
         )
-
-        use_delta = index_version == 2 and not options.clean
-        if index_version != 2 and not options.clean:
-            logger.info("Delta processing disabled: missing or legacy env_index.json; running full reprocess.")
-        records: list[discovery.RunRecord] = list(discovered)
-        if options.only_complete_runs:
-            records = [
-                record
-                for record in records
-                if not (
-                    record.manifest.summary_total_known
-                    and record.manifest.summary_completed != record.manifest.summary_total
-                )
-            ]
-        normalized_records = _normalize_records(records, env_export_map)
-        env_groups = _select_latest_env_groups(normalized_records)
-        if use_delta:
-            env_groups = _filter_env_groups_by_delta(
-                env_groups,
-                index_runs,
-                index_files,
-                output_dir=options.output_dir,
-            )
-        if options.exclude_datasets:
-            env_groups = _filter_env_groups_by_exclusion(env_groups, options.exclude_datasets)
-        if options.exclude_models:
-            env_groups = _filter_env_groups_by_model_exclusion(env_groups, options.exclude_models)
-        records = [item.record for group in env_groups for item in group.records]
-
+        selection = select_work_items(
+            discovered,
+            options=options,
+            env_export_map=env_export_map,
+            index_files=index_files,
+        )
+        selected_records = [planned.normalized.record for item in selection.work_items for planned in item.records]
         _print_records_table(
             discovered,
-            records,
-            options.only_complete_runs,
+            selected_records,
+            options.max_results_missing_pct,
             exclude_datasets=options.exclude_datasets,
             exclude_models=options.exclude_models,
+            skipped_by_delta=selection.skipped_by_delta,
+            skipped_by_exclusion=selection.skipped_by_exclusion,
         )
 
-        grouped: dict[tuple[str, str], list[_RecordWork]] = {}
         run_metadata: dict[str, dict[str, Any]] = {}
-        record_items = [item for group in env_groups for item in group.records]
-        record_iter: Iterable[_NormalizedRecord] = record_items
-        try:
-            from rich.progress import track
-
-            record_iter = track(record_items, description="Reading run outputs", transient=True)
-        except Exception:
-            pass
-
-        for record in record_iter:
-            normalized = record.normalized
-            grouped.setdefault((record.model_key, record.env_key), []).append(
-                _RecordWork(
-                    normalized=normalized,
-                    extra_columns=record.extra_columns,
-                    drop_columns=record.drop_columns,
-                    answer_column=record.answer_column,
+        for item in selection.work_items:
+            for planned in item.records:
+                record = planned.normalized.record
+                run_metadata.setdefault(
+                    record.manifest.job_run_id,
+                    {
+                        "created_at": record.manifest.created_at,
+                        "updated_at": _source_updated_at(record),
+                        "config_checksum": record.manifest.config_checksum,
+                    },
                 )
-            )
-            run_metadata.setdefault(
-                record.job_run_id,
-                {
-                    "created_at": record.record.manifest.created_at,
-                    "updated_at": _source_updated_at(record.record),
-                    "config_checksum": record.record.manifest.config_checksum,
-                },
-            )
 
         writer_config = WriterConfig(
             output_dir=options.output_dir,
@@ -223,20 +194,22 @@ def _run_pipeline() -> ProcessResult:
         env_groups: list[AggregatedEnvRows] = []
         env_summaries: list[EnvWriteSummary] = []
         rows_processed = 0
+        work_items = sorted(
+            selection.work_items, key=lambda item: (item.identity.model_id, item.identity.output_env_id)
+        )
 
-        env_items = sorted(grouped.items())
         try:
-            if options.max_workers <= 1 or len(env_items) <= 1:
-                env_iter: Iterable[tuple[tuple[str, str], list[_RecordWork]]] = env_items
+            if options.max_workers <= 1 or len(work_items) <= 1:
+                work_iter: Iterable[PlannedWorkItem] = work_items
                 try:
                     from rich.progress import track
 
-                    env_iter = track(env_items, description="Processing datasets", transient=True)
+                    work_iter = track(work_items, description="Processing datasets", transient=True)
                 except Exception:
-                    env_iter = env_items
+                    work_iter = work_items
 
-                for _, work_items in env_iter:
-                    aggregated, row_count = _process_env_group(work_items)
+                for item in work_iter:
+                    aggregated, row_count = _process_env_group(item)
                     rows_processed += row_count
                     env_groups.extend(aggregated)
                     summaries = writer.write_env_groups(aggregated, writer_config, write_index=False)
@@ -249,8 +222,8 @@ def _run_pipeline() -> ProcessResult:
                 futures = []
                 try:
                     executor = ProcessPoolExecutor(max_workers=options.max_workers)
-                    for _, work_items in env_items:
-                        futures.append(executor.submit(_process_env_group, work_items))
+                    for item in work_items:
+                        futures.append(executor.submit(_process_env_group, item))
 
                     future_iter: Iterable[Any] = as_completed(futures)
                     try:
@@ -273,8 +246,8 @@ def _run_pipeline() -> ProcessResult:
                                 group.rows.clear()
                 except KeyboardInterrupt:
                     logger.warning("Processing cancelled by user; shutting down workers.")
-                    for f in futures:
-                        f.cancel()
+                    for future in futures:
+                        future.cancel()
                     if executor is not None:
                         executor.shutdown(cancel_futures=True)
                     raise
@@ -296,11 +269,20 @@ def _run_pipeline() -> ProcessResult:
 
         hf_summary: HFSyncSummary | None = None
         if options.hf_config:
+            files_to_upload: list[str] | None = None
+            if baseline_result is not None and baseline_result.policy == "continue-upload":
+                touched_files = hf_sync.collect_changed_output_files(
+                    env_summaries,
+                    output_dir=options.output_dir,
+                    metadata_paths=metadata_paths,
+                )
+                files_to_upload = sorted(set(baseline_result.pending_parquet_uploads) | set(touched_files))
             hf_summary = hf_sync.sync_to_hub(
                 env_summaries,
                 options.hf_config,
                 output_dir=options.output_dir,
                 metadata_paths=metadata_paths,
+                files=files_to_upload,
                 is_tty=sys.stdin.isatty(),
                 assume_yes=options.assume_yes,
                 prompt_func=input,
@@ -310,7 +292,7 @@ def _run_pipeline() -> ProcessResult:
             env_groups = [_strip_env_group_rows(group) for group in env_groups]
 
         return ProcessResult(
-            records_processed=len(records),
+            records_processed=len(selected_records),
             rows_processed=rows_processed,
             env_groups=env_groups,
             env_summaries=env_summaries,
@@ -323,31 +305,329 @@ def _run_pipeline() -> ProcessResult:
     return _run_pipeline()
 
 
+def select_work_items(
+    discovered: Sequence[discovery.RunRecord],
+    *,
+    options: ProcessOptions,
+    env_export_map: Mapping[str, EnvironmentExportConfig],
+    index_files: Mapping[str, Mapping[str, Any]],
+) -> SelectionResult:
+    """Filter discovered runs down to selected work items before row loading begins."""
+    planned_records = [_plan_selection_record(record, env_export_map) for record in discovered]
+    _raise_for_latest_invalid_selection(planned_records)
+    work_items = _materialize_work_items(
+        _select_latest_work_items([record for record in planned_records if record.identity.model_id])
+    )
+
+    work_items, skipped_by_exclusion = _apply_exclusions(
+        work_items,
+        exclude_datasets=options.exclude_datasets,
+        exclude_models=options.exclude_models,
+    )
+    _validate_replace_targets(work_items, options)
+    work_items, skipped_by_delta = _apply_additive_delta(work_items, options=options, index_files=index_files)
+    _validate_selected_results_completeness(work_items, max_results_missing_pct=options.max_results_missing_pct)
+
+    return SelectionResult(
+        work_items=work_items,
+        skipped_by_delta=skipped_by_delta,
+        skipped_by_exclusion=skipped_by_exclusion,
+        total_discovered=len(discovered),
+    )
+
+
 def _resolve_env_export(
     manifest_env_id: str | None,
     env_export_map: Mapping[str, EnvironmentExportConfig],
-) -> EnvironmentExportConfig | None:
+) -> EnvironmentExportConfig:
     if not manifest_env_id:
-        return None
+        return EnvironmentExportConfig()
     if manifest_env_id in env_export_map:
         return env_export_map[manifest_env_id]
     base_env_id, _ = rollout.derive_base_env_id(manifest_env_id)
     if base_env_id and base_env_id in env_export_map:
         return env_export_map[base_env_id]
-    return None
+    return EnvironmentExportConfig()
 
 
 def _resolve_columns(env_columns: Sequence[str]) -> Sequence[str]:
     return tuple(str(column).strip() for column in env_columns if str(column).strip())
 
 
+def _plan_selection_record(
+    record: discovery.RunRecord,
+    env_export_map: Mapping[str, EnvironmentExportConfig],
+) -> SelectionRecord:
+    env_export = _resolve_env_export(record.manifest_env_id, env_export_map)
+    combine_rollouts = bool(env_export.combine_rollouts)
+    identity = metadata.resolve_run_identity(record, combine_rollouts=combine_rollouts)
+    return SelectionRecord(
+        record=record,
+        identity=identity,
+        combine_rollouts=combine_rollouts,
+        extra_columns=_resolve_columns(env_export.extra_columns),
+        drop_columns=_resolve_columns(env_export.drop_columns),
+        answer_column=env_export.answer_column,
+    )
+
+
+def _raise_for_latest_invalid_selection(records: Sequence[SelectionRecord]) -> None:
+    latest_by_target: dict[tuple[str, str], SelectionRecord] = {}
+    for planned in records:
+        selection_key = (planned.identity.output_env_id, planned.record.job_id)
+        current = latest_by_target.get(selection_key)
+        if current is None or _run_sort_key(
+            _source_updated_at(planned.record),
+            planned.record.manifest.job_run_id,
+        ) > _run_sort_key(_source_updated_at(current.record), current.record.manifest.job_run_id):
+            latest_by_target[selection_key] = planned
+
+    invalid_latest = [planned for planned in latest_by_target.values() if not planned.identity.model_id]
+    if not invalid_latest:
+        return
+
+    failing = sorted(
+        invalid_latest,
+        key=lambda planned: (
+            planned.identity.output_env_id,
+            _run_sort_key(_source_updated_at(planned.record), planned.record.manifest.job_run_id),
+        ),
+    )[-1]
+    raise RuntimeError(metadata.format_missing_model_id_error(failing.record))
+
+
+def _select_latest_work_items(records: Sequence[SelectionRecord]) -> list[SelectionWorkItem]:
+    grouped: dict[tuple[str, str], dict[str, list[SelectionRecord]]] = {}
+    run_timestamps: dict[str, str] = {}
+
+    for planned in records:
+        identity = planned.identity
+        if not identity.model_id:
+            continue
+        group_key = (identity.model_id, identity.output_env_id)
+        grouped.setdefault(group_key, {}).setdefault(identity.job_run_id, []).append(planned)
+        run_timestamps.setdefault(identity.job_run_id, _source_updated_at(planned.record))
+
+    selected: list[SelectionWorkItem] = []
+    for _, run_groups in grouped.items():
+        latest_run_id = max(run_groups.keys(), key=lambda run_id: _run_sort_key(run_timestamps.get(run_id, ""), run_id))
+        latest_records = run_groups[latest_run_id]
+        representative = latest_records[0]
+        selected.append(
+            SelectionWorkItem(
+                identity=representative.identity,
+                records=list(latest_records),
+            )
+        )
+    return selected
+
+
+def _materialize_work_items(items: Sequence[SelectionWorkItem]) -> list[PlannedWorkItem]:
+    materialized: list[PlannedWorkItem] = []
+    for item in items:
+        records: list[PlannedRecord] = []
+        for selected in item.records:
+            normalized = metadata.load_normalized_metadata(
+                selected.record,
+                combine_rollouts=selected.combine_rollouts,
+            )
+            records.append(
+                PlannedRecord(
+                    normalized=normalized,
+                    extra_columns=selected.extra_columns,
+                    drop_columns=selected.drop_columns,
+                    answer_column=selected.answer_column,
+                )
+            )
+        materialized.append(PlannedWorkItem(identity=records[0].normalized.identity, records=records))
+    return materialized
+
+
+def _apply_exclusions(
+    work_items: Sequence[PlannedWorkItem],
+    *,
+    exclude_datasets: Sequence[str],
+    exclude_models: Sequence[str],
+) -> tuple[list[PlannedWorkItem], int]:
+    exclude_dataset_set = normalize_dataset_ids(exclude_datasets, label="process exclude dataset")
+    exclude_model_set = normalize_model_ids(exclude_models, label="process exclude model")
+    filtered: list[PlannedWorkItem] = []
+    skipped = 0
+    for item in work_items:
+        if exclude_dataset_set and _env_is_excluded(item.identity.output_env_id, exclude_dataset_set):
+            skipped += 1
+            continue
+        if exclude_model_set and model_is_excluded(item.identity.model_id, exclude_model_set):
+            skipped += 1
+            continue
+        filtered.append(item)
+    return filtered, skipped
+
+
+def _validate_replace_targets(work_items: Sequence[PlannedWorkItem], options: ProcessOptions) -> None:
+    if not options.replace_models and not options.replace_envs:
+        return
+
+    if options.replace_models:
+        matched_models = {
+            item.identity.model_id for item in work_items if item.identity.model_id in options.replace_models
+        }
+        if not matched_models:
+            raise RuntimeError(
+                "No selected processed outputs match --replace-model values: "
+                f"{', '.join(sorted(options.replace_models))}."
+            )
+    if options.replace_envs:
+        matched_envs = {
+            item.identity.output_env_id for item in work_items if item.identity.output_env_id in options.replace_envs
+        }
+        if not matched_envs:
+            raise RuntimeError(
+                f"No selected processed outputs match --replace-env values: {', '.join(sorted(options.replace_envs))}."
+            )
+    if options.replace_models and options.replace_envs:
+        intersection = [
+            item
+            for item in work_items
+            if item.identity.model_id in options.replace_models and item.identity.output_env_id in options.replace_envs
+        ]
+        if not intersection:
+            raise RuntimeError(
+                "No selected processed outputs match the intersection of --replace-model and --replace-env."
+            )
+
+
+def _apply_additive_delta(
+    work_items: Sequence[PlannedWorkItem],
+    *,
+    options: ProcessOptions,
+    index_files: Mapping[str, Mapping[str, Any]],
+) -> tuple[list[PlannedWorkItem], int]:
+    if options.clean:
+        return list(work_items), 0
+
+    filtered: list[PlannedWorkItem] = []
+    skipped = 0
+    for item in work_items:
+        output_path = writer.build_output_path(
+            options.output_dir,
+            model_id=item.identity.model_id,
+            env_id=item.identity.output_env_id,
+        )
+        if not output_path.exists():
+            filtered.append(item)
+            continue
+        if _should_replace_existing_output(item.identity, options):
+            filtered.append(item)
+            continue
+        parquet_metadata = _read_existing_output_metadata(output_path)
+        _validate_existing_output_integrity(
+            output_path,
+            output_dir=options.output_dir,
+            index_files=index_files,
+            parquet_metadata=parquet_metadata,
+        )
+        if not _existing_output_matches_selected_runs(item, parquet_metadata):
+            filtered.append(item)
+            continue
+        skipped += 1
+    return filtered, skipped
+
+
+def _should_replace_existing_output(identity: RunIdentity, options: ProcessOptions) -> bool:
+    if options.clean:
+        return True
+    has_model_filter = bool(options.replace_models)
+    has_env_filter = bool(options.replace_envs)
+    if not has_model_filter and not has_env_filter:
+        return False
+    if has_model_filter and has_env_filter:
+        return identity.model_id in options.replace_models and identity.output_env_id in options.replace_envs
+    if has_model_filter:
+        return identity.model_id in options.replace_models
+    return identity.output_env_id in options.replace_envs
+
+
+def _read_existing_output_metadata(output_path: Path) -> pq.FileMetaData:
+    try:
+        metadata_obj = pq.ParquetFile(output_path).metadata
+    except Exception as exc:  # noqa: BLE001
+        raise RuntimeError(
+            f"Existing processed output {output_path} is unreadable. "
+            "Rebuild it with --replace-model/--replace-env or re-run with --clean."
+        ) from exc
+
+    if metadata_obj is None:
+        raise RuntimeError(
+            f"Existing processed output {output_path} is missing parquet footer metadata. "
+            "Rebuild it with --replace-model/--replace-env or re-run with --clean."
+        )
+    return metadata_obj
+
+
+def _validate_existing_output_integrity(
+    output_path: Path,
+    *,
+    output_dir: Path,
+    index_files: Mapping[str, Mapping[str, Any]],
+    parquet_metadata: pq.FileMetaData | None = None,
+) -> None:
+    metadata_obj = parquet_metadata or _read_existing_output_metadata(output_path)
+
+    rel_key = output_path.relative_to(output_dir).as_posix()
+    index_entry = index_files.get(rel_key)
+    if not isinstance(index_entry, Mapping):
+        return
+    expected_row_count = index_entry.get("row_count")
+    if expected_row_count is None:
+        return
+    try:
+        expected = int(expected_row_count)
+    except (TypeError, ValueError):
+        return
+    actual = int(metadata_obj.num_rows)
+    if actual != expected:
+        raise RuntimeError(
+            f"Existing processed output {output_path} has {actual} parquet rows but env_index.json records {expected}. "
+            "Rebuild it with --replace-model/--replace-env or re-run with --clean."
+        )
+
+
+def _existing_output_matches_selected_runs(item: PlannedWorkItem, parquet_metadata: pq.FileMetaData) -> bool:
+    existing_run_ids = _extract_exporter_source_runs(parquet_metadata)
+    if existing_run_ids is None:
+        return False
+    selected_run_ids = {planned.normalized.record.manifest.job_run_id for planned in item.records}
+    return existing_run_ids == selected_run_ids
+
+
+def _extract_exporter_source_runs(parquet_metadata: pq.FileMetaData) -> set[str] | None:
+    metadata_map = parquet_metadata.metadata
+    if not metadata_map:
+        return None
+    payload = metadata_map.get(EXPORTER_METADATA_KEY)
+    if not payload:
+        return None
+    try:
+        exporter_metadata = json.loads(payload.decode("utf-8"))
+    except Exception:  # noqa: BLE001
+        return None
+    source_runs = exporter_metadata.get("source_runs")
+    if not isinstance(source_runs, list):
+        return None
+    run_ids = {str(run_id).strip() for run_id in source_runs if str(run_id).strip()}
+    return run_ids or None
+
+
 def _print_records_table(
     discovered: Sequence[discovery.RunRecord],
     selected: Sequence[discovery.RunRecord],
-    only_complete_runs: bool,
+    max_results_missing_pct: float,
     *,
     exclude_datasets: Sequence[str] = (),
     exclude_models: Sequence[str] = (),
+    skipped_by_delta: int = 0,
+    skipped_by_exclusion: int = 0,
 ) -> None:
     """Pretty-print job discovery vs planned processing."""
     exclude_set = normalize_dataset_ids(exclude_datasets, label="process exclude dataset")
@@ -355,71 +635,74 @@ def _print_records_table(
     eligible_discovered = [
         rec
         for rec in discovered
-        if (not only_complete_runs or _manifest_is_complete(rec.manifest))
-        and not (exclude_set and _record_is_excluded(rec, exclude_set))
+        if not (exclude_set and _record_is_excluded(rec, exclude_set))
         and not (exclude_model_set and _record_model_is_excluded(rec, exclude_model_set))
     ]
     total_by_model: dict[str, int] = {}
     completed_by_model: dict[str, int] = {}
     selected_by_model: dict[str, int] = {}
-    completed_statuses = {"completed", "succeeded", "success"}
     for rec in eligible_discovered:
         model_id = rec.model_id or "unknown"
         total_by_model[model_id] = total_by_model.get(model_id, 0) + 1
-        if (rec.status or "").lower() in completed_statuses:
+        if (rec.status or "").lower() in PROCESS_DEFAULT_STATUS_FILTER:
             completed_by_model[model_id] = completed_by_model.get(model_id, 0) + 1
     for rec in selected:
         model_id = rec.model_id or "unknown"
         selected_by_model[model_id] = selected_by_model.get(model_id, 0) + 1
 
     models = sorted(set(total_by_model.keys()) | set(selected_by_model.keys()))
-    selected_models = sorted(m for m, c in selected_by_model.items() if c > 0)
-    discovered_jobs_total = sum(total_by_model.get(m, 0) for m in models)
-    selected_jobs_total = sum(selected_by_model.get(m, 0) for m in models)
+    selected_models = sorted(model_id for model_id, count in selected_by_model.items() if count > 0)
+    discovered_jobs_total = sum(total_by_model.get(model_id, 0) for model_id in models)
+    selected_jobs_total = sum(selected_by_model.get(model_id, 0) for model_id in models)
 
     try:
         from rich.console import Console
-        from rich.table import Table
         from rich.markup import escape
+        from rich.table import Table
     except Exception:
-        suffix = " (complete runs only)" if only_complete_runs else ""
         logger.info(
-            "Processing %d job(s) across %d model(s)%s (found %d job(s) across %d model(s)).",
+            "Processing %d job(s) across %d model(s) (max_results_missing_pct=%s; found %d eligible job(s) across %d model(s)); "
+            "excluded=%d existing=%d.",
             selected_jobs_total,
             len(selected_models),
-            suffix,
+            _format_missing_pct(max_results_missing_pct),
             discovered_jobs_total,
             len(models),
+            skipped_by_exclusion,
+            skipped_by_delta,
         )
         for model_id in models:
-            comp = completed_by_model.get(model_id, 0)
-            tot = total_by_model.get(model_id, 0)
-            sel = selected_by_model.get(model_id, 0)
-            logger.info("  - %s: selected=%d; %d/%d completed", model_id, sel, comp, tot)
+            completed = completed_by_model.get(model_id, 0)
+            total = total_by_model.get(model_id, 0)
+            selected_count = selected_by_model.get(model_id, 0)
+            logger.info("  - %s: selected=%d; %d/%d completed", model_id, selected_count, completed, total)
         return
 
     console = Console()
-    title = f"Processing {selected_jobs_total} job(s) across {len(selected_models)} model(s)"
-    if only_complete_runs:
-        title += " (complete runs only)"
-    found_suffix = "after filters" if (exclude_set or only_complete_runs) else "pre-aggregation"
-    title += f" [dim](found {discovered_jobs_total} job(s) across {len(models)} model(s); {found_suffix})[/dim]"
+    title = (
+        f"Processing {selected_jobs_total} job(s) across {len(selected_models)} model(s) "
+        f"[dim](max_results_missing_pct={_format_missing_pct(max_results_missing_pct)})[/dim]"
+    )
+    title += (
+        f" [dim](found {discovered_jobs_total} eligible job(s); excluded={skipped_by_exclusion}, "
+        f"existing={skipped_by_delta})[/dim]"
+    )
     table = Table(title=title, show_header=True, header_style="bold cyan", caption=None)
     table.add_column("Model", style="magenta")
     table.add_column("Jobs (completed/total)", style="green", justify="right")
     table.add_column("Selected", style="cyan", justify="right")
 
     for model_id in models:
-        comp = completed_by_model.get(model_id, 0)
-        tot = total_by_model.get(model_id, 0)
-        sel = selected_by_model.get(model_id, 0)
-        table.add_row(escape(str(model_id)), f"{comp}/{tot}", str(sel))
+        completed = completed_by_model.get(model_id, 0)
+        total = total_by_model.get(model_id, 0)
+        selected_count = selected_by_model.get(model_id, 0)
+        table.add_row(escape(str(model_id)), f"{completed}/{total}", str(selected_count))
 
     console.print(table)
 
 
-def _manifest_is_complete(manifest: discovery.RunManifestInfo) -> bool:
-    return not (manifest.summary_total_known and manifest.summary_completed != manifest.summary_total)
+def _format_missing_pct(value: float) -> str:
+    return f"{float(value):g}"
 
 
 def _record_is_excluded(record: discovery.RunRecord, exclude_set: set[str]) -> bool:
@@ -434,90 +717,152 @@ def _record_is_excluded(record: discovery.RunRecord, exclude_set: set[str]) -> b
 
 
 def _record_model_is_excluded(record: discovery.RunRecord, exclude_model_set: set[str]) -> bool:
-    model_id = str(record.model_id or "").strip()
-    return model_is_excluded(model_id, exclude_model_set)
+    return model_is_excluded(str(record.model_id or "").strip(), exclude_model_set)
 
 
-__all__ = ["ProcessOptions", "ProcessResult", "run_process"]
+def _validate_selected_results_completeness(
+    work_items: Sequence[PlannedWorkItem],
+    *,
+    max_results_missing_pct: float,
+) -> None:
+    missing_files: list[str] = []
+    violations: list[str] = []
+    ungateable = 0
+
+    for item in work_items:
+        for planned in item.records:
+            normalized = planned.normalized
+            record = normalized.record
+            if not record.results_path.exists():
+                missing_files.append(
+                    "model_id={model_id} output_env_id={output_env_id} manifest_env_id={manifest_env_id} "
+                    "job_run_id={job_run_id} job_id={job_id} results_path={results_path}".format(
+                        model_id=item.identity.model_id,
+                        output_env_id=item.identity.output_env_id,
+                        manifest_env_id=normalized.manifest_env_id,
+                        job_run_id=record.manifest.job_run_id,
+                        job_id=record.job_id,
+                        results_path=record.results_path,
+                    )
+                )
+                continue
+
+            expected_rows = _expected_results_rows(normalized)
+            observed_rows = _completeness_observed_rows(record, expected_rows=expected_rows, threshold=max_results_missing_pct)
+            if expected_rows is None or observed_rows is None:
+                ungateable += 1
+                continue
+
+            missing_pct = _results_missing_pct(expected_rows=expected_rows, observed_rows=observed_rows)
+            if missing_pct > max_results_missing_pct:
+                violations.append(
+                    "model_id={model_id} output_env_id={output_env_id} manifest_env_id={manifest_env_id} "
+                    "job_run_id={job_run_id} job_id={job_id} expected_rows={expected_rows} "
+                    "observed_rows={observed_rows} missing_pct={missing_pct:.2f} threshold={threshold:g}".format(
+                        model_id=item.identity.model_id,
+                        output_env_id=item.identity.output_env_id,
+                        manifest_env_id=normalized.manifest_env_id,
+                        job_run_id=record.manifest.job_run_id,
+                        job_id=record.job_id,
+                        expected_rows=expected_rows,
+                        observed_rows=observed_rows,
+                        missing_pct=missing_pct,
+                        threshold=float(max_results_missing_pct),
+                    )
+                )
 
+    if ungateable:
+        logger.warning(
+            "Results row completeness gate could not be applied to %d selected record(s) because expected_rows "
+            "(num_examples * rollouts_per_example) or manifest row_count was unknown.",
+            ungateable,
+        )
 
-def _process_env_group(
-    work_items: Sequence[_RecordWork],
-) -> tuple[list[AggregatedEnvRows], int]:
-    """Load and aggregate all rows for a single environment."""
-    row_buffer: list[dict[str, Any]] = []
-    for work in work_items:
-        row_batch = rows.load_rows(
-            work.normalized,
-            extra_columns=work.extra_columns,
-            drop_columns=work.drop_columns,
-            answer_column=work.answer_column,
+    if not missing_files and not violations:
+        return
+
+    message_parts: list[str] = []
+    if missing_files:
+        missing_lines = "\n".join(f"  - {line}" for line in missing_files)
+        message_parts.append("Selected records are missing results.jsonl files:\n" + missing_lines)
+    if violations:
+        violation_lines = "\n".join(f"  - {line}" for line in violations)
+        message_parts.append(
+            "Selected records exceeded --max-results-missing-pct based on manifest row_count and expected rows:\n"
+            + violation_lines
         )
-        row_buffer.extend(row_batch)
-    aggregated = aggregate.aggregate_rows_by_env(
-        row_buffer,
-    )
-    return aggregated, len(row_buffer)
+    raise RuntimeError("\n\n".join(message_parts))
 
 
-def _source_updated_at(record: discovery.RunRecord) -> str:
-    return record.manifest.updated_at or record.manifest.created_at or ""
+def _expected_results_rows(normalized: metadata.NormalizedMetadata) -> int | None:
+    num_examples = normalized.num_examples
+    rollouts_per_example = normalized.rollouts_per_example
+    if num_examples is None or rollouts_per_example is None:
+        return None
+    if num_examples == -1:
+        return None
+    if num_examples <= 0 or rollouts_per_example <= 0:
+        return None
+    return int(num_examples) * int(rollouts_per_example)
 
 
-def _filter_env_groups_by_delta(
-    env_groups: Sequence[_EnvGroupSelection],
-    index_runs: Mapping[str, Mapping[str, Any]],
-    index_files: Mapping[str, Mapping[str, Any]],
+def _results_missing_pct(*, expected_rows: int, observed_rows: int) -> float:
+    if expected_rows <= 0:
+        return 0.0
+    missing_rows = max(int(expected_rows) - max(int(observed_rows), 0), 0)
+    return 100.0 * missing_rows / int(expected_rows)
+
+
+def _completeness_observed_rows(
+    record: discovery.RunRecord,
     *,
-    output_dir: Path,
-) -> list[_EnvGroupSelection]:
-    filtered: list[_EnvGroupSelection] = []
-    for group in env_groups:
-        expected_path = writer.build_output_path(output_dir, model_id=group.model_key, env_id=group.env_key)
-        expected_rel = expected_path.relative_to(output_dir).as_posix()
-        prior_file = index_files.get(expected_rel, {})
-        if not prior_file:
-            filtered.append(group)
-            continue
-        prior_updated_at = str(prior_file.get("updated_at") or prior_file.get("created_at") or "")
-        if group.job_run_id not in index_runs:
-            filtered.append(group)
-            continue
-        if _is_newer_timestamp(group.run_timestamp, prior_updated_at):
-            filtered.append(group)
-            continue
-    return filtered
+    expected_rows: int | None,
+    threshold: float,
+) -> int | None:
+    observed_rows = record.row_count
+    if expected_rows is None or observed_rows is None:
+        return observed_rows
+
+    missing_pct = _results_missing_pct(expected_rows=expected_rows, observed_rows=observed_rows)
+    if missing_pct <= threshold:
+        return observed_rows
+
+    actual_rows = count_jsonl_rows(record.results_path)
+    if actual_rows is None or actual_rows == observed_rows:
+        return observed_rows
+
+    logger.warning(
+        "Manifest row_count mismatch for process input "
+        "(job_run_id=%s, job_id=%s, results_path=%s): manifest row_count=%s actual_rows=%s. "
+        "Using actual_rows for completeness validation.",
+        record.manifest.job_run_id,
+        record.job_id,
+        record.results_path,
+        observed_rows,
+        actual_rows,
+    )
+    return actual_rows
 
 
-def _filter_env_groups_by_exclusion(
-    env_groups: Sequence[_EnvGroupSelection],
-    exclude_datasets: Sequence[str],
-) -> list[_EnvGroupSelection]:
-    exclude_set = normalize_dataset_ids(exclude_datasets, label="process exclude dataset")
-    if not exclude_set:
-        return list(env_groups)
-    filtered: list[_EnvGroupSelection] = []
-    for group in env_groups:
-        if _env_is_excluded(str(group.env_key or ""), exclude_set):
-            continue
-        filtered.append(group)
-    return filtered
+def _process_env_group(item: PlannedWorkItem) -> tuple[list[AggregatedEnvRows], int]:
+    """Load and aggregate all rows for a single selected dataset."""
+    row_buffer: list[dict[str, Any]] = []
+    identities: list[RunIdentity] = []
+    for planned in item.records:
+        row_batch = rows.load_rows(
+            planned.normalized,
+            extra_columns=planned.extra_columns,
+            drop_columns=planned.drop_columns,
+            answer_column=planned.answer_column,
+        )
+        row_buffer.extend(row_batch)
+        identities.append(planned.normalized.identity)
+    aggregated = aggregate.aggregate_rows_by_env(row_buffer, identities=identities)
+    return aggregated, len(row_buffer)
 
 
-def _filter_env_groups_by_model_exclusion(
-    env_groups: Sequence[_EnvGroupSelection],
-    exclude_models: Sequence[str],
-) -> list[_EnvGroupSelection]:
-    exclude_set = normalize_model_ids(exclude_models, label="process exclude model")
-    if not exclude_set:
-        return list(env_groups)
-    filtered: list[_EnvGroupSelection] = []
-    for group in env_groups:
-        model_id = str(group.model_key or "").strip()
-        if model_is_excluded(model_id, exclude_set):
-            continue
-        filtered.append(group)
-    return filtered
+def _source_updated_at(record: discovery.RunRecord) -> str:
+    return record.manifest.updated_at or record.manifest.created_at or ""
 
 
 def _env_is_excluded(env_id: str, exclude_set: set[str]) -> bool:
@@ -526,19 +871,6 @@ def _env_is_excluded(env_id: str, exclude_set: set[str]) -> bool:
     return dataset_is_excluded(env_identifier, exclude_set, base_dataset_id=base_env_id)
 
 
-def _is_newer_timestamp(current: str, prior: str) -> bool:
-    if not prior:
-        return True if current else False
-    if not current:
-        return False
-    try:
-        current_dt = datetime.fromisoformat(current.replace("Z", "+00:00"))
-        prior_dt = datetime.fromisoformat(prior.replace("Z", "+00:00"))
-    except Exception:
-        return current != prior
-    return current_dt > prior_dt
-
-
 def _strip_env_group_rows(group: AggregatedEnvRows) -> AggregatedEnvRows:
     return AggregatedEnvRows(
         env_id=group.env_id,
@@ -550,72 +882,6 @@ def _strip_env_group_rows(group: AggregatedEnvRows) -> AggregatedEnvRows:
     )
 
 
-def _normalize_records(
-    records: Sequence[discovery.RunRecord],
-    env_export_map: Mapping[str, EnvironmentExportConfig],
-) -> list[_NormalizedRecord]:
-    normalized_records: list[_NormalizedRecord] = []
-    for record in records:
-        env_export = _resolve_env_export(record.manifest_env_id, env_export_map)
-        extra_columns = _resolve_columns(env_export.extra_columns if env_export else ())
-        drop_columns = _resolve_columns(env_export.drop_columns if env_export else ())
-        answer_column = env_export.answer_column if env_export else None
-
-        normalized = metadata.load_normalized_metadata(record)
-        model_id = normalized.model_id
-        if not model_id:
-            raise RuntimeError(
-                "Missing model_id for run "
-                f"(job_run_id={record.manifest.job_run_id}, job_id={record.job_id}, "
-                f"results_dir={record.results_dir}, manifest={record.manifest.manifest_path})"
-            )
-
-        env_key = normalized.base_env_id or normalized.manifest_env_id or record.manifest_env_id or record.job_id
-        normalized_records.append(
-            _NormalizedRecord(
-                record=record,
-                normalized=normalized,
-                extra_columns=extra_columns,
-                drop_columns=drop_columns,
-                answer_column=answer_column,
-                model_key=model_id,
-                env_key=env_key,
-                job_run_id=record.manifest.job_run_id,
-                run_timestamp=_source_updated_at(record),
-            )
-        )
-    return normalized_records
-
-
-def _select_latest_env_groups(
-    records: Sequence[_NormalizedRecord],
-) -> list[_EnvGroupSelection]:
-    env_groups: dict[tuple[str, str], dict[str, list[_NormalizedRecord]]] = {}
-    run_timestamps: dict[str, str] = {}
-    for record in records:
-        env_groups.setdefault((record.model_key, record.env_key), {}).setdefault(record.job_run_id, []).append(record)
-        run_timestamps.setdefault(record.job_run_id, record.run_timestamp)
-
-    selected: list[_EnvGroupSelection] = []
-    for (model_key, env_key), run_groups in env_groups.items():
-        if not run_groups:
-            continue
-        latest_run_id = max(
-            run_groups.keys(),
-            key=lambda run_id: _run_sort_key(run_timestamps.get(run_id, ""), run_id),
-        )
-        selected.append(
-            _EnvGroupSelection(
-                model_key=model_key,
-                env_key=env_key,
-                job_run_id=latest_run_id,
-                run_timestamp=run_timestamps.get(latest_run_id, ""),
-                records=run_groups[latest_run_id],
-            )
-        )
-    return selected
-
-
 def _run_sort_key(timestamp: str, job_run_id: str) -> tuple[int, datetime, str]:
     if not timestamp:
         return (0, datetime.min.replace(tzinfo=UTC), job_run_id)
@@ -626,21 +892,13 @@ def _run_sort_key(timestamp: str, job_run_id: str) -> tuple[int, datetime, str]:
         return (0, datetime.min.replace(tzinfo=UTC), job_run_id)
 
 
-def _confirm_clean_process(
-    output_dir: Path,
-    *,
-    assume_yes: bool,
-    is_tty: bool,
-    prompt_func: Callable[[str], str] | None,
-) -> None:
-    if assume_yes:
-        return
-    if not is_tty or prompt_func is None:
-        raise RuntimeError("Refusing to clean processed outputs without confirmation. Re-run with --yes to confirm.")
-    prompt = f"--clean will delete all contents of {output_dir} and rebuild from runs. Type 'clean' to continue: "
-    try:
-        response = prompt_func(prompt).strip().lower()
-    except (EOFError, KeyboardInterrupt):  # noqa: PERF203
-        raise RuntimeError("Aborted clean process.") from None
-    if response != "clean":
-        raise RuntimeError("Aborted clean process.")
+__all__ = [
+    "PROCESS_DEFAULT_STATUS_FILTER",
+    "PlannedRecord",
+    "PlannedWorkItem",
+    "ProcessOptions",
+    "ProcessResult",
+    "SelectionResult",
+    "run_process",
+    "select_work_items",
+]
diff --git a/medarc_verifiers/cli/process/rows.py b/medarc_verifiers/cli/process/rows.py
index d06cb1e4..e27896a7 100644
--- a/medarc_verifiers/cli/process/rows.py
+++ b/medarc_verifiers/cli/process/rows.py
@@ -28,84 +28,118 @@ def load_rows(
     """Load results.jsonl rows and attach manifest metadata."""
     record = metadata.record
     if not record.has_results:
-        logger.debug("Run %s missing results.jsonl; skipping.", record.job_id)
-        return []
+        raise FileNotFoundError(
+            "Missing results.jsonl for selected run "
+            f"(job_run_id={record.manifest.job_run_id}, job_id={record.job_id}, path={record.results_path})"
+        )
 
     results_path = record.results_path
     extras_keys = {column for column in extra_columns or () if column}
     drop = {column for column in drop_columns or () if column}
     drop.update(DEFAULT_DROP_COLUMNS)
     drop.update(PROMPT_COMPLETION_COLUMNS)
+    decoded_rows, example_counts = _decode_results_jsonl(results_path)
+    multi_rollout = _detect_multi_rollout_shape(example_counts)
+    version_info_json = _encode_metadata_json_column(metadata.raw_metadata.get("version_info"))
+
+    rows: list[dict[str, Any]] = []
+    seen_per_example: dict[Any, int] = {}
+    for line_number, payload in decoded_rows:
+        cleaned, extras = _clean_payload_row(
+            payload,
+            extras_keys=extras_keys,
+            drop=drop,
+            answer_column=answer_column,
+        )
+        rollout_index = _resolve_rollout_index(
+            payload,
+            metadata,
+            multi_rollout=multi_rollout,
+            seen_per_example=seen_per_example,
+        )
+        if extras_keys and extras:
+            cleaned["extras"] = json.dumps(extras, sort_keys=True)
+        else:
+            cleaned["extras"] = None
+        enriched = _attach_row_metadata(
+            cleaned,
+            metadata,
+            line_number=line_number,
+            rollout_index=rollout_index,
+            version_info_json=version_info_json,
+        )
+        rows.append(enriched)
 
-    # First pass: decode and clean rows, and count example_id occurrences to
-    # detect multiple rollouts within a single JSONL (example_id repetition).
+    return rows
+
+
+def _decode_results_jsonl(path: Path) -> tuple[list[tuple[int, Mapping[str, Any]]], dict[Any, int]]:
+    """Decode results.jsonl and count example_id occurrences for rollout detection."""
     decoded_rows: list[tuple[int, Mapping[str, Any]]] = []
     example_counts: dict[Any, int] = {}
     try:
-        with results_path.open("r", encoding="utf-8") as handle:
+        with path.open("r", encoding="utf-8") as handle:
             for line_number, raw_line in enumerate(handle, start=1):
                 line = raw_line.strip()
                 if not line:
                     continue
-                payload = _decode_line(line, results_path, line_number)
+                payload = _decode_line(line, path, line_number)
                 decoded_rows.append((line_number, payload))
                 ex_id = payload.get("example_id")
-                # Count occurrences to infer intra-file rollout structure.
                 try:
                     example_counts[ex_id] = example_counts.get(ex_id, 0) + 1
                 except TypeError:
-                    # Non-hashable example_id shouldn't happen (schema requires
-                    # primitive), but guard just in case.
                     pass
     except ValueError:
         raise
     except OSError as exc:  # noqa: FBT003
-        logger.warning("Failed to read %s: %s", results_path, exc)
-        return []
+        logger.warning("Failed to read %s: %s", path, exc)
+        return [], {}
+    return decoded_rows, example_counts
 
-    multi_rollout = any(count > 1 for count in example_counts.values())
-    version_info_json = _encode_metadata_json_column(metadata.raw_metadata.get("version_info"))
 
-    # Second pass: enrich rows. If the file contains multiple rollouts, compute
-    # a data-driven rollout_index by counting seen occurrences per example_id.
-    # Otherwise, retain the suffix/dir-derived rollout_index from metadata.
-    rows: list[dict[str, Any]] = []
-    seen_per_example: dict[Any, int] = {}
-    for line_number, payload in decoded_rows:
-        extras = _extract_extras(payload, extras_keys=extras_keys)
-        cleaned = _clean_row(payload, drop=drop, extras_keys=extras_keys)
-        cleaned.pop("rollout_index", None)
-        _map_answer_column(cleaned, payload, answer_column=answer_column)
-        _flatten_token_usage(cleaned)
-        payload_rollout_index = _coerce_rollout_index(payload.get("rollout_index"))
-        if payload_rollout_index is not None:
-            rollout_index = payload_rollout_index
-            cleaned["rollout_index"] = payload_rollout_index
-        elif multi_rollout:
-            ex_id = payload.get("example_id")
-            try:
-                seen = seen_per_example.get(ex_id, 0)
-                rollout_index = seen  # 0-based occurrence index
-                seen_per_example[ex_id] = seen + 1
-            except TypeError:
-                # Fallback to metadata rollout_index if example_id is unusable as key
-                rollout_index = metadata.rollout_index
-        else:
-            rollout_index = metadata.rollout_index
-        if extras_keys and extras:
-            cleaned["extras"] = json.dumps(extras, sort_keys=True)
-        else:
-            cleaned["extras"] = None
-        enriched = _attach_metadata(
-            cleaned,
-            metadata,
-            line_number=line_number,
-            rollout_index=rollout_index,
-            version_info_json=version_info_json,
-        )
-        rows.append(enriched)
+def _detect_multi_rollout_shape(example_counts: Mapping[Any, int]) -> bool:
+    return any(count > 1 for count in example_counts.values())
 
-    return rows
+
+def _clean_payload_row(
+    payload: Mapping[str, Any],
+    *,
+    extras_keys: set[str],
+    drop: set[str],
+    answer_column: str | None,
+) -> tuple[MutableMapping[str, Any], Mapping[str, Any]]:
+    extras = _extract_extras(payload, extras_keys=extras_keys)
+    cleaned = _clean_row(payload, drop=drop, extras_keys=extras_keys)
+    cleaned.pop("rollout_index", None)
+    _map_answer_column(cleaned, payload, answer_column=answer_column)
+    _normalize_token_usage(cleaned)
+    payload_rollout_index = _coerce_rollout_index(payload.get("rollout_index"))
+    if payload_rollout_index is not None:
+        cleaned["rollout_index"] = payload_rollout_index
+    return cleaned, extras
+
+
+def _resolve_rollout_index(
+    payload: Mapping[str, Any],
+    metadata: NormalizedMetadata,
+    *,
+    multi_rollout: bool,
+    seen_per_example: MutableMapping[Any, int],
+) -> int:
+    payload_rollout_index = _coerce_rollout_index(payload.get("rollout_index"))
+    if payload_rollout_index is not None:
+        return payload_rollout_index
+    if not multi_rollout:
+        return metadata.rollout_index
+
+    ex_id = payload.get("example_id")
+    try:
+        seen = seen_per_example.get(ex_id, 0)
+        seen_per_example[ex_id] = seen + 1
+        return seen
+    except TypeError:
+        return metadata.rollout_index
 
 
 def _map_answer_column(
@@ -202,7 +236,7 @@ def _coerce_rollout_index(value: Any) -> int | None:
     return None
 
 
-def _attach_metadata(
+def _attach_row_metadata(
     row: MutableMapping[str, Any],
     metadata: NormalizedMetadata,
     *,
@@ -211,19 +245,18 @@ def _attach_metadata(
     version_info_json: str | None,
 ) -> MutableMapping[str, Any]:
     record = metadata.record
+    identity = metadata.identity
 
     error_value = record.reason if record.status == "failed" else None
 
-    env_identifier = metadata.base_env_id or metadata.manifest_env_id
-
     row.update(
         {
-            "env_id": env_identifier,
-            "manifest_env_id": metadata.manifest_env_id,
-            "base_env_id": metadata.base_env_id,
+            "env_id": identity.output_env_id,
+            "manifest_env_id": identity.manifest_env_id,
+            "base_env_id": identity.base_env_id,
             "job_run_id": record.manifest.job_run_id,
             "run_id": record.job_id,
-            "model_id": metadata.model_id,
+            "model_id": identity.model_id,
             "version_info": version_info_json,
             "status": record.status,
             "error": error_value,
@@ -236,7 +269,7 @@ def _attach_metadata(
     return row
 
 
-def _flatten_token_usage(row: MutableMapping[str, Any]) -> None:
+def _normalize_token_usage(row: MutableMapping[str, Any]) -> None:
     """Flatten token_usage dict into explicit columns and drop the original field."""
     if "token_usage" not in row:
         return
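
Reviewer note on the rollout-index refactor above: the precedence in `_resolve_rollout_index` (explicit payload value, then per-example occurrence count, then the metadata fallback) can be checked in isolation. The sketch below is a simplified stand-in, not the project's code: `resolve_rollout_index`, `fallback_index`, and the plain `isinstance` check are assumptions that replace `NormalizedMetadata` and `_coerce_rollout_index`, whose real behavior is not shown in this diff.

```python
from typing import Any, Mapping, MutableMapping


def resolve_rollout_index(
    payload: Mapping[str, Any],
    *,
    fallback_index: int,
    multi_rollout: bool,
    seen_per_example: MutableMapping[Any, int],
) -> int:
    """Hypothetical, simplified mirror of the precedence in _resolve_rollout_index."""
    explicit = payload.get("rollout_index")
    # Assumption: only a plain int counts as an explicit index here.
    if isinstance(explicit, int) and not isinstance(explicit, bool):
        return explicit
    if not multi_rollout:
        return fallback_index
    ex_id = payload.get("example_id")
    try:
        # 0-based occurrence index of this example_id within the file.
        seen = seen_per_example.get(ex_id, 0)
        seen_per_example[ex_id] = seen + 1
        return seen
    except TypeError:
        # Unhashable example_id: fall back to the metadata-derived index.
        return fallback_index


rows = [
    {"example_id": "q1"},
    {"example_id": "q2"},
    {"example_id": "q1"},
    {"example_id": "q1", "rollout_index": 7},
]

# First pass: count occurrences to detect a multi-rollout file.
counts: dict[Any, int] = {}
for row in rows:
    counts[row["example_id"]] = counts.get(row["example_id"], 0) + 1
multi = any(count > 1 for count in counts.values())

# Second pass: resolve one index per row.
seen: dict[Any, int] = {}
indices = [
    resolve_rollout_index(row, fallback_index=0, multi_rollout=multi, seen_per_example=seen)
    for row in rows
]
```

Note that a row with an explicit `rollout_index` returns early and does not advance the per-example counter, which matches the early return in the refactored helper.
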
diff --git a/medarc_verifiers/cli/process/workspace.py b/medarc_verifiers/cli/process/workspace.py
index 20254104..d5669ff5 100644
--- a/medarc_verifiers/cli/process/workspace.py
+++ b/medarc_verifiers/cli/process/workspace.py
@@ -3,14 +3,17 @@
 from __future__ import annotations
 
 import json
+import logging
 import shutil
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Callable, Iterable, Sequence
 
-from medarc_verifiers.cli.hf import HFSyncConfig, download_hf_repo
+from medarc_verifiers.cli.hf import HFSyncConfig, compute_pending_parquet_uploads, download_hf_repo
 from medarc_verifiers.utils.pathing import resolve_under
 
+logger = logging.getLogger(__name__)
+
 
 @dataclass(slots=True)
 class BaselineResult:
@@ -18,9 +21,16 @@ class BaselineResult:
     files_copied: list[Path] = field(default_factory=list)
     files_overwritten: list[Path] = field(default_factory=list)
     files_skipped: list[Path] = field(default_factory=list)
+    pending_parquet_uploads: set[str] = field(default_factory=set)
     snapshot_dir: Path | None = None
 
 
+@dataclass(slots=True)
+class WorkspacePreparationResult:
+    cleaned: bool = False
+    baseline_result: BaselineResult | None = None
+
+
 def ensure_output_dir(output_dir: Path) -> None:
     output_dir.mkdir(parents=True, exist_ok=True)
 
@@ -33,6 +43,37 @@ def is_nonempty_dir(path: Path) -> bool:
     return False
 
 
+def prepare_output_workspace(
+    *,
+    output_dir: Path,
+    hf_config: HFSyncConfig | None,
+    pull_policy: str | None,
+    clean: bool,
+    assume_yes: bool,
+    is_tty: bool,
+    prompt_func: Callable[[str], str] | None = None,
+) -> WorkspacePreparationResult:
+    """Prepare local processed outputs before selection reads local inventory state."""
+    ensure_output_dir(output_dir)
+
+    if clean:
+        confirm_clean_output_dir(output_dir, assume_yes=assume_yes, is_tty=is_tty, prompt_func=prompt_func)
+        clear_output_dir(output_dir)
+        return WorkspacePreparationResult(cleaned=True)
+
+    if hf_config and hf_config.repo_id:
+        baseline_result = prepare_hf_baseline(
+            output_dir=output_dir,
+            hf_config=hf_config,
+            pull_policy=pull_policy,
+            is_tty=is_tty,
+            prompt_func=prompt_func,
+        )
+        return WorkspacePreparationResult(cleaned=False, baseline_result=baseline_result)
+
+    return WorkspacePreparationResult(cleaned=False)
+
+
 def prepare_hf_baseline(
     *,
     output_dir: Path,
@@ -47,8 +88,10 @@ def prepare_hf_baseline(
         return BaselineResult(policy="local")
 
     policy = _resolve_pull_policy(pull_policy, is_tty=is_tty)
-    result = BaselineResult(policy=policy)
     if not is_nonempty_dir(output_dir):
+        if policy == "continue-upload":
+            logger.warning("HF continue-upload requested with an empty output dir; falling back to pull.")
+        result = BaselineResult(policy="pull" if policy == "continue-upload" else policy)
         snapshot_dir = download_hf_repo(
             repo_id=hf_config.repo_id,
             branch=hf_config.branch,
@@ -61,10 +104,36 @@ def prepare_hf_baseline(
         _copy_snapshot(snapshot_dir, output_dir, result, overwrite=True)
         return result
 
+    result = BaselineResult(policy=policy)
+    if policy == "prompt":
+        try:
+            result.pending_parquet_uploads = compute_pending_parquet_uploads(
+                output_dir=output_dir,
+                repo_id=hf_config.repo_id,
+                branch=hf_config.branch,
+                token=hf_config.token,
+            )
+        except Exception as exc:  # noqa: BLE001
+            logger.warning("HF upload recovery check failed before prompt; hiding upload option: %s", exc)
+    elif policy == "continue-upload":
+        try:
+            result.pending_parquet_uploads = compute_pending_parquet_uploads(
+                output_dir=output_dir,
+                repo_id=hf_config.repo_id,
+                branch=hf_config.branch,
+                token=hf_config.token,
+            )
+        except Exception as exc:  # noqa: BLE001
+            logger.warning("HF upload recovery check failed for continue-upload; uploading only current touched files: %s", exc)
+
     prompt_conflicts = False
     if policy == "prompt":
-        choice = _prompt_baseline_choice(prompt_func, is_tty=is_tty)
-        policy = choice
+        choice = _prompt_baseline_choice(
+            prompt_func,
+            is_tty=is_tty,
+            show_upload=bool(result.pending_parquet_uploads),
+        )
+        policy = "continue-upload" if choice == "upload" else choice
         result.policy = policy
         prompt_conflicts = policy == "pull"
 
@@ -104,26 +173,59 @@ def prepare_hf_baseline(
         )
         return result
 
+    if policy == "continue-upload":
+        return result
+
     raise ValueError(f"Unsupported HF pull policy: {policy}")
 
 
+def confirm_clean_output_dir(
+    output_dir: Path,
+    *,
+    assume_yes: bool,
+    is_tty: bool,
+    prompt_func: Callable[[str], str] | None,
+) -> None:
+    if assume_yes:
+        return
+    if not is_tty or prompt_func is None:
+        raise RuntimeError("Refusing to clean processed outputs without confirmation. Re-run with --yes to confirm.")
+    prompt = f"--clean will delete all contents of {output_dir} and rebuild from runs. Type 'clean' to continue: "
+    try:
+        response = prompt_func(prompt).strip().lower()
+    except (EOFError, KeyboardInterrupt):  # noqa: PERF203
+        raise RuntimeError("Aborted clean process.") from None
+    if response != "clean":
+        raise RuntimeError("Aborted clean process.")
+
+
 def _resolve_pull_policy(pull_policy: str | None, *, is_tty: bool) -> str:
     if pull_policy:
         return pull_policy
     return "prompt" if is_tty else "pull"
 
 
-def _prompt_baseline_choice(prompt_func: Callable[[str], str] | None, *, is_tty: bool) -> str:
+def _prompt_baseline_choice(
+    prompt_func: Callable[[str], str] | None,
+    *,
+    is_tty: bool,
+    show_upload: bool = False,
+) -> str:
     if not is_tty or prompt_func is None:
         return "pull"
+    choices = ["pull", "clean"]
+    if show_upload:
+        choices.append("upload")
     if prompt_func is not input:
-        prompt = (
+        prompt_lines = [
             "HF baseline exists locally.\n"
-            "  pull  -> download missing data without deleting local files\n"
-            "  clean -> redownload everything after deleting local files\n"
-            "Choose [pull/clean]: "
-        )
-        return _read_choice(prompt_func, prompt, {"pull", "clean"})
+            "  pull   -> download missing data without deleting local files\n"
+            "  clean  -> redownload everything after deleting local files\n"
+        ]
+        if show_upload:
+            prompt_lines.append("  upload -> keep local files and resume/upload pending HF artifacts\n")
+        prompt_lines.append(f"Choose [{'/'.join(choices)}]: ")
+        return _read_choice(prompt_func, "".join(prompt_lines), choices)
     from rich.console import Console
     from rich.prompt import Prompt
 
@@ -131,7 +233,12 @@ def _prompt_baseline_choice(prompt_func: Callable[[str], str] | None, *, is_tty:
     console.print("[bold yellow]HF baseline exists locally.[/bold yellow]")
     console.print("  [cyan]pull[/cyan]  -> download missing data without deleting local files")
     console.print("  [cyan]clean[/cyan] -> redownload everything after deleting local files")
-    return Prompt.ask("Choose", choices=["pull", "clean"], default="pull")
+    if show_upload:
+        console.print("  [cyan]upload[/cyan] -> keep local files and resume/upload pending HF artifacts")
+    try:
+        return Prompt.ask("Choose", choices=choices, default="pull")
+    except (EOFError, KeyboardInterrupt):  # noqa: PERF203
+        raise RuntimeError("Aborted HF baseline selection.") from None
 
 
 def _prompt_overwrite_file(prompt_func: Callable[[str], str] | None, *, path: Path, is_tty: bool) -> bool:
@@ -147,7 +254,7 @@ def _read_choice(prompt_func: Callable[[str], str], prompt: str, choices: Sequen
     while True:
         try:
             response = prompt_func(prompt).strip().lower()
-        except EOFError:  # noqa: PERF203
+        except (EOFError, KeyboardInterrupt):  # noqa: PERF203
             raise RuntimeError("Aborted HF baseline selection.") from None
         if response in choices_set:
             return response
@@ -228,8 +335,11 @@ def clear_output_dir(output_dir: Path) -> None:
 
 __all__ = [
     "BaselineResult",
+    "WorkspacePreparationResult",
     "clear_output_dir",
+    "confirm_clean_output_dir",
     "ensure_output_dir",
     "is_nonempty_dir",
+    "prepare_output_workspace",
     "prepare_hf_baseline",
 ]
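
Reviewer note on `confirm_clean_output_dir`: because the prompt is injected through `prompt_func`, the confirmation gate can be exercised without a terminal. The following is a minimal sketch under that assumption; `confirm_clean` and `outcome` are hypothetical names that copy the control flow from the hunk above rather than importing the real module.

```python
from typing import Callable, Optional


def confirm_clean(
    target: str,
    *,
    assume_yes: bool,
    is_tty: bool,
    prompt_func: Optional[Callable[[str], str]],
) -> None:
    """Hypothetical copy of the confirmation gate, for illustration only."""
    if assume_yes:
        return
    if not is_tty or prompt_func is None:
        raise RuntimeError("Refusing to clean without confirmation.")
    try:
        response = prompt_func(f"Type 'clean' to delete {target}: ").strip().lower()
    except (EOFError, KeyboardInterrupt):
        raise RuntimeError("Aborted clean process.") from None
    if response != "clean":
        raise RuntimeError("Aborted clean process.")


def outcome(**kwargs) -> str:
    """Report whether the gate let the clean proceed."""
    try:
        confirm_clean("runs/processed", **kwargs)
    except RuntimeError as exc:
        return f"refused: {exc}"
    return "confirmed"


def interrupted(_prompt: str) -> str:
    # Simulates the user closing stdin at the prompt.
    raise EOFError


results = {
    "yes_flag": outcome(assume_yes=True, is_tty=False, prompt_func=None),
    "typed_clean": outcome(assume_yes=False, is_tty=True, prompt_func=lambda _: " CLEAN "),
    "typed_other": outcome(assume_yes=False, is_tty=True, prompt_func=lambda _: "yes"),
    "no_tty": outcome(assume_yes=False, is_tty=False, prompt_func=lambda _: "clean"),
    "eof": outcome(assume_yes=False, is_tty=True, prompt_func=interrupted),
}
```

The non-TTY case is refused even when the injected prompt would have answered `clean`, so scripted callers need the `--yes` path.
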
diff --git a/medarc_verifiers/cli/process/writer.py b/medarc_verifiers/cli/process/writer.py
index 1b9d4d55..a9256cdb 100644
--- a/medarc_verifiers/cli/process/writer.py
+++ b/medarc_verifiers/cli/process/writer.py
@@ -51,7 +51,7 @@
 EXPECTED_POLARS_DTYPES: dict[str, pl.DataType] = {
     "env_id": pl.String,
     "error": pl.String,
-    "example_id": pl.Int64,
+    "example_id": pl.String,
     "answer": pl.String,
     "extras": pl.String,
     "generation_ms": pl.Float64,
@@ -79,7 +79,7 @@
     [
         pa.field("env_id", pa.large_string()),
         pa.field("error", pa.large_string()),
-        pa.field("example_id", pa.int64()),
+        pa.field("example_id", pa.large_string()),
         pa.field("answer", pa.large_string()),
         pa.field("extras", pa.large_string()),
         pa.field("generation_ms", pa.float64()),
diff --git a/medarc_verifiers/cli/winrate/api.py b/medarc_verifiers/cli/winrate/api.py
index d44ad2df..88f375cb 100644
--- a/medarc_verifiers/cli/winrate/api.py
+++ b/medarc_verifiers/cli/winrate/api.py
@@ -64,6 +64,28 @@ class ModelCentricResult:
     datasets: dict[str, dict[str, Any]]
 
 
+@dataclass(slots=True)
+class DatasetModelMissingness:
+    """Missing reward coverage for one (dataset, model) pair."""
+
+    dataset: str
+    model: str
+    expected_n: int
+    present_nonnull_n: int
+    missing_count: int
+    missing_pct: float
+
+
+@dataclass(slots=True)
+class MissingnessSummary:
+    """Aggregate missingness summary across retained datasets."""
+
+    n_pairs_total: int
+    n_pairs_with_missing: int
+    missing_cells_total: int
+    worst_offenders: list[DatasetModelMissingness]
+
+
 def read_dataset_lazy(
     parquet_path: Path | str | Sequence[Path | str | PLDataFrame | PLLazyFrame] | PLDataFrame | PLLazyFrame,
 ) -> pl.LazyFrame:
@@ -288,6 +310,7 @@ def compute_winrates(
     n_questions_by_ds: dict[str, int] = {}
     models_by_ds: dict[str, list[str]] = {}
     models_present_by_ds: dict[str, set[str]] = {}
+    missingness_by_ds: dict[str, list[DatasetModelMissingness]] = {}
     seen_models: set[str] = set()
     seen_model_case_map: dict[str, str] = {}
 
@@ -300,7 +323,7 @@ def compute_winrates(
         dataset_iter = datasets
 
     for dataset_name, parquet_path in dataset_iter:
-        stats, models_present = _process_dataset(
+        stats, models_present, missingness = _process_dataset(
             dataset_name,
             parquet_path,
             cfg,
@@ -329,6 +352,7 @@ def compute_winrates(
         avg_rewards_by_dataset[dataset_name] = stats.avg_reward_per_model
         n_questions_by_ds[dataset_name] = stats.n_questions
         models_by_ds[dataset_name] = stats.models
+        missingness_by_ds[dataset_name] = missingness
 
     if not known_model_set:
         if include_set:
@@ -349,6 +373,7 @@ def compute_winrates(
         per_dataset_model_means=per_dataset_model_means,
         avg_rewards_by_dataset=avg_rewards_by_dataset,
         models_by_ds=models_by_ds,
+        missingness_by_ds=missingness_by_ds,
         include_map=include_map,
         seen_model_case_map=seen_model_case_map,
     )
@@ -373,6 +398,7 @@ def compute_winrates(
                 avg_rewards_by_dataset.pop(dataset_name, None)
                 n_questions_by_ds.pop(dataset_name, None)
                 models_by_ds.pop(dataset_name, None)
+                missingness_by_ds.pop(dataset_name, None)
             if not per_dataset_pairwise:
                 _raise_user_error(
                     "No datasets remain after enforcing dataset_coverage=all-models. "
@@ -385,6 +411,8 @@ def compute_winrates(
                 coverage=dataset_coverage,
             )
 
+    _emit_missingness_report(_summarize_missingness(missingness_by_ds))
+
     return build_model_centric_result(
         per_dataset_pairwise=per_dataset_pairwise,
         per_dataset_model_means=per_dataset_model_means,
@@ -583,7 +611,7 @@ def _process_dataset(
     include_map: Mapping[str, str],
     seen_model_case_map: Mapping[str, str],
     partial_datasets: str,
-) -> tuple[DatasetStats | None, list[str]]:
+) -> tuple[DatasetStats | None, list[str], list[DatasetModelMissingness]]:
     """Read and process a dataset, raising on failure and honoring selection policies."""
     try:
         lf = read_dataset_lazy(parquet_path)
@@ -599,7 +627,7 @@ def _process_dataset(
             if missing_required and partial_datasets == "strict":
                 missing_labels = [include_map.get(model, model) for model in missing_required]
                 _emit_note(f"Dropping dataset {dataset_name} (missing include models: {missing_labels}).")
-                return None, models_present
+                return None, models_present, []
 
         if include_set:
             models_filtered = [models_present_map[model] for model in target_models if model in models_present_map]
@@ -648,6 +676,7 @@ def canonical_label(normalized_id: str) -> str:
             else:
                 pairwise[key] = (1.0 - wr, n_used)
         avg_reward_per_model = _mean_reward_per_model(df_avg, allowed=models)
+        missingness = _compute_dataset_missingness(dataset_name, df_filtered, models)
         return (
             DatasetStats(
                 pairwise=pairwise,
@@ -656,12 +685,54 @@ def canonical_label(normalized_id: str) -> str:
                 avg_reward_per_model=avg_reward_per_model,
             ),
             models_present,
+            missingness,
         )
     except Exception as exc:  # noqa: BLE001
         message = f"Failed to process dataset {dataset_name} at {_format_parquet_source(parquet_path)}: {exc}"
         _raise_user_error(message, exc)
 
 
+def _compute_dataset_missingness(
+    dataset_name: str,
+    df_avg: pl.DataFrame,
+    models: Sequence[str],
+) -> list[DatasetModelMissingness]:
+    deduped_models = list(dict.fromkeys(str(model) for model in models))
+    if not deduped_models:
+        return []
+
+    expected_n = 0
+    present_nonnull_by_model: dict[str, int] = {}
+    if not df_avg.is_empty() and EXAMPLE_ID_COL in df_avg.columns:
+        expected_n = int(df_avg.select(pl.col(EXAMPLE_ID_COL).n_unique()).item())
+        if MODEL_COL in df_avg.columns:
+            grouped = (
+                df_avg.filter(pl.col("reward_mean").is_not_null())
+                .group_by(MODEL_COL)  # type: ignore[arg-type]
+                .agg(pl.col(EXAMPLE_ID_COL).n_unique().alias("present_nonnull_n"))
+            )
+            present_nonnull_by_model = {
+                str(model): int(present_nonnull or 0) for model, present_nonnull in grouped.iter_rows()
+            }
+
+    missingness: list[DatasetModelMissingness] = []
+    for model in deduped_models:
+        present_nonnull_n = max(present_nonnull_by_model.get(model, 0), 0)
+        missing_count = max(expected_n - present_nonnull_n, 0)
+        missing_pct = (100.0 * missing_count / expected_n) if expected_n > 0 else 0.0
+        missingness.append(
+            DatasetModelMissingness(
+                dataset=dataset_name,
+                model=model,
+                expected_n=expected_n,
+                present_nonnull_n=present_nonnull_n,
+                missing_count=missing_count,
+                missing_pct=missing_pct,
+            )
+        )
+    return missingness
+
+
 def _mean_reward_per_model(df_avg: pl.DataFrame, allowed: Sequence[str] | None = None) -> dict[str, float | None]:
     """Average reward_mean per model inside a dataset."""
     if df_avg.is_empty() or MODEL_COL not in df_avg.columns:
@@ -745,6 +816,7 @@ def _canonicalize_dataset_model_labels(
     per_dataset_model_means: dict[str, dict[str, float]],
     avg_rewards_by_dataset: dict[str, dict[str, float | None]],
     models_by_ds: dict[str, list[str]],
+    missingness_by_ds: dict[str, list[DatasetModelMissingness]],
     include_map: Mapping[str, str],
     seen_model_case_map: Mapping[str, str],
 ) -> None:
@@ -806,6 +878,70 @@ def canonical(value: str) -> str:
             deduped.append(canonical_model)
         models_by_ds[dataset] = deduped
 
+    for dataset, rows in list(missingness_by_ds.items()):
+        canonical_rows: list[DatasetModelMissingness] = []
+        for row in rows:
+            canonical_rows.append(
+                DatasetModelMissingness(
+                    dataset=row.dataset,
+                    model=canonical(row.model),
+                    expected_n=row.expected_n,
+                    present_nonnull_n=row.present_nonnull_n,
+                    missing_count=row.missing_count,
+                    missing_pct=row.missing_pct,
+                )
+            )
+        missingness_by_ds[dataset] = canonical_rows
+
+
+def _summarize_missingness(
+    missingness_by_ds: Mapping[str, Sequence[DatasetModelMissingness]],
+) -> MissingnessSummary:
+    rows = [row for dataset_rows in missingness_by_ds.values() for row in dataset_rows]
+    rows_with_missing = [row for row in rows if row.missing_count > 0]
+    worst_offenders = sorted(
+        rows_with_missing,
+        key=lambda row: (-row.missing_pct, -row.missing_count, row.dataset, row.model),
+    )[:10]
+    return MissingnessSummary(
+        n_pairs_total=len(rows),
+        n_pairs_with_missing=len(rows_with_missing),
+        missing_cells_total=sum(row.missing_count for row in rows),
+        worst_offenders=worst_offenders,
+    )
+
+
+def _emit_missingness_report(summary: MissingnessSummary) -> None:
+    logger.info(
+        "Winrate missingness summary: n_pairs_total=%d n_pairs_with_missing=%d missing_cells_total=%d",
+        summary.n_pairs_total,
+        summary.n_pairs_with_missing,
+        summary.missing_cells_total,
+    )
+    console = _get_console()
+    if not console or not getattr(console, "is_terminal", False) or not summary.worst_offenders:
+        return
+    try:
+        from rich.table import Table
+    except Exception:
+        return
+
+    table = Table(title="Winrate missingness (top offenders)")
+    table.add_column("dataset", style="cyan")
+    table.add_column("model", style="magenta")
+    table.add_column("missing", justify="right")
+    table.add_column("expected", justify="right")
+    table.add_column("missing %", justify="right")
+    for row in summary.worst_offenders:
+        table.add_row(
+            row.dataset,
+            row.model,
+            str(row.missing_count),
+            str(row.expected_n),
+            f"{row.missing_pct:.1f}",
+        )
+    console.print(table)
+
 
 def _format_parquet_source(
     parquet_path: Path | str | Sequence[Path | str] | PLDataFrame | PLLazyFrame,
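
Reviewer note on the missingness arithmetic above: `expected_n` is the number of distinct example ids in the dataset across all models, and each model's coverage is the number of distinct example ids with a non-null reward. The sketch below restates that calculation in plain Python so the numbers can be checked by hand. It is an illustration only: `Missingness`, `compute_missingness`, and the tuple-based row layout are assumptions, and the real code operates on a polars frame.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Missingness:
    model: str
    expected_n: int
    present_nonnull_n: int
    missing_count: int
    missing_pct: float


def compute_missingness(
    rows: Sequence[tuple[str, str, Optional[float]]],
    models: Sequence[str],
) -> list[Missingness]:
    """rows are (example_id, model, reward_mean) tuples; a hypothetical layout."""
    # Expected coverage: every distinct example id seen for any model.
    expected_n = len({example_id for example_id, _, _ in rows})
    present: dict[str, set[str]] = {}
    for example_id, model, reward in rows:
        if reward is not None:
            present.setdefault(model, set()).add(example_id)
    report = []
    for model in dict.fromkeys(models):  # de-duplicate, keep order
        n = len(present.get(model, ()))
        missing = max(expected_n - n, 0)
        pct = 100.0 * missing / expected_n if expected_n > 0 else 0.0
        report.append(Missingness(model, expected_n, n, missing, pct))
    return report


rows = [
    ("q1", "model-a", 1.0),
    ("q2", "model-a", 0.0),
    ("q3", "model-a", None),  # present but null: counts as missing
    ("q4", "model-a", 0.5),
    ("q1", "model-b", 1.0),
]
report = compute_missingness(rows, ["model-a", "model-b", "model-c"])

# Same ordering idea as the summary: highest missing percentage first.
worst = sorted(report, key=lambda r: (-r.missing_pct, -r.missing_count, r.model))
```

A model that is requested but has no rows at all (`model-c` here) is reported as fully missing rather than being dropped.
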
diff --git a/medarc_verifiers/parsers/xml_parser.py b/medarc_verifiers/parsers/xml_parser.py
index 6a1176fc..6eb1f2e6 100644
--- a/medarc_verifiers/parsers/xml_parser.py
+++ b/medarc_verifiers/parsers/xml_parser.py
@@ -61,6 +61,25 @@ def parse(self, completion: Messages | str, strip: bool = True, last: bool = Fal
                 return parsed
         return None
 
+    def parse_answer(self, completion: Messages | str) -> str | None:
+        """Extract the last answer field from a completion."""
+        if isinstance(completion, str):
+            parsed = self.parse(completion, last=True)
+            if parsed is not None and hasattr(parsed, self.answer_field):
+                value = getattr(parsed, self.answer_field)
+                if value is not None:
+                    return value
+            return None
+
+        for msg in reversed(self.get_assistant_messages(completion)):
+            content = str(msg.get("content", ""))
+            parsed = self.parse(content, last=True)
+            if parsed is not None and hasattr(parsed, self.answer_field):
+                value = getattr(parsed, self.answer_field)
+                if value is not None:
+                    return value
+        return None
+
     def _has_any_field(self, parsed: Any) -> bool:
         for _, alternatives in self._fields:
             for alt in alternatives:
diff --git a/medarc_verifiers/rewards/multiple_choice_accuracy.py b/medarc_verifiers/rewards/multiple_choice_accuracy.py
index 71e123a8..cdee4780 100644
--- a/medarc_verifiers/rewards/multiple_choice_accuracy.py
+++ b/medarc_verifiers/rewards/multiple_choice_accuracy.py
@@ -1,76 +1,190 @@
-"""
-LLM multiple-choice question accuracy reward.
+"""MCQ raw-text grading with tail-authoritative long-response handling."""
 
-Main use case: Handle models that either return the letter/number (preferred)
-or return the entire answer text verbatim (fallback).
-
-Supports chain-of-thought by prioritizing anchored patterns like "answer is X"
-before falling back to last token or text matching. Attempts to recognize
-negations to avoid false positives (e.g., "the answer is not C").
-"""
+from __future__ import annotations
 
 import re
 import unicodedata
 from dataclasses import dataclass
+from functools import lru_cache
 from typing import Optional
 
 
+# Responses longer than this many characters are graded in long mode, where the tail is authoritative.
+LONG_RESPONSE_THRESHOLD_CHARS = 4_000
+# Long-mode explicit-answer and answer-text scans are limited to this terminal slice.
+TERMINAL_WINDOW_CHARS = 4_000
+# The looser last-token fallback only inspects this shorter tail inside the terminal slice.
+STRONG_TAIL_WINDOW_CHARS = 2_000
+# Local ambiguity checks can look this far backward from a candidate.
+LOCAL_CONTEXT_BEFORE_CHARS = 160
+# Local ambiguity checks can look this far forward from a candidate.
+LOCAL_CONTEXT_AFTER_CHARS = 240
+# Tail-choice fallback is only allowed when the trailing segment has at most this many words.
+TAIL_CHOICE_MAX_WORDS = 16
+
+_UNICODE_PUNCT_TRANSLATIONS = str.maketrans(
+    {
+        "\u00a0": " ",
+        "\u2010": "-",
+        "\u2011": "-",
+        "\u2012": "-",
+        "\u2013": "-",
+        "\u2014": "-",
+        "\u2015": "-",
+        "\u2212": "-",
+        "\u2018": "'",
+        "\u2019": "'",
+        "\u201c": '"',
+        "\u201d": '"',
+    }
+)
+
+_WHITESPACE_RE = re.compile(r"\s+")
+_LIKELY_TEX_RE = re.compile(r"\\[A-Za-z]+|\\[$\\()\[\]{}]|[$]")
+_THINK_OPEN_RE = re.compile(r"<\s*think\b[^>]*>", re.IGNORECASE)
+_THINK_CLOSE_RE = re.compile(r"</\s*think\s*>", re.IGNORECASE)
+_ANSWER_TAG_RE = re.compile(r"</?\s*answer\s*>", re.IGNORECASE)
+
+# Any standalone option-like token. This is intentionally broad and gets filtered by
+# local ambiguity checks before it can count as a chosen answer.
+_OPTION_TOKEN_RE = re.compile(r"(?<![\w+\-/])(?P<opt>[A-Za-z]|\d{1,2})(?![\w+\-/])", re.IGNORECASE)
+# Anchored cues that usually indicate the model is committing to a final answer.
+_ANCHOR_RE = re.compile(
+    r"(?P<label>\bfinal\s+answer\b|\bthe\s+correct\s+answer\b|\bcorrect\s+answer\b|\bthe\s+answer\b|\banswer\b|\btherefore\b|\bi\s+choose\b)"
+    r"\s*[:\-]?\s*(?:is\s+)?(?P<neg>not\s+|isn't\s+|isnt\s+)?(?:(?:option|choice)\s+)?"
+    r"(?:[*_`~]+\s*)*(?:\\boxed\{\s*)?[\(\[\{<【]*\s*(?P<opt>[A-Za-z]|\d{1,2})\s*"
+    r"[\)\]\}>】]*\s*(?:\}\s*)?(?:[*_`~]+\s*)?(?![\w+\-/])",
+    re.IGNORECASE,
+)
+# Option-led lines like "B. Answer text" or "**(2)** Answer text".
+_LEADING_OPTION_RE = re.compile(
+    r"^\s*(?:>\s*)?(?:(?:[-*+]\s+)|(?:\d{1,3}[.)]\s+))?"
+    r"(?:[*_`~]+\s*)?(?:\\boxed\{\s*)?[\(\[\{<【]*\s*(?P<opt>[A-Za-z]|\d{1,2})\s*"
+    r"[\)\]\}>】]*\s*(?:\}\s*)?\s*(?:[).:\-])?\s*(?:[*_`~]+\s*)*\s+(?P<rest>.+?)\s*$",
+    re.IGNORECASE,
+)
+_SENTENCE_OPTION_START_RE = re.compile(
+    r"^\s*(?:>\s*)?(?:(?:[-*+]\s+)|(?:\d{1,3}[.)]\s+))?"
+    r"(?:[*_`~]+\s*)*(?:\\boxed\{\s*)?[\(\[\{<【]*\s*(?P<opt>[A-Za-z]|\d{1,2})\s*"
+    r"[\)\]\}>】]*\s*(?:\}\s*)?\s*(?:[).:\-])",
+    re.IGNORECASE,
+)
+_EXACT_OPTION_RE = re.compile(r"^\s*(?:(?:option|choice)\s+)?(?P<opt>[A-Za-z]|\d{1,2})\s*[.!?]?\s*$", re.IGNORECASE)
+_TERMINAL_OPTION_LINE_RE = re.compile(
+    r"^\s*(?:>\s*)?(?:(?:[-*+]\s+)|(?:\d{1,3}[.)]\s+))?"
+    r"(?:[*_`~]+\s*)?(?:(?:option|choice)\s+)?(?:\\boxed\{\s*)?[\(\[\{<【]*\s*(?P<opt>[A-Za-z]|\d{1,2})\s*"
+    r"[\)\]\}>】]*\s*(?:\}\s*)?(?:[*_`~]+\s*)?\s*[.!?]?\s*$",
+    re.IGNORECASE,
+)
+# Used by the tail-choice fallback after a short trailing segment has been extracted.
+_TAIL_CHOICE_OPTION_RE = re.compile(
+    r"(?<![\w+\-/])(?:(?:option|choice)\s+)?(?:\\boxed\{\s*)?[\(\[\{<【]*\s*(?P<opt>[A-Za-z]|\d{1,2})\s*"
+    r"[\)\]\}>】]*\s*(?:\}\s*)?(?:[*_`~]+\s*)?\s*[.!?]?\s*$",
+    re.IGNORECASE,
+)
+_NEGATION_PREFIX_RE = re.compile(r"\b(?:not|isn't|isnt|wrong|incorrect|false)\b(?:\s+\w+){0,3}\s*$", re.IGNORECASE)
+_BULLET_OR_LIST_LINE_RE = re.compile(r"^\s*(?:>\s*)?(?:[-*+]\s+|\d{1,3}[.)]\s+)")
+
+_OUTER_WRAPPER_PAIRS = (
+    ('"', '"'),
+    ("'", "'"),
+    ("\u201c", "\u201d"),
+    ("\u2018", "\u2019"),
+    ("(", ")"),
+    ("[", "]"),
+    ("{", "}"),
+    ("<", ">"),
+    ("【", "】"),
+)
+_OUTER_MARKERS = ("**", "__", "*", "_", "`")
+_AFTER_REJECTION_PREFIXES = (
+    " is incorrect",
+    " is wrong",
+    " is false",
+    " is not correct",
+    " isn't correct",
+    " isnt correct",
+)
+_CONTRAST_HINTS = (" but ", " however ", " instead ", " actually ", " rather ")
+_COMPACT_OPTION_CONNECTORS = {"and", "or", "ou", "y", "e", "nor", "plus", "versus", "vs", "instead"}
+
+
 @dataclass
 class MCQAccuracyResult:
-    """Result of multiple-choice accuracy grading."""
+    """Detailed MCQ grading result."""
 
     is_correct: bool
-    """Whether the answer was graded as correct."""
+    method: str
+    matched_answer: Optional[str] = None
+    correct_answer: Optional[str] = None
+
 
+@dataclass
+class _Candidate:
+    """Normalized option candidate extracted from some region of the response."""
+
+    token: str
+    start: int
+    end: int
     method: str
-    """Method used for grading: 'direct_answer', 'anchored_token', 'last_token', 'answer_text', or 'none'."""
 
-    matched_answer: Optional[str] = None
-    """The extracted answer if found, otherwise None."""
 
-    correct_answer: Optional[str] = None
-    """The correct answer for reference, if available."""
+def normalize_for_structure(text: str) -> str:
+    """Canonicalize structure while preserving line breaks and token boundaries."""
+    text = unicodedata.normalize("NFKC", text or "")
+    text = text.translate(_UNICODE_PUNCT_TRANSLATIONS)
+    return text.casefold()
+
+
+def normalize_for_match(text: str) -> str:
+    """Canonicalize text for exact answer-text comparisons."""
+    return _WHITESPACE_RE.sub(" ", normalize_for_structure(text)).strip()
 
 
-def _nfkc_casefold(text: str) -> str:
-    """Unicode normalize + casefold for robust text comparison."""
-    return unicodedata.normalize("NFKC", text or "").casefold()
+def normalize_for_answer_text_match(text: str) -> str:
+    """Canonicalize answer text for matching: strip outer wrappers, normalize, and drop trailing punctuation."""
+    text = normalize_for_match(_strip_outer_wrappers(text))
+    return text.rstrip(".,:;!?").strip()
 
 
-def _normalize_spaces(text: str) -> str:
-    """Collapse multiple whitespace to single space."""
-    return re.sub(r"\s+", " ", text).strip()
+def _answer_text_supports_fallback(answer_text: str) -> bool:
+    """Reserve answer-text fallback for real text, not bare option labels like `A` or `2`."""
+    return bool(answer_text) and _norm_option(answer_text) is None
+
+
+@lru_cache(maxsize=1)
+def _latex_to_text_converter():
+    """Lazily construct the LaTeX-to-text converter used by `_strip_tex()`."""
+    from pylatexenc.latex2text import LatexNodes2Text
+
+    return LatexNodes2Text(math_mode="text")
 
 
 def _strip_tex(text: str) -> str:
-    """Remove LaTeX formatting if pylatexenc is available."""
-    try:
-        from pylatexenc.latex2text import LatexNodes2Text
+    """Best-effort LaTeX cleanup, leaving the original text on any failure."""
+    if not text or not _LIKELY_TEX_RE.search(text):
+        return text
 
-        return LatexNodes2Text(math_mode="text").latex_to_text(text)
+    try:
+        return _latex_to_text_converter().latex_to_text(text)
     except Exception:
         return text
 
 
-def _norm_letter(letter: str) -> Optional[str]:
-    """Normalize a token to uppercase letter or digit string."""
-    letter = (letter or "").strip()
-    if not letter:
+def _norm_option(token: str) -> Optional[str]:
+    """Normalize a predicted option to uppercase letter or digit string."""
+    token = (token or "").strip()
+    if not token:
         return None
-    if letter.isdigit():
-        return letter
-    if letter.isalpha() and len(letter) == 1:
-        return letter.upper()
+    if token.isdigit():
+        return token
+    if token.isalpha() and len(token) == 1:
+        return token.upper()
     return None
 
 
-def _token_kind_matches_answer_letter(predicted: Optional[str], answer_letter: str) -> bool:
-    """Return True if predicted token type matches the task's option type.
-
-    This prevents cases like '<answer>20' in a letter-based task (answer_letter='C')
-    from being treated as an explicit option selection, which would incorrectly disable
-    answer_text fallback.
-    """
+def _option_kind_matches(predicted: Optional[str], answer_letter: str) -> bool:
+    """Require letter answers to match letters and numeric answers to match numbers."""
     if predicted is None:
         return False
     if answer_letter.isdigit():
@@ -78,131 +192,574 @@ def _token_kind_matches_answer_letter(predicted: Optional[str], answer_letter: s
     return predicted.isalpha()
 
 
-_THINK_OPEN_RE = re.compile(r"<think>", re.IGNORECASE)
-_THINK_CLOSE_RE = re.compile(r"</think>", re.IGNORECASE)
-_THINK_PAIR_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
-
-
-def _remove_think_tags(completion_text: str) -> str:
-    """Extract the answer section from completion text, handling think tags properly.
-
-    Behavior is intentionally conservative:
-    - If there is exactly one well-formed <think>...</think> pair AND no unclosed <think> later,
-      return everything after that closing tag.
-    - Otherwise, return the full response.
-    """
-    text = completion_text or ""
-
-    # Fast path: most outputs won't contain think tags.
-    # Some models emit an unpaired closing tag (</think>) and then the final answer.
-    # In that case, treat the closing tag as the end of reasoning and keep only the tail.
-    if _THINK_OPEN_RE.search(text) is None:
-        closes = list(_THINK_CLOSE_RE.finditer(text))
-        if closes:
-            return text[closes[-1].end() :].strip()
-        return text.strip()
-
-    # Count properly closed pairs, but stop early once we know there are 2+.
-    it = _THINK_PAIR_RE.finditer(text)
-    first = next(it, None)
-    if first is None:
-        return text.strip()
-    if next(it, None) is not None:
-        return text.strip()
-
-    return text[first.end() :].strip()
-
-
-# Anchored patterns like "final answer: C" or "the answer is D"
-ANCHOR_PATTERN = re.compile(
-    r"(?:\bfinal\s+answer\b|\banswer\b|\bans\b|\bchoice\b|\boption\b|\bselected\b|\bi\s+choose\b|\bi\s+pick\b|\btherefore\b|\bthus\b|\bso\b|\bconclusion\b|\bin\s+conclusion\b|\bmost\s+likely\b|\bbest[-\s]+supported\s+answer\b|<answer>)\s*"
-    r"[:\-–—]?\s*(?:is\s*)?(?P<neg>not\s+|isn['’]t\s+)?"
-    r"(?:[*_`~]+\s*)*"  # allow markdown wrappers before the option
-    r"\(?\s*(?P<opt>[A-Za-z]|\d{1,2})\s*\)?"  # option token, possibly parenthesized
-    r"\s*[\)\.:]?\s*"  # optional delimiter (e.g., 'B.' or 'B)')
-    r"(?:[*_`~]+\s*)*"  # allow markdown wrappers after the option
-    r"(?![\w+\-/])",
-    re.IGNORECASE,
-)
+def _result(
+    is_correct: bool,
+    method: str,
+    predicted: Optional[str],
+    actual: Optional[str],
+    return_details: bool,
+) -> bool | MCQAccuracyResult:
+    """Return either a bare boolean or the structured grading result."""
+    if not return_details:
+        return is_correct
+    return MCQAccuracyResult(
+        is_correct=is_correct,
+        method=method,
+        matched_answer=predicted,
+        correct_answer=actual,
+    )
+
+
+def _remove_think_tags(text: str) -> str:
+    """Return the text after the last `</think>`, or an empty string if a `<think>` is never closed."""
+    text = text or ""
+    last_close_end: Optional[int] = None
+    for match in _THINK_CLOSE_RE.finditer(text):
+        last_close_end = match.end()
+    if last_close_end is not None:
+        return text[last_close_end:].lstrip()
+    if _THINK_OPEN_RE.search(text):
+        return ""
+    return text
+
+
+def _strip_outer_wrappers(text: str) -> str:
+    """Peel simple answer wrappers like markdown, quotes, brackets, `\\boxed{...}`, or `<answer>` tags."""
+    text = (text or "").strip()
+    changed = True
+    while text and changed:
+        changed = False
+        lowered = text.lower()
+
+        # Strip explicit answer wrappers before more generic marker peeling.
+        if lowered[:8] == "<answer>" and lowered[-9:] == "</answer>":
+            text = text[8:-9].strip()
+            changed = True
+            continue
+
+        if lowered[:7] == "\\boxed{" and text.endswith("}"):
+            text = text[7:-1].strip()
+            changed = True
+            continue
+
+        for marker in _OUTER_MARKERS:
+            if text.startswith(marker) and text.endswith(marker) and len(text) > len(marker) * 2:
+                text = text[len(marker) : -len(marker)].strip()
+                changed = True
+                break
+        if changed:
+            continue
+
+        for opener, closer in _OUTER_WRAPPER_PAIRS:
+            if text.startswith(opener) and text.endswith(closer) and len(text) > len(opener) + len(closer):
+                text = text[len(opener) : -len(closer)].strip()
+                changed = True
+                break
 
+    return text
+
+
+def _line_bounds(text: str, start: int, end: int) -> tuple[int, int]:
+    """Return the line boundaries that contain the span `[start, end)`."""
+    line_start = text.rfind("\n", 0, start) + 1
+    line_end = text.find("\n", end)
+    if line_end == -1:
+        line_end = len(text)
+    return line_start, line_end
+
+
+def _previous_nonempty_line_start(text: str, line_start: int) -> int:
+    """Walk backward to the previous non-empty line start, if one exists."""
+    cursor = line_start
+    while cursor > 0:
+        prev_end = cursor - 1
+        prev_start = text.rfind("\n", 0, prev_end) + 1
+        if text[prev_start:prev_end].strip():
+            return prev_start
+        cursor = prev_start
+    return line_start
+
+
+def _next_nonempty_line_end(text: str, line_end: int) -> int:
+    """Walk forward to the next non-empty line end, if one exists."""
+    cursor = line_end
+    while cursor < len(text):
+        next_start = cursor + 1 if text[cursor] == "\n" else cursor
+        next_end = text.find("\n", next_start)
+        if next_end == -1:
+            next_end = len(text)
+        if text[next_start:next_end].strip():
+            return next_end
+        if next_end == len(text):
+            break
+        cursor = next_end
+    return line_end
+
+
+def _local_context(text: str, start: int, end: int) -> tuple[str, int, int]:
+    """Return a bounded local region around a candidate plus its relative offsets."""
+    line_start, line_end = _line_bounds(text, start, end)
+    context_start = _previous_nonempty_line_start(text, line_start)
+    context_end = _next_nonempty_line_end(text, line_end)
+    # Prefer whole nearby lines, then cap to fixed windows so long CoTs stay cheap.
+    context_start = max(context_start, start - LOCAL_CONTEXT_BEFORE_CHARS)
+    context_end = min(context_end, end + LOCAL_CONTEXT_AFTER_CHARS)
+    return text[context_start:context_end], start - context_start, end - context_start
+
+
+def _candidate_is_negated(context: str, rel_start: int, rel_end: int) -> bool:
+    """Detect local negation patterns that should invalidate a candidate option."""
+    prefix = context[max(0, rel_start - 48) : rel_start]
+    suffix = context[rel_end : min(len(context), rel_end + 40)]
+    prefix = normalize_for_match(prefix).rstrip(" ([{<【")
+    suffix = normalize_for_match(suffix)
+
+    if _NEGATION_PREFIX_RE.search(prefix):
+        return True
+    if prefix.endswith("rather than") or prefix.endswith("except"):
+        return True
+    if "wrong diagnosis is" in prefix[-32:] or "incorrect diagnosis is" in prefix[-32:]:
+        return True
+
+    for prefix_text in _AFTER_REJECTION_PREFIXES:
+        if suffix.startswith(prefix_text):
+            return True
+
+    return False
+
+
+def _looks_like_option_connector(between_norm: str) -> bool:
+    """Return True when the text between two options is just list/connector glue."""
+    between_norm = between_norm.strip()
+    if not between_norm:
+        return True
+
+    between_norm = re.sub(r"\b(?:option|choice)\b", " ", between_norm).strip()
+    stripped = between_norm.strip(",;:./&+()[]{}<>-\\ ")
+    if not stripped:
+        return True
+
+    return stripped in _COMPACT_OPTION_CONNECTORS
+
+
+def _is_harmless_option_match(text: str, match: re.Match[str]) -> bool:
+    """Ignore stray single-letter matches like pronoun `I` or apostrophe fragments."""
+    token = match.group("opt").casefold()
+    start = match.start("opt")
+    end = match.end("opt")
+
+    if token == "i":
+        before = text[start - 1] if start > 0 else " "
+        after = text[end] if end < len(text) else " "
+        if before in {" ", "\n", "\t", ",", ";", ".", "(", "["} and after in {
+            " ",
+            "\n",
+            "\t",
+            ",",
+            ";",
+            ".",
+            "!",
+            "?",
+            ")",
+            "]",
+        }:
+            return True
+    if token == "i" and start == 0:
+        return True
+    if start > 0 and text[start - 1] in {"'", "’"}:
+        return True
+    if end < len(text) and text[end] in {"'", "’"}:
+        return True
+    return False
+
+
+def _candidate_has_local_competing_option(
+    context: str, rel_start: int, rel_end: int, token: str, answer_letter: str
+) -> bool:
+    """Reject candidates that are locally entangled with another option token."""
+    selected_span = (rel_start, rel_end)
+    for match in _OPTION_TOKEN_RE.finditer(context):
+        if _is_harmless_option_match(context, match):
+            continue
+        other = _norm_option(match.group("opt"))
+        if other is None or not _option_kind_matches(other, answer_letter) or other == token:
+            continue
+
+        if match.end() <= selected_span[0]:
+            between = context[match.end() : selected_span[0]]
+        elif selected_span[1] <= match.start():
+            between = context[selected_span[1] : match.start()]
+        else:
+            continue
+
+        between_norm = normalize_for_match(between)
+        if len(between_norm) > 24:
+            continue
+        # Treat only very short glue like commas, "and", or "or" as true ambiguity.
+        if _looks_like_option_connector(between_norm):
+            return True
+
+    return False
+
+
+def _candidate_is_contradicted(context: str, rel_end: int, token: str, answer_letter: str) -> bool:
+    """Reject candidates that are immediately revised to a different option."""
+    suffix = normalize_for_match(context[rel_end : min(len(context), rel_end + 80)])
+    if not any(hint in suffix for hint in _CONTRAST_HINTS):
+        return False
 
-# Any letter/number token that looks like an option
-TOKEN_PATTERN = re.compile(r"(?<![\w+\-/])\(?\s*([A-Za-z]|\d{1,2})\s*[\)\.:]?(?![\w+\-/])", re.IGNORECASE)
+    for match in _OPTION_TOKEN_RE.finditer(suffix):
+        other = _norm_option(match.group("opt"))
+        if other is None or not _option_kind_matches(other, answer_letter):
+            continue
+        if other != token:
+            return True
+    return False
+
+
+def _candidate_is_valid(text: str, candidate: _Candidate, answer_letter: str) -> bool:
+    """Apply the local negation, ambiguity, and contradiction filters to a candidate."""
+    context, rel_start, rel_end = _local_context(text, candidate.start, candidate.end)
+    return not (
+        _candidate_is_negated(context, rel_start, rel_end)
+        or _candidate_has_local_competing_option(context, rel_start, rel_end, candidate.token, answer_letter)
+        or _candidate_is_contradicted(context, rel_end, candidate.token, answer_letter)
+    )
+
+
+def _extract_exact_option(text: str, answer_letter: str) -> Optional[str]:
+    """Accept responses that are exactly one standalone option token."""
+    stripped = _strip_outer_wrappers(text)
+    match = _EXACT_OPTION_RE.fullmatch(stripped)
+    if not match:
+        return None
+    predicted = _norm_option(match.group("opt"))
+    if predicted is None or not _option_kind_matches(predicted, answer_letter):
+        return None
+    return predicted
 
-# Leading option token like "B. Answer text" or "C) ..." at the start of the response
-LEADING_OPTION_PATTERN = re.compile(
-    r"^\s*(?:>\s*)?(?:(?:[-*+]\s+)|(?:\d{1,3}[.)]\s+))?\s*"  # blockquote / list prefixes
-    r"(?:[*_`~]+)?\s*\(?\s*([A-Za-z]|\d{1,2})\s*[\)\.:]\s*\)?\s*(?:[*_`~]+)?\s*(?!\w)",  # markdown wrappers
-    re.IGNORECASE,
-)
 
-# Negation words that invalidate nearby matches
-NEGATION_PATTERN = re.compile(r"\b(?:not|isn['’]t)\b", re.IGNORECASE)
+def _extract_exact_answer_text(text: str, answer_text: str) -> Optional[str]:
+    """Accept responses that are exactly the answer text after wrapper normalization."""
+    if not answer_text:
+        return None
+    stripped = _strip_outer_wrappers(text)
+    if normalize_for_answer_text_match(stripped) != answer_text:
+        return None
+    return answer_text
 
-# Negative-context phrases that indicate an option mention is NOT a selected answer
-NEGATIVE_AFTER_OPTION_PATTERN = re.compile(
-    r"^\s*(?:is|are|was|were)\s+(?:incorrect|wrong|false|not\s+correct)\b|^\s*not\s+correct\b",
-    re.IGNORECASE,
-)
 
-# Sentence boundary pattern - splits on period, exclamation, question mark, or newline
-# Handles both single newlines (for line breaks in CoT) and double newlines (paragraphs)
-SENTENCE_BOUNDARY = re.compile(r"[.!?]\s+|\n+")
+def _extract_exact_option_plus_text(text: str, answer_letter: str, answer_text: str) -> Optional[str]:
+    """Accept short option-led answers like `B. Correct answer text`."""
+    stripped = _strip_outer_wrappers(text)
+    match = _LEADING_OPTION_RE.fullmatch(stripped)
+    if not match:
+        return None
+    predicted = _norm_option(match.group("opt"))
+    if predicted is None or not _option_kind_matches(predicted, answer_letter):
+        return None
+    if normalize_for_answer_text_match(match.group("rest")) != answer_text:
+        return None
+    return predicted
+
 
+@lru_cache(maxsize=64)
+def _prefix_pattern(prefix_norm: str) -> re.Pattern[str]:
+    """Compile the caller-provided anchor prefix into the same option-capture shape."""
+    flexible_prefix = re.escape(prefix_norm).replace(r"\ ", r"\s+")
+    return re.compile(
+        rf"(?:^|(?<![a-z0-9])){flexible_prefix}\s*[:\-]?\s*(?:is\s+)?(?P<neg>not\s+|isn't\s+|isnt\s+)?"
+        rf"(?:(?:option|choice)\s+)?"
+        rf"(?:[*_`~]+\s*)*(?:\\boxed\{{\s*)?[\(\[\{{<【]*\s*(?P<opt>[A-Za-z]|\d{{1,2}})\s*"
+        rf"[\)\]\}}>】]*\s*(?:\}}\s*)?(?:[*_`~]+\s*)?(?![\w+\-/])",
+        re.IGNORECASE,
+    )
 
-def _get_sentence_containing_match(text: str, match: re.Match) -> str:
-    """Return (sentence_start, sentence_end, match_start, match_end) in the original text."""
-    if getattr(match.re, "groupindex", None) and "opt" in match.re.groupindex:
-        match_start, match_end = match.span("opt")
+
+def _latest_explicit_candidate(text: str, answer_letter: str, prefix: Optional[str]) -> Optional[_Candidate]:
+    """Return the latest valid anchored candidate, preferring a caller-specified prefix."""
+    if prefix:
+        prefix_norm = normalize_for_match(prefix)
+        if prefix_norm:
+            saw_prefix_match = False
+            latest_valid: Optional[_Candidate] = None
+            for match in _prefix_pattern(prefix_norm).finditer(text):
+                if not _prefix_match_has_standalone_start(text, match.start()):
+                    continue
+                saw_prefix_match = True
+                if match.groupdict().get("neg"):
+                    continue
+                token = _norm_option(match.group("opt"))
+                if token is None or not _option_kind_matches(token, answer_letter):
+                    continue
+                candidate = _Candidate(
+                    token=token,
+                    start=match.start("opt"),
+                    end=match.end("opt"),
+                    method="anchored_token",
+                )
+                if _candidate_is_valid(text, candidate, answer_letter):
+                    latest_valid = candidate
+            # If the caller supplied an explicit prefix, do not fall back to generic anchors
+            # once that prefix appears at all.
+            if saw_prefix_match:
+                return latest_valid
+
+    latest_valid = None
+    for match in _ANCHOR_RE.finditer(text):
+        if match.groupdict().get("neg"):
+            continue
+        token = _norm_option(match.group("opt"))
+        if token is None or not _option_kind_matches(token, answer_letter):
+            continue
+        candidate = _Candidate(token=token, start=match.start("opt"), end=match.end("opt"), method="anchored_token")
+        if _candidate_is_valid(text, candidate, answer_letter):
+            latest_valid = candidate
+
+    return latest_valid
+
+
+def _prefix_match_has_standalone_start(text: str, start: int) -> bool:
+    """Require prefix matches to start at a token boundary rather than inside a word."""
+    cursor = start - 1
+    while cursor >= 0 and text[cursor].isspace():
+        cursor -= 1
+    return cursor < 0 or not text[cursor].isalnum()
+
+
+def _leading_option_candidate(text: str, answer_letter: str, answer_text: str) -> Optional[_Candidate]:
+    """Parse a short option-led answer that starts with the selected option token."""
+    source = text
+    offset = 0
+    if "\n" in text:
+        # For multi-line responses, only trust the final non-empty line as a leading-option answer.
+        source = _last_nonempty_line(text)
+        if not source:
+            return None
+        offset = text.rfind(source)
+        match = _LEADING_OPTION_RE.match(source)
     else:
-        try:
-            match_start, match_end = match.span(1)
-        except Exception:
-            match_start, match_end = match.span()
+        match = _LEADING_OPTION_RE.match(source)
+        if not match:
+            source = _last_nonempty_line(text)
+            if not source:
+                return None
+            offset = text.rfind(source)
+            match = _LEADING_OPTION_RE.match(source)
+    if not match:
+        return None
+
+    token = _norm_option(match.group("opt"))
+    if token is None or not _option_kind_matches(token, answer_letter):
+        return None
+
+    # Plain prose like "I think B works" should not be treated as an option-led format.
+    separator = source[match.end("opt") : match.start("rest")]
+    rest = match.group("rest").lstrip()
+    if not any(char in separator for char in ")]}>】.:-*_`~\\") and not rest.startswith(
+        ("(", "[", "{", "<", "【", '"', "'", "\\boxed{")
+    ):
+        return None
+
+    # Reject enumerated multi-option payloads like "A. ...\nD. ...".
+    if _contains_multiple_option_led_sentences(text, answer_letter):
+        return None
 
-    boundaries_before = [m.end() for m in SENTENCE_BOUNDARY.finditer(text[:match_start])]
-    boundaries_after = [m.start() for m in SENTENCE_BOUNDARY.finditer(text[match_end:])]
+    candidate = _Candidate(
+        token=token,
+        start=offset + match.start("opt"),
+        end=offset + match.end("opt"),
+        method="anchored_token",
+    )
+    if not _candidate_is_valid(text, candidate, answer_letter):
+        return None
+    return candidate
 
-    sentence_start = boundaries_before[-1] if boundaries_before else 0
-    sentence_end = match_end + boundaries_after[0] if boundaries_after else len(text)
-    return sentence_start, sentence_end, match_start, match_end
 
+def _last_nonempty_line(text: str) -> str:
+    """Return the final non-empty line from the response, if any."""
+    for line in reversed((text or "").splitlines()):
+        if line.strip():
+            return line.strip()
+    return ""
 
-def _negated_near(text: str, match: re.Match) -> bool:
-    """Check for negation that appears before the match within the same sentence.
 
-    This is used for answer_text matching to avoid blocking answers that legitimately contain
-    words like "not" (e.g., "do not resuscitate") while still blocking cases like
-    "not <answer_text>".
-    """
-    sentence_start, sentence_end, match_start, _match_end = _get_sentence_containing_match(text, match)
-    prefix = text[sentence_start:match_start]
-    return bool(NEGATION_PATTERN.search(prefix))
+def _is_compact_multi_option_list(text: str, answer_letter: str) -> bool:
+    """Detect short tails like `A, C` or `B and D` that should fail closed."""
+    matches = [
+        match
+        for match in _OPTION_TOKEN_RE.finditer(text)
+        if _option_kind_matches(_norm_option(match.group("opt")), answer_letter)
+    ]
+    if len(matches) < 2:
+        return False
 
+    if len(text.strip()) > 40:
+        return False
 
-def _negative_after_option(text: str, match: re.Match) -> bool:
-    """Check if an option token is immediately followed by negative context like 'C is incorrect'."""
-    _sentence_start, sentence_end, _match_start, match_end = _get_sentence_containing_match(text, match)
-    suffix = text[match_end:sentence_end]
-    return bool(NEGATIVE_AFTER_OPTION_PATTERN.search(suffix))
+    for idx in range(len(matches) - 1):
+        between = normalize_for_match(text[matches[idx].end() : matches[idx + 1].start()])
+        if not _looks_like_option_connector(between):
+            return False
+
+    return True
+
+
+def _tail_choice_text(text: str) -> str:
+    """Extract the short trailing segment that feeds the tail-choice fallback."""
+    region = (text or "").strip()
+    if not region:
+        return ""
+
+    parts = re.split(r"\n+|[.!?]\s+", region)
+    tail_choice = parts[-1].strip() if parts else region
+    if not tail_choice:
+        tail_choice = _last_nonempty_line(region)
+    # Long trailing prose is too ambiguous for the tail-choice heuristic.
+    if len(tail_choice.split()) > TAIL_CHOICE_MAX_WORDS:
+        return ""
+    return tail_choice
+
+
+def _contains_multiple_option_led_sentences(text: str, answer_letter: str) -> bool:
+    """Detect multi-line or multi-sentence payloads that enumerate different option labels."""
+    distinct: set[str] = set()
+    # Newline-separated enumerations are common in model outputs, so keep lines intact in that case.
+    chunks = (text or "").splitlines() if "\n" in (text or "") else re.split(r"[.!?]\s+", text or "")
+    for chunk in chunks:
+        match = _SENTENCE_OPTION_START_RE.match(chunk.strip())
+        if not match:
+            continue
+        token = _norm_option(match.group("opt"))
+        if token is None or not _option_kind_matches(token, answer_letter):
+            continue
+        distinct.add(token)
+        if len(distinct) > 1:
+            return True
+    return False
+
+
+def _tail_candidate(region: str, answer_letter: str) -> Optional[_Candidate]:
+    """Extract a last-line or tail-choice option token from the terminal region."""
+    line = _last_nonempty_line(region)
+    # Prefer an exact last-line option like "(C)" before falling back to a looser tail-choice scan.
+    if line and not _is_compact_multi_option_list(line, answer_letter):
+        match = _TERMINAL_OPTION_LINE_RE.fullmatch(line)
+        if match:
+            token = _norm_option(match.group("opt"))
+            if token is not None and _option_kind_matches(token, answer_letter):
+                line_offset = region.rfind(line)
+                start = line_offset + match.start("opt")
+                end = line_offset + match.end("opt")
+                candidate = _Candidate(token=token, start=start, end=end, method="last_token")
+                if _candidate_is_valid(region, candidate, answer_letter):
+                    return candidate
+
+    tail_choice = _tail_choice_text(region)
+    if not tail_choice or _is_compact_multi_option_list(tail_choice, answer_letter):
+        return None
 
+    match = _TAIL_CHOICE_OPTION_RE.search(tail_choice)
+    if not match:
+        return None
 
-def _tail_region(text: str, max_tokens: int = 64) -> str:
-    """Return a short tail slice (last sentence/line) to reduce option-token noise."""
-    boundaries = list(SENTENCE_BOUNDARY.finditer(text))
-    tail = text[boundaries[-1].end() :] if boundaries else text
-    tail = tail.strip()
+    token = _norm_option(match.group("opt"))
+    if token is None or not _option_kind_matches(token, answer_letter):
+        return None
 
-    if not tail:
-        for line in reversed(text.splitlines()):
-            if line.strip():
-                tail = line.strip()
-                break
+    tail_choice_offset = region.rfind(tail_choice)
+    candidate = _Candidate(
+        token=token,
+        start=tail_choice_offset + match.start("opt"),
+        end=tail_choice_offset + match.end("opt"),
+        method="last_token",
+    )
+    if not _candidate_is_valid(region, candidate, answer_letter):
+        return None
+    return candidate
+
+
+def _answer_text_pattern(answer_text: str) -> re.Pattern[str]:
+    """Compile a whitespace-tolerant exact-answer-text regex."""
+    flexible_answer = re.escape(answer_text).replace(r"\ ", r"\s+")
+    return re.compile(rf"(?<!\w){flexible_answer}(?!\w)", re.IGNORECASE)
+
+
+def _latest_answer_text_match(region: str, answer_text: str, answer_letter: str) -> Optional[str]:
+    """Return the latest valid exact answer-text match inside a search region."""
+    region_struct = normalize_for_structure(region)
+    if not answer_text or not region_struct:
+        return None
 
-    tokens = tail.split()
-    if len(tokens) > max_tokens:
-        tail = " ".join(tokens[-max_tokens:])
-    return tail
+    latest_valid: Optional[str] = None
+    for match in _answer_text_pattern(answer_text).finditer(region_struct):
+        if _answer_text_match_is_valid(region_struct, match.start(), match.end(), answer_letter):
+            latest_valid = answer_text
+
+    return latest_valid
+
+
+def _answer_text_match_is_valid(region_struct: str, start: int, end: int, answer_letter: str) -> bool:
+    """Reject answer-text matches that sit inside obvious negation or option-list structure."""
+    prefix = region_struct[max(0, start - 64) : start].rstrip()
+    if _NEGATION_PREFIX_RE.search(prefix):
+        return False
+    if prefix.endswith("rather than") or prefix.endswith("except"):
+        return False
+    if "wrong diagnosis is" in prefix[-40:] or "incorrect diagnosis is" in prefix[-40:]:
+        return False
+
+    line_start, line_end = _line_bounds(region_struct, start, end)
+    raw_line = region_struct[line_start:line_end]
+    rel_start = start - line_start
+    rel_end = end - line_start
+    leading_match = _LEADING_OPTION_RE.match(raw_line.strip())
+    if leading_match is not None:
+        token = _norm_option(leading_match.group("opt"))
+        if token is not None and _option_kind_matches(token, answer_letter):
+            return False
+
+    # Bulleted or numbered option-analysis lines often mention distractor answer text verbatim.
+    if _BULLET_OR_LIST_LINE_RE.match(raw_line):
+        before_match = raw_line[:rel_start]
+        after_match = raw_line[rel_end:].lstrip(" *_`~)]}>】")
+        if ":" in before_match or any(marker in before_match for marker in (" - ", " – ", " — ")):
+            return False
+        if after_match.startswith((":", "-", "–", "—")):
+            return False
+
+    return True
+
+
+def _answer_text_regions(text: str, answer_text: str, is_long: bool) -> list[str]:
+    """Choose the bounded regions where answer-text fallback is allowed to search."""
+    if is_long:
+        # In long mode, the tail is authoritative because earlier reasoning is frequently revised.
+        return [text[-TERMINAL_WINDOW_CHARS:]]
+
+    if len(text) <= 800:
+        return [text]
+
+    # For short-mode responses over 800 chars, search bounded tail/head windows but align them to line
+    # boundaries so local validation still sees bullet markers and nearby list structure.
+    window = max(600, min(1_400, len(answer_text) + 400))
+    line_slack = 200
+
+    # Only stretch to the next line break when it is still close to the window edge.
+    head_end = text.find("\n", window, min(len(text), window + line_slack + 1))
+    if head_end == -1:
+        head_end = min(len(text), window)
+    head = text[:head_end]
+
+    tail_start = max(0, len(text) - window)
+    aligned_tail_start = text.rfind("\n", max(0, tail_start - line_slack), tail_start)
+    if aligned_tail_start != -1:
+        tail_start = aligned_tail_start + 1
+    tail = text[tail_start:]
+
+    if head == tail:
+        return [head]
+    return [tail, head]
 
 
 def multiple_choice_accuracy(
@@ -214,140 +771,112 @@ def multiple_choice_accuracy(
     strip_tex: bool = True,
     return_details: bool = False,
 ) -> bool | MCQAccuracyResult:
-    """
-    Grade a multiple-choice answer with layered strategies:
-
-    1. Direct answer: Response is just the option letter/number
-    2. Anchored token: Use the last occurrence of a provided prefix, otherwise general anchor phrases
-    3. Last token: Take the last letter/number found anywhere
-    4. Answer text: Match the full answer text (if long enough)
-
-    Args:
-        llm_answer: The model's response text
-        answer_letter: The correct answer letter/number (e.g., "C" or "3")
-        answer_text: The full correct answer text
-        prefix: Optional prefix to strip (e.g., "The answer is: ")
-        accept_answer_text: Whether to fall back to text matching
-        strip_tex: Whether to strip LaTeX formatting
-        return_details: If True, return MCQAccuracyResult dataclass instead of bool
-
-    Returns:
-        bool (if return_details=False) or MCQAccuracyResult (if return_details=True)
-    """
-
-    def _result(
-        is_correct: bool, method: str, predicted: str | None, actual: str | None, return_details: bool
-    ) -> bool | MCQAccuracyResult:
-        """Helper to format return value."""
-        if not return_details:
-            return is_correct
-        return MCQAccuracyResult(
-            is_correct=is_correct,
-            method=method,
-            matched_answer=predicted,
-            correct_answer=actual,
-        )
+    """Grade an MCQ answer using short-mode scans and tail-authoritative long-mode scans."""
 
     if not llm_answer:
         return _result(False, "none", None, None, return_details)
 
-    # Normalize the response
-    llm_answer = _remove_think_tags(llm_answer)
-
+    # Strip reasoning wrappers and normalize before any extraction logic runs.
+    processed_answer = _remove_think_tags(llm_answer)
+    processed_answer = _ANSWER_TAG_RE.sub(" ", processed_answer)
     if strip_tex:
-        llm_answer = _strip_tex(llm_answer)
-        answer_text = _strip_tex(answer_text)
-
-    llm_answer_original = llm_answer
+        processed_answer = _strip_tex(processed_answer)
+        answer_text = _strip_tex(answer_text or "")
 
-    # Normalize: casefold only (preserve whitespace structure for sentence detection)
-    llm_answer = _nfkc_casefold(llm_answer)
+    structural_text = normalize_for_structure(processed_answer).strip()
+    answer_letter = _norm_option(answer_letter)
+    answer_text = normalize_for_answer_text_match(answer_text or "")
+    exact_answer_text_allowed = accept_answer_text and bool(answer_text)
+    answer_text_fallback_allowed = accept_answer_text and _answer_text_supports_fallback(answer_text)
 
-    answer_letter = _norm_letter(answer_letter)
-    answer_text = _nfkc_casefold(_normalize_spaces(answer_text or ""))
     if answer_letter is None:
         raise ValueError(f"Invalid answer_letter '{answer_letter=}'. Must be a single letter or digit string.")
 
-    explicit_choice_found = False
-
-    # Strategy 1: Only answer letter anywhere (without anchoring)
-    if answer_letter == _norm_letter(llm_answer):
-        return _result(True, "direct_answer", llm_answer, answer_letter, return_details)
-
-    # Strategy 2: Accept leading option token like "B. answer ..."
-    leading_match = LEADING_OPTION_PATTERN.match(llm_answer_original)
-    if leading_match and answer_letter:
-        predicted = _norm_letter(leading_match.group(1))
-        if _token_kind_matches_answer_letter(predicted, answer_letter):
-            explicit_choice_found = True
-        if predicted == answer_letter:
-            return _result(True, "anchored_token", predicted, answer_letter, return_details)
-
-    # Strategy 3: Anchored token (prefix matches first, fallback to generic anchors)
-    prefix_matches = []
-    if prefix:
-        prefix_norm = _nfkc_casefold(prefix).strip()
-        if prefix_norm:
-            flexible_prefix = re.escape(prefix_norm).replace(r"\ ", r"\s+")
-            prefix_pattern = re.compile(
-                rf"{flexible_prefix}\s*[:\-–—]?\s*(?:is\s*)?(?P<neg>not\s+|isn['’]t\s+)?\(?\s*(?P<opt>[A-Za-z]|\d{{1,2}})\s*[\)\.:]?(?![\w+\-/])",
-                re.IGNORECASE,
-            )
-            prefix_matches = list(prefix_pattern.finditer(llm_answer))
-
-    anchored_matches = prefix_matches if prefix_matches else list(ANCHOR_PATTERN.finditer(llm_answer))
-    if anchored_matches and answer_letter:
-        last_match = anchored_matches[-1]
-        predicted = _norm_letter(last_match.group("opt"))
-        if last_match.group("neg") is None and _token_kind_matches_answer_letter(predicted, answer_letter):
-            explicit_choice_found = True
-        if predicted == answer_letter and last_match.group("neg") is None:
-            return _result(True, "anchored_token", predicted, answer_letter, return_details)
-
-    # Strategy 4: Last token in the answer tail, ignore negative contexts like "C is incorrect",
-    if not explicit_choice_found and answer_letter:
-        tail = _tail_region(llm_answer)
-        tail_tokens = list(TOKEN_PATTERN.finditer(tail))
-        if tail_tokens:
-            # Take the last non-negated, non-negative-context token.
-            for token_match in reversed(tail_tokens):
-                predicted = _norm_letter(token_match.group(1))
-                if predicted is None:
-                    continue
-                if _negated_near(tail, token_match):
-                    continue
-                if _negative_after_option(tail, token_match):
-                    continue
-                if predicted == answer_letter:
-                    return _result(True, "last_token", predicted, answer_letter, return_details)
+    if not structural_text:
+        return _result(False, "none", None, None, return_details)
 
-    # Strategy 5: Exact answer text match if there's no explicit choice found
-    # Only search at beginning and end to avoid matching reasoning in the middle
-    if accept_answer_text and answer_text and not explicit_choice_found:
-        # Calculate search regions based on token count
-        answer_tokens = len(answer_text.split())
-        buffer_tokens = answer_tokens + 15  # Extra tokens for preamble like "The answer is:"
+    # Strategy 1: exact standalone option, e.g. "C" or "(2)".
+    direct_option = _extract_exact_option(structural_text, answer_letter)
+    if direct_option == answer_letter:
+        return _result(
+            True,
+            "direct_answer",
+            direct_option.casefold(),
+            answer_letter,
+            return_details,
+        )
 
-        llm_tokens = llm_answer.split()
+    # Strategy 2: exact answer text after wrapper normalization. This stays enabled
+    # even for numeric answer text, so a parsed output like "\boxed{4}" can match the
+    # gold answer text before a mismatched standalone numeral is rejected below.
+    if exact_answer_text_allowed:
+        direct_text = _extract_exact_answer_text(structural_text, answer_text)
+        if direct_text is not None:
+            return _result(True, "answer_text", direct_text, answer_text, return_details)
+
+    if direct_option is not None:
+        return _result(
+            False,
+            "direct_answer",
+            direct_option.casefold(),
+            answer_letter,
+            return_details,
+        )
 
-        beginning_tokens = llm_tokens[:buffer_tokens]
-        end_tokens = llm_tokens[-buffer_tokens:] if len(llm_tokens) > buffer_tokens else llm_tokens
+    # Strategy 3: short option-led answer that also includes the answer text.
+    option_plus_text = _extract_exact_option_plus_text(structural_text, answer_letter, answer_text)
+    if option_plus_text is not None:
+        return _result(
+            option_plus_text == answer_letter,
+            "anchored_token",
+            option_plus_text,
+            answer_letter,
+            return_details,
+        )
 
-        beginning_region = " ".join(beginning_tokens)
-        end_region = " ".join(end_tokens)
+    is_long = len(structural_text) > LONG_RESPONSE_THRESHOLD_CHARS
+    terminal_region = structural_text[-TERMINAL_WINDOW_CHARS:] if is_long else structural_text
+    strong_tail_region = terminal_region[-STRONG_TAIL_WINDOW_CHARS:] if is_long else structural_text
+
+    # Strategy 4: anchored commitments like "final answer: C".
+    explicit_candidate = _latest_explicit_candidate(terminal_region, answer_letter, prefix)
+    if explicit_candidate is not None:
+        return _result(
+            explicit_candidate.token == answer_letter,
+            explicit_candidate.method,
+            explicit_candidate.token,
+            answer_letter,
+            return_details,
+        )
 
-        # Make answer_text flexible for whitespace variations
-        flexible_answer = re.escape(answer_text).replace(r"\ ", r"\s+")
-        pattern = re.compile(rf"(?<!\w){flexible_answer}(?!\w)", re.IGNORECASE)
+    # Strategy 5: leading-option forms are only trusted in short responses.
+    if not is_long:
+        leading_candidate = _leading_option_candidate(structural_text, answer_letter, answer_text)
+        if leading_candidate is not None:
+            return _result(
+                leading_candidate.token == answer_letter,
+                leading_candidate.method,
+                leading_candidate.token,
+                answer_letter,
+                return_details,
+            )
 
-        # Check beginning first
-        match = pattern.search(beginning_region)
-        if match and not _negated_near(beginning_region, match):
-            return _result(True, "answer_text", beginning_region, answer_text, return_details)
+    # Strategy 6: tail-only token fallback from the last line or short tail choice text.
+    tail_candidate = _tail_candidate(strong_tail_region, answer_letter)
+    if tail_candidate is not None:
+        return _result(
+            tail_candidate.token == answer_letter,
+            tail_candidate.method,
+            tail_candidate.token,
+            answer_letter,
+            return_details,
+        )
 
-        # Then check end (after reasoning)
-        match = pattern.search(end_region)
-        if match and not _negated_near(end_region, match):
-            return _result(True, "answer_text", end_region, answer_text, return_details)
+    # Strategy 7: exact answer-text fallback in bounded head/tail regions.
+    if answer_text_fallback_allowed and answer_text:
+        for region in _answer_text_regions(structural_text, answer_text, is_long):
+            matched = _latest_answer_text_match(region, answer_text, answer_letter)
+            if matched is not None:
+                return _result(True, "answer_text", matched, answer_text, return_details)
 
     return _result(False, "none", None, None, return_details)
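The long-mode windowing in `multiple_choice_accuracy` above can be illustrated in isolation. This is a minimal sketch, not the module's code: the constant values are hypothetical placeholders (the real values are defined elsewhere in the module and are not shown in this diff), and `scan_regions` is an illustrative helper name.

```python
# Hypothetical values for illustration only; the real constants live in the module.
LONG_RESPONSE_THRESHOLD_CHARS = 400
TERMINAL_WINDOW_CHARS = 200
STRONG_TAIL_WINDOW_CHARS = 80


def scan_regions(structural_text: str) -> tuple[bool, str, str]:
    """Return (is_long, terminal_region, strong_tail_region) for a response.

    Short responses are scanned whole. Long responses are scanned only in a
    terminal window, and the tail-token fallback sees an even narrower slice
    taken from the end of that window.
    """
    is_long = len(structural_text) > LONG_RESPONSE_THRESHOLD_CHARS
    terminal_region = structural_text[-TERMINAL_WINDOW_CHARS:] if is_long else structural_text
    strong_tail_region = terminal_region[-STRONG_TAIL_WINDOW_CHARS:] if is_long else structural_text
    return is_long, terminal_region, strong_tail_region
```

Under these assumed values, a response such as `"x" * 500 + "final answer: C"` is treated as long, so an anchored commitment is only searched for in its last 200 characters, and option letters mentioned earlier in the reasoning cannot be picked up by the later strategies.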
diff --git a/tests/test_cli/test_main.py b/tests/test_cli/test_main.py
index 4bee1d52..c78b6cad 100644
--- a/tests/test_cli/test_main.py
+++ b/tests/test_cli/test_main.py
@@ -1794,10 +1794,11 @@ def test_process_cli_applies_config_defaults(monkeypatch: pytest.MonkeyPatch, tm
     cfg_path = tmp_path / "process.yaml"
     cfg_path.write_text(
         f"""
-        runs_dir: runs-from-config
-        output_dir: processed-from-config
-        env_config_root: {env_root}
-        max_workers: 2
+        runs_dir: runs/raw-from-config
+        process:
+          dir: processed
+          env_config_root: {env_root}
+          max_workers: 2
         hf:
           repo: medarc/demo
           branch: main
@@ -1821,8 +1822,8 @@ def fake_run(options, env_export_map):
     assert exit_code == 0
 
     options = captured["options"]
-    assert options.runs_dir == Path("runs-from-config")
-    assert options.output_dir == Path("processed-from-config")
+    assert options.runs_dir == Path("runs/raw-from-config")
+    assert options.output_dir == Path("runs/processed")
     assert options.max_workers == 2
     assert options.hf_pull_policy == "pull"
     assert options.hf_config is not None
@@ -1837,24 +1838,62 @@ def fake_run(options, env_export_map):
     assert options.hf_config is not None
     assert options.hf_config.token == "override"
 
+    exit_code = main.main(["process", "--config", str(cfg_path), "--hf-pull-policy", "continue-upload", "--dry-run"])
+    assert exit_code == 0
+    options = captured["options"]
+    assert options.hf_pull_policy == "continue-upload"
+
+
+def test_process_cli_resolves_hf_token_env_reference(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
+    cfg_path = tmp_path / "process.yaml"
+    cfg_path.write_text(
+        """
+        runs_dir: runs/raw-from-config
+        process:
+          dir: processed
+        hf:
+          repo: medarc/demo
+          token: $HF_TOKEN
+        """,
+        encoding="utf-8",
+    )
+    monkeypatch.setenv("HF_TOKEN", "env-secret")
+
+    captured: dict[str, Any] = {}
+
+    def fake_run(options, env_export_map):
+        captured["options"] = options
+        return ProcessResult(records_processed=0, rows_processed=0, env_groups=[], env_summaries=[], hf_summary=None)
+
+    monkeypatch.setattr("medarc_verifiers.cli.main.run_process", fake_run)
+
+    exit_code = main.main(["process", "--config", str(cfg_path), "--dry-run"])
+    assert exit_code == 0
+    assert captured["options"].hf_config is not None
+    assert captured["options"].hf_config.token == "env-secret"
+
 
 def test_winrate_cli_applies_config_defaults(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
     cfg_path = tmp_path / "winrate.yaml"
     cfg_path.write_text(
         """
-        processed_dir: runs-from-config
-        output_name: from-config
-        missing_policy: zero
-        epsilon: 0.123
-        min_common: 7
-        weight_policy: equal
-        weight_cap: 99
-        include_models: [alpha, beta]
-        exclude_model: gamma
+        runs_dir: runs/raw-from-config
+        process:
+          dir: processed
+        winrate:
+          output_name: from-config
+          missing_policy: zero
+          epsilon: 0.123
+          min_common: 7
+          weight_policy: equal
+          weight_cap: 99
+          include_models: [alpha, beta]
+          exclude_model: gamma
         hf:
           repo: medarc/demo
           branch: main
           token: secret-token
+          winrate_dir: scorecards/latest
         """,
         encoding="utf-8",
     )
@@ -1876,17 +1915,23 @@ def fake_run_winrate(
         }
         return SimpleNamespace(
             output_path=tmp_path / "out.json",
+            output_paths=[tmp_path / "out.json"],
             result={"models": {}},
             datasets=[("demo-env", [Path("demo-env.parquet")])],
         )
 
+    def fake_sync_files_to_hub(**kwargs):
+        captured["upload"] = kwargs
+
     monkeypatch.setattr(main, "run_winrate", fake_run_winrate)
+    monkeypatch.setattr(main, "sync_files_to_hub", fake_sync_files_to_hub)
     monkeypatch.setattr(main, "print_winrate_summary_markdown", lambda *_args, **_kwargs: None)
 
     exit_code = main.main(["winrate", "--config", str(cfg_path), "--processed-at", "2024-01-01T00:00:00Z"])
     assert exit_code == 0
 
-    assert captured["run_kwargs"]["processed_dir"] == Path("runs-from-config")
+    assert captured["run_kwargs"]["processed_dir"] == Path("runs/processed")
+    assert captured["run_kwargs"]["output_dir"] == Path("runs/processed") / "winrate"
     cfg = captured["run_kwargs"]["config"]
     assert cfg.missing_policy == "zero"
     assert cfg.epsilon == pytest.approx(0.123)
@@ -1895,6 +1940,12 @@ def fake_run_winrate(
     assert cfg.weight_cap == 99
     assert cfg.include_models == ("alpha", "beta")
     assert cfg.exclude_models == ("gamma",)
+    assert captured["run_kwargs"]["hf_config"] is not None
+    assert captured["run_kwargs"]["hf_config"].repo_id == "medarc/demo"
+    upload = captured.get("upload")
+    assert upload is not None
+    assert upload["repo_id"] == "medarc/demo"
+    assert upload["path_in_repo_prefix"] == "scorecards/latest"
 
     exit_code = main.main(
         [
@@ -1912,6 +1963,126 @@ def fake_run_winrate(
     assert cfg.epsilon == pytest.approx(0.5)
 
 
+def test_winrate_cli_resolves_hf_token_braced_env_reference(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    cfg_path = tmp_path / "winrate.yaml"
+    cfg_path.write_text(
+        """
+        processed_dir: runs/processed
+        hf:
+          repo: medarc/demo
+          token: ${HF_TOKEN}
+        """,
+        encoding="utf-8",
+    )
+    monkeypatch.setenv("HF_TOKEN", "env-secret")
+
+    captured: dict[str, Any] = {}
+
+    def fake_run_winrate(
+        *, processed_dir, output_dir, output_path, output_name, config, processed_at, hf_config, hf_processed_pull
+    ):
+        captured["hf_config"] = hf_config
+        return SimpleNamespace(
+            output_path=tmp_path / "out.json",
+            output_paths=[tmp_path / "out.json"],
+            result={"models": {}},
+            datasets=[],
+        )
+
+    monkeypatch.setattr(main, "run_winrate", fake_run_winrate)
+    monkeypatch.setattr(main, "print_winrate_summary_markdown", lambda *_args, **_kwargs: None)
+    monkeypatch.setattr(main, "sync_files_to_hub", lambda **_kwargs: None)
+
+    exit_code = main.main(["winrate", "--config", str(cfg_path), "--processed-at", "2024-01-01T00:00:00Z"])
+    assert exit_code == 0
+    assert captured["hf_config"] is not None
+    assert captured["hf_config"].token == "env-secret"
+
+
+def test_process_cli_rejects_unset_hf_token_env_reference(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+    capsys: pytest.CaptureFixture[str],
+) -> None:
+    cfg_path = tmp_path / "process.yaml"
+    cfg_path.write_text(
+        """
+        runs_dir: runs/raw-from-config
+        process:
+          dir: processed
+        hf:
+          repo: medarc/demo
+          token: $HF_TOKEN
+        """,
+        encoding="utf-8",
+    )
+    monkeypatch.delenv("HF_TOKEN", raising=False)
+
+    with pytest.raises(SystemExit) as excinfo:
+        main.main(["process", "--config", str(cfg_path), "--dry-run"])
+
+    assert excinfo.value.code == 2
+    assert "references unset environment variable 'HF_TOKEN'" in capsys.readouterr().err
+
+
+def test_expand_embedded_process_config_promotes_process_section() -> None:
+    payload = {
+        "runs_dir": "runs/raw",
+        "process": {
+            "dir": "processed",
+            "max_workers": 8,
+            "replace_models": ["model-a"],
+        },
+        "winrate": {"dir": "scorecards"},
+    }
+
+    expanded = main._expand_embedded_pipeline_config(payload, mode="process")
+
+    assert expanded["runs_dir"] == "runs/raw"
+    assert expanded["output_dir"] == Path("runs/processed")
+    assert expanded["max_workers"] == 8
+    assert expanded["replace_models"] == ["model-a"]
+    assert "winrate" not in expanded
+    assert payload["process"]["dir"] == "processed"
+
+
+def test_expand_embedded_winrate_config_resolves_relative_dirs() -> None:
+    payload = {
+        "runs_dir": "artifacts/raw",
+        "process": {"dir": "processed"},
+        "winrate": {
+            "dir": "scorecards",
+            "missing_policy": "zero",
+            "hf_winrate_dir": "uploads/winrate",
+        },
+    }
+
+    expanded = main._expand_embedded_pipeline_config(payload, mode="winrate")
+
+    assert expanded["processed_dir"] == Path("artifacts/processed")
+    assert expanded["output_dir"] == Path("artifacts/processed/scorecards")
+    assert expanded["missing_policy"] == "zero"
+    assert expanded["hf_winrate_dir"] == "uploads/winrate"
+
+
+def test_expand_embedded_winrate_config_keeps_explicit_dirs() -> None:
+    payload = {
+        "processed_dir": "custom/processed",
+        "output_dir": "custom/winrate",
+        "runs_dir": "artifacts/raw",
+        "process": {"dir": "processed"},
+        "winrate": {"dir": "scorecards"},
+    }
+
+    expanded = main._expand_embedded_pipeline_config(payload, mode="winrate")
+
+    assert expanded["processed_dir"] == "custom/processed"
+    assert expanded["output_dir"] == "custom/winrate"
+
+
 def test_process_cli_requires_winrate_config_path(tmp_path: Path) -> None:
     missing_path = tmp_path / "missing.yaml"
     with pytest.raises(SystemExit):
@@ -1928,17 +2099,130 @@ def test_process_cli_requires_winrate_config_path(tmp_path: Path) -> None:
         )
 
 
-def test_process_cli_runs_winrate_post_step(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
-    cfg_path = tmp_path / "winrate.yaml"
+def test_process_cli_defaults_status_filter_to_completed(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
+    captured: dict[str, Any] = {}
+
+    def fake_run_process(options, env_export_map):
+        captured["options"] = options
+        return ProcessResult(records_processed=0, rows_processed=0, env_groups=[], env_summaries=[], hf_summary=None)
+
+    monkeypatch.setattr(main, "run_process", fake_run_process)
+
+    exit_code = main.main(
+        [
+            "process",
+            "--runs-dir",
+            str(tmp_path / "runs"),
+            "--output-dir",
+            str(tmp_path / "processed"),
+            "--dry-run",
+        ]
+    )
+
+    assert exit_code == 0
+    options = captured["options"]
+    assert options.status_filter == ("completed",)
+    assert options.processed_with_args["status"] == ["completed"]
+    assert options.max_results_missing_pct == pytest.approx(2.5)
+    assert options.processed_with_args["max_results_missing_pct"] == pytest.approx(2.5)
+
+
+def test_process_cli_uses_explicit_status_filter(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
+    captured: dict[str, Any] = {}
+
+    def fake_run_process(options, env_export_map):
+        captured["options"] = options
+        return ProcessResult(records_processed=0, rows_processed=0, env_groups=[], env_summaries=[], hf_summary=None)
+
+    monkeypatch.setattr(main, "run_process", fake_run_process)
+
+    exit_code = main.main(
+        [
+            "process",
+            "--runs-dir",
+            str(tmp_path / "runs"),
+            "--output-dir",
+            str(tmp_path / "processed"),
+            "--status",
+            "failed",
+            "--max-results-missing-pct",
+            "100",
+            "--dry-run",
+        ]
+    )
+
+    assert exit_code == 0
+    options = captured["options"]
+    assert options.status_filter == ("failed",)
+    assert options.processed_with_args["status"] == ["failed"]
+    assert options.max_results_missing_pct == pytest.approx(100.0)
+
+
+def test_process_cli_rejects_negative_max_results_missing_pct(
+    tmp_path: Path,
+    capsys: pytest.CaptureFixture[str],
+) -> None:
+    with pytest.raises(SystemExit) as excinfo:
+        main.main(
+            [
+                "process",
+                "--runs-dir",
+                str(tmp_path / "runs"),
+                "--output-dir",
+                str(tmp_path / "processed"),
+                "--max-results-missing-pct",
+                "-1",
+            ]
+        )
+
+    assert excinfo.value.code == 2
+    err = capsys.readouterr().err
+    assert "--max-results-missing-pct must be non-negative." in err
+
+
+def test_process_config_empty_status_uses_default_filter(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    cfg_path = tmp_path / "process.yaml"
     cfg_path.write_text(
         """
-        processed_dir: ignored
-        output_dir: winrate-out
-        output_name: from-config
-        missing_policy: zero
-        hf_processed_repo: ignored/also
-        hf_winrate_repo: medarc/winrate
-        hf_token: secret-token
+        runs_dir: runs/raw
+        process:
+          dir: processed
+          status: []
+        """,
+        encoding="utf-8",
+    )
+
+    captured: dict[str, Any] = {}
+
+    def fake_run_process(options, env_export_map):
+        captured["options"] = options
+        return ProcessResult(records_processed=0, rows_processed=0, env_groups=[], env_summaries=[], hf_summary=None)
+
+    monkeypatch.setattr(main, "run_process", fake_run_process)
+
+    exit_code = main.main(["process", "--config", str(cfg_path), "--dry-run"])
+
+    assert exit_code == 0
+    options = captured["options"]
+    assert options.status_filter == ("completed",)
+    assert options.processed_with_args["status"] == ["completed"]
+
+
+def test_process_cli_runs_embedded_winrate_post_step(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
+    cfg_path = tmp_path / "process.yaml"
+    cfg_path.write_text(
+        """
+        runs_dir: runs/raw
+        process:
+          dir: processed
+        winrate:
+          dir: scorecards
+          output_name: from-config
+          missing_policy: zero
+          hf_winrate_dir: winrate-post
         """,
         encoding="utf-8",
     )
@@ -1981,6 +2265,7 @@ def fake_sync_files_to_hub(
             "message": message,
             "branch": branch,
             "dry_run": dry_run,
+            **_kw,
         }
 
     monkeypatch.setattr(main, "run_process", fake_run_process)
@@ -1991,37 +2276,90 @@ def fake_sync_files_to_hub(
     exit_code = main.main(
         [
             "process",
-            "--runs-dir",
-            str(tmp_path / "runs"),
-            "--output-dir",
-            str(tmp_path / "processed"),
-            "--winrate",
+            "--config",
             str(cfg_path),
+            "--hf-repo",
+            "medarc/shared",
+            "--hf-token",
+            "secret-token",
         ]
     )
     assert exit_code == 0
-    assert captured["run_kwargs"]["processed_dir"] == Path(tmp_path / "processed")
-    assert captured["run_kwargs"]["output_dir"] == Path("winrate-out")
+    assert captured["run_kwargs"]["processed_dir"] == Path("runs/processed")
+    assert captured["run_kwargs"]["output_dir"] == Path("runs/processed/scorecards")
     assert captured["run_kwargs"]["hf_config"] is None
     assert captured["run_kwargs"]["hf_processed_pull"] is False
     upload = captured.get("upload")
     assert upload is not None
-    assert upload["repo_id"] == "medarc/winrate"
+    assert upload["repo_id"] == "medarc/shared"
     assert upload["token"] == "secret-token"
     assert upload["files"] == ["winrate.json"]
+    assert upload["path_in_repo_prefix"] == "winrate-post"
+
+
+def test_process_cli_defaults_winrate_output_dir_under_processed(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    cfg_path = tmp_path / "process.yaml"
+    cfg_path.write_text(
+        """
+        runs_dir: runs/raw
+        process:
+          dir: processed
+        winrate:
+          missing_policy: zero
+        """,
+        encoding="utf-8",
+    )
+
+    captured: dict[str, Any] = {}
+
+    def fake_run_process(options, env_export_map):
+        captured["options"] = options
+        return ProcessResult(records_processed=0, rows_processed=0, env_groups=[], env_summaries=[], hf_summary=None)
+
+    def fake_run_winrate(
+        *, processed_dir, output_dir, output_path, output_name, config, processed_at, hf_config, hf_processed_pull
+    ):
+        captured["run_kwargs"] = {
+            "processed_dir": processed_dir,
+            "output_dir": output_dir,
+        }
+        return SimpleNamespace(
+            output_path=Path(output_dir) / "winrate.json",
+            output_paths=[Path(output_dir) / "winrate.json"],
+            result={"models": {}},
+            datasets=[],
+        )
+
+    monkeypatch.setattr(main, "run_process", fake_run_process)
+    monkeypatch.setattr(main, "run_winrate", fake_run_winrate)
+    monkeypatch.setattr(main, "print_winrate_summary_markdown", lambda *_args, **_kwargs: None)
+
+    exit_code = main.main(
+        [
+            "process",
+            "--config",
+            str(cfg_path),
+        ]
+    )
+    assert exit_code == 0
+    assert captured["run_kwargs"]["processed_dir"] == Path("runs/processed")
+    assert captured["run_kwargs"]["output_dir"] == Path("runs/processed/winrate")
 
 
 def test_process_config_sets_winrate_path(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
     cfg_path = tmp_path / "process.yaml"
-    winrate_cfg = tmp_path / "winrate.yaml"
     fake_runs_dir = tmp_path / "runs" / "raw"
     fake_runs_dir.mkdir(parents=True)
-    winrate_cfg.write_text("output_dir: runs/winrate\n", encoding="utf-8")
     cfg_path.write_text(
         f"""
         runs_dir: {fake_runs_dir}
-        output_dir: runs/processed
-        winrate: {winrate_cfg}
+        process:
+          dir: processed
+        winrate:
+          enabled: true
         """,
         encoding="utf-8",
     )
@@ -2055,7 +2393,9 @@ def fake_run_winrate(
         ]
     )
     assert exit_code == 0
-    assert captured["run_kwargs"]["processed_dir"] == Path("runs/processed")
+    expected_processed_dir = fake_runs_dir.parent / "processed"
+    assert captured["run_kwargs"]["processed_dir"] == expected_processed_dir
+    assert captured["run_kwargs"]["output_dir"] == expected_processed_dir / "winrate"
 
 
 def test_process_cli_rejects_include_prompt_completion(tmp_path: Path) -> None:
@@ -2076,6 +2416,7 @@ def test_process_cli_rejects_include_prompt_completion(tmp_path: Path) -> None:
     ("field", "value"),
     [
         ("max_workers", "not-an-int"),
+        ("max_results_missing_pct", "not-a-float"),
         ("hf_request_timeout", "not-a-float"),
         ("hf_retries", "not-an-int"),
         ("hf_max_files_per_commit", "not-an-int"),
@@ -2106,6 +2447,97 @@ def test_process_cli_rejects_invalid_typed_config_values(
     assert value in err
 
 
+def test_process_cli_rejects_removed_top_level_max_run_missing_pct_config_key(
+    tmp_path: Path,
+    capsys: pytest.CaptureFixture[str],
+) -> None:
+    cfg_path = tmp_path / "process-removed-top-level.yaml"
+    cfg_path.write_text(
+        """
+        runs_dir: runs/raw
+        output_dir: runs/processed
+        max_run_missing_pct: 2.5
+        """,
+        encoding="utf-8",
+    )
+
+    with pytest.raises(SystemExit) as excinfo:
+        main.main(["process", "--config", str(cfg_path)])
+
+    assert excinfo.value.code == 2
+    err = capsys.readouterr().err
+    assert "Process config field 'max_run_missing_pct' was removed" in err
+    assert "max_results_missing_pct" in err
+
+
+def test_process_cli_rejects_removed_embedded_max_run_missing_pct_config_key(
+    tmp_path: Path,
+    capsys: pytest.CaptureFixture[str],
+) -> None:
+    cfg_path = tmp_path / "process-removed-embedded.yaml"
+    cfg_path.write_text(
+        """
+        runs_dir: runs/raw
+        process:
+          dir: processed
+          max_run_missing_pct: 2.5
+        """,
+        encoding="utf-8",
+    )
+
+    with pytest.raises(SystemExit) as excinfo:
+        main.main(["process", "--config", str(cfg_path)])
+
+    assert excinfo.value.code == 2
+    err = capsys.readouterr().err
+    assert "Process config field 'process.max_run_missing_pct' was removed" in err
+    assert "process.max_results_missing_pct" in err
+
+
+def test_winrate_cli_ignores_removed_process_only_missing_pct_key(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    cfg_path = tmp_path / "winrate-process-key.yaml"
+    cfg_path.write_text(
+        """
+        processed_dir: runs/processed
+        process:
+          max_run_missing_pct: 2.5
+        """,
+        encoding="utf-8",
+    )
+
+    captured: dict[str, Any] = {}
+
+    def fake_run_winrate(
+        *, processed_dir, output_dir, output_path, output_name, config, processed_at, hf_config, hf_processed_pull
+    ):
+        captured["processed_dir"] = processed_dir
+        return SimpleNamespace(
+            output_path=tmp_path / "out.json",
+            output_paths=[tmp_path / "out.json"],
+            result={"models": {}},
+            datasets=[],
+        )
+
+    monkeypatch.setattr(main, "run_winrate", fake_run_winrate)
+    monkeypatch.setattr(main, "print_winrate_summary_markdown", lambda *_args, **_kwargs: None)
+
+    exit_code = main.main(
+        [
+            "winrate",
+            "--config",
+            str(cfg_path),
+            "--processed-at",
+            "2024-01-01T00:00:00Z",
+        ]
+    )
+
+    assert exit_code == 0
+    assert captured["processed_dir"] == Path("runs/processed")
+
+
 @pytest.mark.parametrize(
     ("field", "value"),
     [
@@ -2161,9 +2593,7 @@ def fake_run(options, env_export_map):
     monkeypatch.setattr(main, "_load_env_export_map", lambda *_args, **_kwargs: {})
     monkeypatch.setattr(main, "run_process", fake_run)
 
-    exit_code = main.main(
-        ["process", "--config", str(cfg_path), "--max-workers", "2", "--dry-run", "--no-validate-manifest"]
-    )
+    exit_code = main.main(["process", "--config", str(cfg_path), "--max-workers", "2", "--dry-run"])
     assert exit_code == 0
     assert captured["options"].max_workers == 2
 
diff --git a/tests/test_cli/test_manifest_tools.py b/tests/test_cli/test_manifest_tools.py
index 3a813b1d..4274fb1e 100644
--- a/tests/test_cli/test_manifest_tools.py
+++ b/tests/test_cli/test_manifest_tools.py
@@ -11,6 +11,43 @@ def _write_json(path: Path, payload: dict) -> None:
     path.write_text(json.dumps(payload), encoding="utf-8")
 
 
+def _write_manifest(
+    run_dir: Path,
+    *,
+    num_examples: int | None = None,
+    rollouts_per_example: int | None = None,
+) -> None:
+    payload = {
+        "version": 3,
+        "run_id": "demo-run",
+        "name": "demo",
+        "config_source": "cfg.yaml",
+        "config_checksum": "x",
+        "created_at": "2024-01-01T00:00:00Z",
+        "updated_at": "2024-01-01T00:00:00Z",
+        "artifacts_root": ".",
+        "models": {},
+        "env_templates": {},
+        "jobs": [
+            {
+                "job_id": "job-1",
+                "model_id": "m",
+                "env_id": "e",
+                "env_template_id": "e:t",
+                "env_variant_id": "e",
+                "env_args": {},
+                "results_relpath": "job-1/results.jsonl",
+                "metadata_relpath": "job-1/metadata.json",
+                "status": "completed",
+                "num_examples": num_examples,
+                "rollouts_per_example": rollouts_per_example,
+            }
+        ],
+        "summary": {"total": 1, "completed": 1, "pending": 0, "failed": 0, "running": 0, "skipped": 0},
+    }
+    _write_json(run_dir / "run_manifest.json", payload)
+
+
 def test_validate_manifests_reports_broken_paths(tmp_path: Path) -> None:
     runs_dir = tmp_path / "runs" / "raw"
     run_dir = runs_dir / "demo-run"
@@ -49,3 +86,66 @@ def test_validate_manifests_reports_broken_paths(tmp_path: Path) -> None:
     assert result.manifests_checked == 1
     assert result.jobs_checked == 1
     assert any(issue.kind == "warning" and "fallback" in issue.message.lower() for issue in result.issues)
+
+
+def test_validate_manifests_accepts_partial_rollout_file(tmp_path: Path) -> None:
+    runs_dir = tmp_path / "runs" / "raw"
+    run_dir = runs_dir / "demo-run"
+    job_dir = run_dir / "job-1"
+    _write_json(job_dir / "metadata.json", {"env_id": "demo"})
+    (job_dir / "results.jsonl").write_text(
+        "\n".join(
+            [
+                '{"example_id": 1, "rollout_index": 0}',
+                '{"example_id": 2, "rollout_index": 0}',
+                '{"example_id": 1, "rollout_index": 1}',
+                '{"example_id": 2, "rollout_index": 1}',
+                '{"example_id": 1, "rollout_index": 2}',
+            ]
+        )
+        + "\n",
+        encoding="utf-8",
+    )
+    _write_manifest(run_dir, num_examples=2, rollouts_per_example=3)
+
+    result = validate_manifests_in_runs(runs_dir, strict=False)
+
+    assert result.manifests_checked == 1
+    assert result.jobs_checked == 1
+    assert result.issues == []
+
+
+def test_validate_manifests_reports_out_of_range_rollout_index(tmp_path: Path) -> None:
+    runs_dir = tmp_path / "runs" / "raw"
+    run_dir = runs_dir / "demo-run"
+    job_dir = run_dir / "job-1"
+    _write_json(job_dir / "metadata.json", {"env_id": "demo"})
+    (job_dir / "results.jsonl").write_text(
+        "\n".join(
+            [
+                '{"example_id": 1, "rollout_index": 0}',
+                '{"example_id": 2, "rollout_index": 0}',
+                '{"example_id": 1, "rollout_index": 3}',
+            ]
+        )
+        + "\n",
+        encoding="utf-8",
+    )
+    _write_manifest(run_dir, num_examples=2, rollouts_per_example=3)
+
+    result = validate_manifests_in_runs(runs_dir, strict=False)
+
+    assert any("out-of-range rollout_index" in issue.message for issue in result.issues)
+
+
+def test_validate_manifests_reports_malformed_last_jsonl_row(tmp_path: Path) -> None:
+    runs_dir = tmp_path / "runs" / "raw"
+    run_dir = runs_dir / "demo-run"
+    job_dir = run_dir / "job-1"
+    _write_json(job_dir / "metadata.json", {"env_id": "demo"})
+    (job_dir / "results.jsonl").write_text('{"example_id": 1}\n{"example_id": ', encoding="utf-8")
+    _write_manifest(run_dir, num_examples=1, rollouts_per_example=1)
+
+    result = validate_manifests_in_runs(runs_dir, strict=False)
+
+    assert any("failed to parse last JSONL row" in issue.message for issue in result.issues)
diff --git a/tests/test_cli/test_process_aggregate.py b/tests/test_cli/test_process_aggregate.py
index a8675ad3..b214ad18 100644
--- a/tests/test_cli/test_process_aggregate.py
+++ b/tests/test_cli/test_process_aggregate.py
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+from medarc_verifiers.cli.process.metadata import RunIdentity
 from medarc_verifiers.cli.process.aggregate import (
     AggregatedEnvRows,
     aggregate_rows_by_env,
@@ -117,3 +118,46 @@ def test_aggregate_rows_fills_missing_rollout_index_from_suffix() -> None:
     grouped = aggregate_rows_by_env(rows)
 
     assert sorted({row["rollout_index"] for row in grouped[0].rows}) == [0, 1]
+
+
+def test_aggregate_rows_use_attached_identities_for_fake_rollouts() -> None:
+    rows = [
+        {
+            "env_id": "env-a",
+            "base_env_id": "env-a",
+            "manifest_env_id": "env-a-rollout7",
+            "model_id": "model-a",
+            "job_run_id": "run-1",
+        },
+        {
+            "env_id": "env-a",
+            "base_env_id": "env-a",
+            "manifest_env_id": "env-a-rollout3",
+            "model_id": "model-a",
+            "job_run_id": "run-2",
+        },
+    ]
+    identities = [
+        RunIdentity(
+            model_id="model-a",
+            manifest_env_id="env-a-rollout7",
+            base_env_id="env-a",
+            rollout_index=7,
+            job_run_id="run-1",
+            output_env_id="env-a",
+        ),
+        RunIdentity(
+            model_id="model-a",
+            manifest_env_id="env-a-rollout3",
+            base_env_id="env-a",
+            rollout_index=3,
+            job_run_id="run-2",
+            output_env_id="env-a",
+        ),
+    ]
+
+    grouped = aggregate_rows_by_env(rows, identities=identities)
+
+    assert len(grouped) == 1
+    assert grouped[0].env_id == "env-a"
+    assert sorted({row["rollout_index"] for row in grouped[0].rows}) == [0, 1]
diff --git a/tests/test_cli/test_process_discovery.py b/tests/test_cli/test_process_discovery.py
index 5b532fcd..a41a6bed 100644
--- a/tests/test_cli/test_process_discovery.py
+++ b/tests/test_cli/test_process_discovery.py
@@ -3,7 +3,7 @@
 import json
 from pathlib import Path
 
-from medarc_verifiers.cli.process.discovery import discover_run_records
+from medarc_verifiers.cli.process.discovery import RunManifestInfo, discover_run_records
 
 
 def _write_json(path: Path, payload: dict) -> None:
@@ -33,6 +33,23 @@ def _base_manifest(
     }
 
 
+def _manifest_info(*, completed: int, total: int, total_known: bool) -> RunManifestInfo:
+    return RunManifestInfo(
+        job_run_id="job-run-123",
+        run_name="example-run",
+        summary_completed=completed,
+        summary_total=total,
+        summary_total_known=total_known,
+        manifest_path=Path("/tmp/run_manifest.json"),
+        run_dir=Path("/tmp/job-run-123"),
+        created_at="2024-01-01T00:00:00Z",
+        updated_at="2024-01-01T00:05:00Z",
+        config_source="configs/example.yaml",
+        config_checksum="abc123",
+        run_summary_path=Path("/tmp/run_summary.json"),
+    )
+
+
 def test_discover_run_records_basic(tmp_path: Path) -> None:
     runs_dir = tmp_path / "runs"
     run_dir = runs_dir / "job-run-123"
@@ -53,8 +70,10 @@ def test_discover_run_records_basic(tmp_path: Path) -> None:
                 "status": "completed",
                 "started_at": "2024-01-01T00:00:30Z",
                 "ended_at": "2024-01-01T00:01:00Z",
+                "avg_reward": 0.75,
                 "num_examples": 10,
                 "rollouts_per_example": 2,
+                "row_count": 20,
             }
         ],
         models={"gpt-4": {"sampling_args": {"temperature": 0.2}}},
@@ -90,6 +109,8 @@ def test_discover_run_records_basic(tmp_path: Path) -> None:
     assert record.has_summary is True
     assert record.env_args == {"fold": "dev"}
     assert record.sampling_args == {"temperature": 0.2}
+    assert record.avg_reward == 0.75
+    assert record.row_count == 20
     assert record.manifest.job_run_id == "job-run-123"
 
 
@@ -129,34 +150,6 @@ def test_discover_run_records_filters_status(tmp_path: Path) -> None:
     assert filtered_none == []
 
 
-def test_discover_run_records_only_complete_runs_missing_total(tmp_path: Path) -> None:
-    runs_dir = tmp_path / "runs"
-    run_dir = runs_dir / "job-run-123"
-    results_dir = run_dir / "model-env-job"
-
-    manifest_payload = _base_manifest(
-        [
-            {
-                "job_id": "model-env-job",
-                "model_id": "gpt-4",
-                "env_id": "demo-env-module",
-                "env_template_id": "demo-env-template",
-                "env_variant_id": "demo-env",
-                "env_args": {},
-                "results_relpath": "model-env-job/results.jsonl",
-            }
-        ],
-        models={"gpt-4": {"sampling_args": {}}},
-        env_templates={"demo-env-template": {"module": "demo-env-module"}},
-    )
-    _write_json(run_dir / "run_manifest.json", manifest_payload)
-    results_dir.mkdir(parents=True, exist_ok=True)
-    (results_dir / "results.jsonl").write_text("{}", encoding="utf-8")
-
-    records = discover_run_records(runs_dir, only_complete_runs=True)
-    assert len(records) == 1
-
-
 def test_discover_run_records_missing_summary_uses_manifest_status(tmp_path: Path) -> None:
     runs_dir = tmp_path / "runs"
     run_dir = runs_dir / "job-run-123"
diff --git a/tests/test_cli/test_process_hf_sync.py b/tests/test_cli/test_process_hf_sync.py
index 0de114a4..a463e27f 100644
--- a/tests/test_cli/test_process_hf_sync.py
+++ b/tests/test_cli/test_process_hf_sync.py
@@ -1,15 +1,18 @@
 from __future__ import annotations
 
+import hashlib
 from pathlib import Path
+from types import SimpleNamespace
 
 import pytest
 
 from medarc_verifiers.cli import hf as hf_sync
+from medarc_verifiers.cli.hf import sync as hf_sync_impl
 from medarc_verifiers.cli.process.aggregate import aggregate_rows_by_env
 from medarc_verifiers.cli.process.writer import WriterConfig, write_env_groups
 
 
-def test_sync_to_hub_dry_run_builds_summary(tmp_path: Path) -> None:
+def test_sync_to_hub_dry_run_returns_none(tmp_path: Path) -> None:
     rows = [
         {"base_env_id": "env-a", "env_id": "env-a", "job_run_id": "run-1", "example_id": "ex-1", "rollout_index": 0}
     ]
@@ -30,11 +33,7 @@ def test_sync_to_hub_dry_run_builds_summary(tmp_path: Path) -> None:
         output_dir=tmp_path,
         metadata_paths=[tmp_path / "env_index.json", tmp_path / "dataset_infos.json"],
     )
-    assert summary is not None
-    assert summary.total_rows == len(rows)
-    assert summary.total_files == 3
-    assert "env_index.json" in summary.files
-    assert "dataset_infos.json" in summary.files
+    assert summary is None
 
 
 def test_sync_to_hub_uses_token(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
@@ -87,7 +86,7 @@ def create_commit(self, **_kwargs: object) -> None:
     assert captured.get("create_commit") is True
 
 
-def test_sync_to_hub_does_not_double_prefix_metadata_paths(
+def test_sync_to_hub_dry_run_with_relative_output_paths_returns_none(
     monkeypatch: pytest.MonkeyPatch,
     tmp_path: Path,
 ) -> None:
@@ -113,10 +112,179 @@ def test_sync_to_hub_does_not_double_prefix_metadata_paths(
         output_dir=output_dir,
         metadata_paths=[output_dir / "env_index.json", output_dir / "dataset_infos.json"],
     )
+    assert summary is None
+
+
+@pytest.mark.parametrize(
+    ("remote_case", "expected_pending"),
+    [
+        ("missing", {"model-a/env-a.parquet"}),
+        ("match", set()),
+        ("mismatch", {"model-a/env-a.parquet"}),
+        ("no-lfs", {"model-a/env-a.parquet"}),
+    ],
+)
+def test_compute_pending_parquet_uploads_detects_remote_state(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+    remote_case: str,
+    expected_pending: set[str],
+) -> None:
+    parquet_path = tmp_path / "model-a" / "env-a.parquet"
+    parquet_path.parent.mkdir(parents=True, exist_ok=True)
+    parquet_path.write_text("local-data", encoding="utf-8")
+    local_sha = hashlib.sha256(parquet_path.read_bytes()).hexdigest()
+
+    class FakeLFS:
+        def __init__(self, sha256: str | None) -> None:
+            self.sha256 = sha256
+
+    class FakeTreeEntry:
+        def __init__(self, path: str, lfs: object | None) -> None:
+            self.path = path
+            self.lfs = lfs
+
+    class FakeApi:
+        def __init__(self, token: str | None = None) -> None:
+            self.token = token
+
+        def list_repo_tree(self, **_kwargs: object) -> list[FakeTreeEntry]:
+            if remote_case == "missing":
+                return []
+            if remote_case == "no-lfs":
+                return [FakeTreeEntry("model-a/env-a.parquet", None)]
+            sha256 = local_sha if remote_case == "match" else "0" * 64
+            return [FakeTreeEntry("model-a/env-a.parquet", FakeLFS(sha256))]
+
+    import sys
+
+    monkeypatch.setitem(sys.modules, "huggingface_hub", SimpleNamespace(HfApi=FakeApi))
+
+    pending = hf_sync.compute_pending_parquet_uploads(
+        output_dir=tmp_path,
+        repo_id="demo/repo",
+        branch="main",
+        token="secret-token",
+    )
+
+    assert pending == expected_pending
+
+
+def test_sync_to_hub_explicit_files_uploads_exact_list(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
+    (tmp_path / "keep.parquet").write_text("1", encoding="utf-8")
+    (tmp_path / "meta.json").write_text("{}", encoding="utf-8")
+
+    captured: dict[str, object] = {}
+
+    class FakeOp:
+        def __init__(self, *args: object, **kwargs: object) -> None:
+            captured.setdefault("ops", []).append((args, kwargs))
+
+    class FakeApi:
+        def __init__(self, token: str | None = None) -> None:
+            captured["token"] = token
+
+        def create_repo(self, **_kwargs: object) -> None:
+            captured["create_repo"] = True
+
+        def create_commit(self, **kwargs: object) -> None:
+            captured["create_commit"] = kwargs
+
+    import sys
+
+    monkeypatch.setitem(sys.modules, "huggingface_hub", SimpleNamespace(CommitOperationAdd=FakeOp, HfApi=FakeApi))
+
+    summary = hf_sync.sync_to_hub(
+        [],
+        hf_sync.HFSyncConfig(repo_id="local/test", token="secret-token"),
+        output_dir=tmp_path,
+        files=[tmp_path / "keep.parquet", "meta.json"],
+    )
+
     assert summary is not None
-    assert "env_index.json" in summary.files
-    assert "dataset_infos.json" in summary.files
-    assert "runs/processed/env_index.json" not in summary.files
+    assert summary.files == ["keep.parquet", "meta.json"]
+    assert summary.total_files == 2
+    assert summary.total_rows == 0
+    assert captured["token"] == "secret-token"
+    assert captured.get("create_commit") is not None
+
+
+def test_sync_to_hub_explicit_files_respects_dry_run(tmp_path: Path) -> None:
+    (tmp_path / "keep.parquet").write_text("1", encoding="utf-8")
+
+    summary = hf_sync.sync_to_hub(
+        [],
+        hf_sync.HFSyncConfig(repo_id="local/test", dry_run=True),
+        output_dir=tmp_path,
+        files=["keep.parquet"],
+    )
+
+    assert summary is None
+
+
+@pytest.mark.parametrize("bad_path", ["/tmp/escape.txt", "../escape.txt"])
+def test_sync_files_to_hub_rejects_unsafe_paths(tmp_path: Path, bad_path: str) -> None:
+    with pytest.raises(ValueError, match="output_dir|traversal"):
+        hf_sync.sync_files_to_hub(
+            repo_id="local/test",
+            output_dir=tmp_path,
+            files=[bad_path],
+            token=None,
+            private=False,
+            message="msg",
+            dry_run=True,
+        )
+
+
+def test_transient_hf_errors_include_statuses_timeouts_and_transport() -> None:
+    import httpx
+
+    class StatusError(Exception):
+        def __init__(self, status_code: int) -> None:
+            super().__init__(f"status={status_code}")
+            self.response = SimpleNamespace(status_code=status_code)
+
+    assert hf_sync_impl._is_transient_hf_error(StatusError(429)) is True
+    assert hf_sync_impl._is_transient_hf_error(StatusError(503)) is True
+    assert hf_sync_impl._is_transient_hf_error(httpx.TimeoutException("timeout")) is True
+    assert hf_sync_impl._is_transient_hf_error(httpx.TransportError("transport")) is True
+
+
+def test_compute_pending_parquet_uploads_retries_without_expand_on_compat_error(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    parquet_path = tmp_path / "model-a" / "env-a.parquet"
+    parquet_path.parent.mkdir(parents=True, exist_ok=True)
+    parquet_path.write_text("local-data", encoding="utf-8")
+
+    calls: list[dict[str, object]] = []
+
+    class FakeApi:
+        def __init__(self, token: str | None = None) -> None:
+            self.token = token
+
+        def list_repo_tree(self, **kwargs: object):
+            calls.append(kwargs)
+            if "expand" in kwargs:
+                raise TypeError("unexpected keyword argument 'expand'")
+            return [SimpleNamespace(path="model-a/env-a.parquet", lfs=None)]
+
+    import sys
+
+    monkeypatch.setitem(sys.modules, "huggingface_hub", SimpleNamespace(HfApi=FakeApi))
+
+    pending = hf_sync.compute_pending_parquet_uploads(
+        output_dir=tmp_path,
+        repo_id="demo/repo",
+        branch="main",
+        token="secret-token",
+    )
+
+    assert pending == {"model-a/env-a.parquet"}
+    assert len(calls) == 2
+    assert "expand" in calls[0]
+    assert "expand" not in calls[1]
 
 
 def test_sync_files_to_hub_creates_repo_with_confirmation(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
@@ -174,3 +342,61 @@ def create_commit(self, **_kwargs: object) -> None:
     assert captured.get("create_repo") is not None
     assert captured.get("create_commit") is True
     assert captured["create_commit_calls"] == 2
+
+
+def test_sync_files_to_hub_skips_when_repo_creation_declined(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+    caplog: pytest.LogCaptureFixture,
+) -> None:
+    (tmp_path / "artifact.json").write_text("{}", encoding="utf-8")
+
+    captured: dict[str, object] = {"create_commit_calls": 0}
+
+    class FakeResponse:
+        status_code = 404
+
+    class FakeRepoNotFound(Exception):
+        def __init__(self) -> None:
+            super().__init__("Repository Not Found")
+            self.response = FakeResponse()
+
+    class FakeOp:
+        def __init__(self, *args: object, **kwargs: object) -> None:
+            captured["op"] = (args, kwargs)
+
+    class FakeApi:
+        def __init__(self, token: str | None = None) -> None:
+            captured["token"] = token
+
+        def create_repo(self, **kwargs: object) -> None:
+            captured["create_repo"] = kwargs
+
+        def create_commit(self, **_kwargs: object) -> None:
+            captured["create_commit_calls"] = int(captured["create_commit_calls"]) + 1
+            raise FakeRepoNotFound()
+
+    import sys
+    import types
+
+    fake_module = types.SimpleNamespace(CommitOperationAdd=FakeOp, HfApi=FakeApi)
+    monkeypatch.setitem(sys.modules, "huggingface_hub", fake_module)
+
+    with caplog.at_level("WARNING"):
+        uploaded = hf_sync.sync_files_to_hub(
+            repo_id="local/missing",
+            output_dir=tmp_path,
+            files=["artifact.json"],
+            token="secret-token",
+            private=True,
+            message="msg",
+            dry_run=False,
+            is_tty=True,
+            assume_yes=False,
+            prompt_func=lambda _prompt: "n",
+        )
+
+    assert uploaded is False
+    assert captured["create_commit_calls"] == 1
+    assert captured.get("create_repo") is None
+    assert "skipping upload because repo creation was declined" in caplog.text
diff --git a/tests/test_cli/test_process_metadata.py b/tests/test_cli/test_process_metadata.py
index 30fc1718..e69b8d46 100644
--- a/tests/test_cli/test_process_metadata.py
+++ b/tests/test_cli/test_process_metadata.py
@@ -1,8 +1,11 @@
 from __future__ import annotations
 
 import json
+import logging
 from pathlib import Path
 
+import pytest
+
 from medarc_verifiers.cli.process.discovery import RunManifestInfo, RunRecord
 from medarc_verifiers.cli.process.metadata import load_normalized_metadata
 
@@ -19,6 +22,7 @@ def _make_record(
     results_dir_name: str = "job-abc",
     env_args: dict | None = None,
     sampling_args: dict | None = None,
+    avg_reward: float | None = None,
     num_examples: int | None = 10,
     rollouts_per_example: int | None = None,
     has_metadata: bool = True,
@@ -60,8 +64,10 @@ def _make_record(
         reason=None,
         started_at="2024-01-01T00:00:10Z",
         ended_at="2024-01-01T00:00:50Z",
+        avg_reward=avg_reward,
         num_examples=num_examples,
         rollouts_per_example=rollouts_per_example,
+        row_count=1,
         env_args=env_args or {},
         sampling_args=sampling_args or {},
         env_config=env_config or {},
@@ -75,6 +81,7 @@ def test_load_normalized_metadata_prefers_manifest_fields(tmp_path: Path) -> Non
         tmp_path,
         env_args={"difficulty": "hard"},
         sampling_args={"temperature": 0.1},
+        avg_reward=0.8,
         rollouts_per_example=None,
     )
     _write_json(
@@ -84,6 +91,7 @@ def test_load_normalized_metadata_prefers_manifest_fields(tmp_path: Path) -> Non
             "model": "gpt-4o-mini",
             "env_args": {"difficulty": "easy", "split": "dev"},
             "sampling_args": {"temperature": 0.9, "top_p": 0.95},
+            "avg_reward": 0.8,
             "num_examples": 20,
             "rollouts_per_example": 2,
         },
@@ -210,3 +218,97 @@ def test_load_normalized_metadata_validation_failure_sanitizes_raw_metadata(tmp_
         "endpoint_id": "cluster-a",
         "base_url": "https://example.invalid/v1",
     }
+
+
+def test_load_normalized_metadata_keeps_zero_num_examples_from_manifest(tmp_path: Path) -> None:
+    record = _make_record(tmp_path, manifest_env_id="demo-env", num_examples=0, rollouts_per_example=1)
+    _write_json(
+        record.metadata_path,
+        {
+            "env_id": "demo-env",
+            "num_examples": 20,
+            "rollouts_per_example": 3,
+        },
+    )
+
+    normalized = load_normalized_metadata(record)
+
+    assert normalized.num_examples == 0
+    assert normalized.rollouts_per_example == 1
+
+
+def test_load_normalized_metadata_keeps_zero_rollouts_from_manifest(tmp_path: Path) -> None:
+    record = _make_record(tmp_path, manifest_env_id="demo-env", num_examples=10, rollouts_per_example=0)
+    _write_json(
+        record.metadata_path,
+        {
+            "env_id": "demo-env",
+            "num_examples": 20,
+            "rollouts_per_example": 3,
+        },
+    )
+
+    normalized = load_normalized_metadata(record)
+
+    assert normalized.num_examples == 10
+    assert normalized.rollouts_per_example == 0
+
+
+def test_load_normalized_metadata_keeps_all_examples_sentinel_from_manifest(tmp_path: Path) -> None:
+    record = _make_record(tmp_path, manifest_env_id="demo-env", num_examples=-1, rollouts_per_example=1)
+    _write_json(
+        record.metadata_path,
+        {
+            "env_id": "demo-env",
+            "num_examples": 20,
+            "rollouts_per_example": 3,
+        },
+    )
+
+    normalized = load_normalized_metadata(record)
+
+    assert normalized.num_examples == -1
+    assert normalized.rollouts_per_example == 1
+
+
+def test_load_normalized_metadata_warns_on_avg_reward_and_num_examples_mismatch(
+    tmp_path: Path,
+    caplog: pytest.LogCaptureFixture,
+) -> None:
+    record = _make_record(tmp_path, manifest_env_id="demo-env", avg_reward=0.8, num_examples=10)
+    _write_json(
+        record.metadata_path,
+        {
+            "env_id": "demo-env",
+            "avg_reward": 0.7,
+            "num_examples": 12,
+        },
+    )
+
+    with caplog.at_level(logging.WARNING):
+        normalized = load_normalized_metadata(record)
+
+    assert normalized.num_examples == 10
+    assert "Manifest/metadata result mismatch for process input" in caplog.text
+    assert "avg_reward manifest=0.8 metadata=0.7" in caplog.text
+    assert "num_examples manifest=10 metadata=12" in caplog.text
+
+
+def test_load_normalized_metadata_does_not_warn_when_result_fields_match(
+    tmp_path: Path,
+    caplog: pytest.LogCaptureFixture,
+) -> None:
+    record = _make_record(tmp_path, manifest_env_id="demo-env", avg_reward=0.8, num_examples=10)
+    _write_json(
+        record.metadata_path,
+        {
+            "env_id": "demo-env",
+            "avg_reward": 0.8,
+            "num_examples": 10,
+        },
+    )
+
+    with caplog.at_level(logging.WARNING):
+        load_normalized_metadata(record)
+
+    assert "Manifest/metadata result mismatch for process input" not in caplog.text
diff --git a/tests/test_cli/test_process_pipeline.py b/tests/test_cli/test_process_pipeline.py
index b55eaf0a..51d0d7af 100644
--- a/tests/test_cli/test_process_pipeline.py
+++ b/tests/test_cli/test_process_pipeline.py
@@ -8,10 +8,13 @@
 
 from medarc_verifiers.cli._manifest import MANIFEST_VERSION
 from medarc_verifiers.cli._schemas import EnvironmentExportConfig
+from medarc_verifiers.cli.hf import HFSyncConfig
 from medarc_verifiers.cli.process import ProcessOptions, run_process
+from medarc_verifiers.cli.process import workspace
+from medarc_verifiers.cli.process.discovery import RunManifestInfo, RunRecord, discover_run_records
+from medarc_verifiers.cli.process.pipeline import select_work_items
 from medarc_verifiers.cli.winrate import WinrateConfig
 from medarc_verifiers.cli.winrate import discover_datasets, run_winrate
-from medarc_verifiers.cli.hf import HFSyncConfig
 from medarc_verifiers.cli.process.writer import ALLOWED_COLUMNS
 
 
@@ -20,6 +23,82 @@ def _write_json(path: Path, payload: dict) -> None:
     path.write_text(json.dumps(payload), encoding="utf-8")
 
 
+def _manifest_info(
+    *,
+    run_id: str,
+    completed: int,
+    total: int,
+    total_known: bool = True,
+    updated_at: str = "2024-01-01T00:00:00Z",
+) -> RunManifestInfo:
+    run_dir = Path("/tmp") / run_id
+    return RunManifestInfo(
+        job_run_id=run_id,
+        run_name=run_id,
+        summary_completed=completed,
+        summary_total=total,
+        summary_total_known=total_known,
+        manifest_path=run_dir / "run_manifest.json",
+        run_dir=run_dir,
+        created_at="2024-01-01T00:00:00Z",
+        updated_at=updated_at,
+        config_source="configs/demo.yaml",
+        config_checksum="abc123",
+        run_summary_path=run_dir / "run_summary.json",
+    )
+
+
+def _run_record(
+    *,
+    run_id: str,
+    job_id: str,
+    env_id: str,
+    model_id: str = "gpt-mini",
+    completed: int = 1,
+    total: int = 1,
+    total_known: bool = True,
+    updated_at: str = "2024-01-01T00:00:00Z",
+    row_count: int | None = 1,
+    num_examples: int | None = 1,
+    rollouts_per_example: int | None = 1,
+) -> RunRecord:
+    run_dir = Path("/tmp") / run_id
+    results_dir = run_dir / job_id
+    return RunRecord(
+        manifest=_manifest_info(
+            run_id=run_id,
+            completed=completed,
+            total=total,
+            total_known=total_known,
+            updated_at=updated_at,
+        ),
+        job_id=job_id,
+        model_id=model_id,
+        manifest_env_id=env_id,
+        results_dir_name=job_id,
+        results_dir=results_dir,
+        metadata_path=results_dir / "metadata.json",
+        results_path=results_dir / "results.jsonl",
+        summary_path=results_dir / "summary.json",
+        has_metadata=False,
+        has_results=True,
+        has_summary=True,
+        status="completed",
+        duration_seconds=1.0,
+        reason=None,
+        started_at="2024-01-01T00:00:00Z",
+        ended_at="2024-01-01T00:00:01Z",
+        avg_reward=1.0,
+        num_examples=num_examples,
+        rollouts_per_example=rollouts_per_example,
+        row_count=row_count,
+        env_args={},
+        sampling_args={},
+        env_config={"id": env_id, "module": env_id},
+        model_config={},
+    )
+
+
 def _setup_run(tmp_path: Path) -> Path:
     runs_dir = tmp_path / "runs"
     run_dir = runs_dir / "run-1"
@@ -52,6 +131,10 @@ def _setup_run(tmp_path: Path) -> Path:
                 "env_variant_id": "demo-env-rollout3",
                 "env_args": {},
                 "results_dir": "demo-job",
+                "status": "completed",
+                "num_examples": 1,
+                "rollouts_per_example": 1,
+                "row_count": 1,
             }
         ],
     }
@@ -93,10 +176,17 @@ def _write_run(
     reward: float,
     env_id: str = "demo-env-rollout3",
     model_id: str = "gpt-mini",
+    status: str = "completed",
+    results_text: str | None = None,
+    row_count: int | None = 1,
+    num_examples: int | None = 1,
+    rollouts_per_example: int | None = 1,
+    write_results: bool = True,
+    job_id: str = "demo-job",
 ) -> Path:
     runs_dir = tmp_path / "runs"
     run_dir = runs_dir / run_id
-    results_dir = run_dir / "demo-job"
+    results_dir = run_dir / job_id
     manifest = {
         "version": MANIFEST_VERSION,
         "run_id": run_id,
@@ -110,21 +200,25 @@ def _write_run(
         "env_templates": {"demo-env-template": {"module": env_id}},
         "summary": {
             "total": 1,
-            "completed": 1,
+            "completed": 1 if status == "completed" else 0,
             "pending": 0,
             "running": 0,
-            "failed": 0,
+            "failed": 1 if status == "failed" else 0,
             "skipped": 0,
         },
         "jobs": [
             {
-                "job_id": "demo-job",
+                "job_id": job_id,
                 "model_id": model_id,
                 "env_id": env_id,
                 "env_template_id": "demo-env-template",
                 "env_variant_id": env_id,
                 "env_args": {},
-                "results_dir": "demo-job",
+                "results_dir": job_id,
+                "status": status,
+                "row_count": row_count,
+                "num_examples": num_examples,
+                "rollouts_per_example": rollouts_per_example,
             }
         ],
     }
@@ -133,15 +227,34 @@ def _write_run(
         "env_id": env_id,
         "env_args": {},
         "sampling_args": {},
+        "num_examples": num_examples,
+        "rollouts_per_example": rollouts_per_example,
     }
     _write_json(results_dir / "metadata.json", metadata)
     results_path = results_dir / "results.jsonl"
-    results_path.parent.mkdir(parents=True, exist_ok=True)
-    row = {"example_id": f"ex-{run_id}", "reward": reward}
-    results_path.write_text(json.dumps(row) + "\n", encoding="utf-8")
+    if write_results:
+        results_path.parent.mkdir(parents=True, exist_ok=True)
+        if results_text is None:
+            row = {"example_id": f"ex-{run_id}", "reward": reward}
+            results_text = json.dumps(row) + "\n"
+        results_path.write_text(results_text, encoding="utf-8")
     return runs_dir
 
 
+def _remove_model_id(tmp_path: Path, run_id: str) -> None:
+    manifest_path = tmp_path / "runs" / run_id / "run_manifest.json"
+    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
+    manifest["jobs"][0]["model_id"] = None
+    manifest["models"] = {}
+    manifest_path.write_text(json.dumps(manifest), encoding="utf-8")
+
+    job_id = manifest["jobs"][0]["job_id"]
+    metadata_path = tmp_path / "runs" / run_id / job_id / "metadata.json"
+    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
+    metadata.pop("model", None)
+    metadata_path.write_text(json.dumps(metadata), encoding="utf-8")
+
+
 def test_run_process_respects_env_export_defaults(tmp_path: Path) -> None:
     runs_dir = _setup_run(tmp_path)
     options = ProcessOptions(
@@ -205,6 +318,24 @@ def test_run_process_writes_version_info_column(tmp_path: Path) -> None:
     assert payload["vf_version"] == "0.1.10"
 
 
+def test_run_process_preserves_string_example_id_in_parquet(tmp_path: Path) -> None:
+    runs_dir = _setup_run(tmp_path)
+    output_dir = tmp_path / "processed"
+
+    result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=output_dir,
+            dry_run=False,
+            max_workers=1,
+        )
+    )
+
+    table = pq.read_table(result.env_summaries[0].output_path)
+    assert table.column("example_id").to_pylist() == ["ex-1"]
+    assert str(table.schema.field("example_id").type) == "large_string"
+
+
 def test_run_process_backward_compat_without_version_info(tmp_path: Path) -> None:
     runs_dir = _write_run(
         tmp_path,
@@ -260,6 +391,324 @@ def test_run_process_excludes_datasets(tmp_path: Path) -> None:
     assert result.env_groups[0].base_env_id == "keep-env"
 
 
+def test_process_allows_results_missing_pct_within_threshold(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-98pct",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=1.0,
+        row_count=98,
+        num_examples=100,
+        rollouts_per_example=1,
+    )
+    options = ProcessOptions(
+        runs_dir=runs_dir,
+        output_dir=tmp_path / "processed",
+        max_results_missing_pct=2.5,
+        dry_run=True,
+        max_workers=1,
+    )
+
+    result = run_process(options)
+
+    assert result.records_processed == 1
+    assert result.rows_processed == 1
+
+
+def test_process_rejects_results_missing_pct_above_threshold(tmp_path: Path) -> None:
+    results_text = "".join(json.dumps({"example_id": f"ex-{index}", "reward": 1.0}) + "\n" for index in range(90))
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-90pct",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=1.0,
+        row_count=90,
+        num_examples=100,
+        rollouts_per_example=1,
+        results_text=results_text,
+    )
+    options = ProcessOptions(
+        runs_dir=runs_dir,
+        output_dir=tmp_path / "processed",
+        max_results_missing_pct=2.5,
+        dry_run=True,
+        max_workers=1,
+    )
+
+    with pytest.raises(RuntimeError) as excinfo:
+        run_process(options)
+
+    message = str(excinfo.value)
+    assert "run-90pct" in message
+    assert "expected_rows=100" in message
+    assert "observed_rows=90" in message
+    assert "missing_pct=10.00" in message
+    assert "threshold=2.5" in message
+
+
+def test_process_allows_ungateable_record_when_expected_rows_unknown(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-unknown-expected",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=1.0,
+        row_count=10,
+        num_examples=None,
+        rollouts_per_example=1,
+    )
+    options = ProcessOptions(
+        runs_dir=runs_dir,
+        output_dir=tmp_path / "processed",
+        dry_run=True,
+        max_workers=1,
+    )
+
+    result = run_process(options)
+
+    assert result.records_processed == 1
+
+
+def test_process_allows_ungateable_record_when_row_count_unknown(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-unknown-observed",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=1.0,
+        row_count=None,
+        num_examples=100,
+        rollouts_per_example=1,
+    )
+    options = ProcessOptions(
+        runs_dir=runs_dir,
+        output_dir=tmp_path / "processed",
+        dry_run=True,
+        max_workers=1,
+    )
+
+    result = run_process(options)
+
+    assert result.records_processed == 1
+
+
+def test_process_latest_record_that_fails_gate_does_not_fall_back(tmp_path: Path) -> None:
+    _write_run(
+        tmp_path,
+        run_id="run-older-ok",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=1.0,
+        row_count=100,
+        num_examples=100,
+        rollouts_per_example=1,
+    )
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-newer-bad",
+        updated_at="2024-01-02T00:00:00Z",
+        reward=0.0,
+        row_count=90,
+        num_examples=100,
+        rollouts_per_example=1,
+    )
+    options = ProcessOptions(
+        runs_dir=runs_dir,
+        output_dir=tmp_path / "processed",
+        max_results_missing_pct=2.5,
+        dry_run=True,
+        max_workers=1,
+    )
+
+    with pytest.raises(RuntimeError) as excinfo:
+        run_process(options)
+
+    message = str(excinfo.value)
+    assert "run-newer-bad" in message
+    assert "run-older-ok" not in message
+
+
+def test_process_rejects_missing_results_jsonl_for_selected_latest_record(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-missing-results",
+        updated_at="2024-01-02T00:00:00Z",
+        reward=1.0,
+        row_count=100,
+        num_examples=100,
+        rollouts_per_example=1,
+        write_results=False,
+    )
+    options = ProcessOptions(
+        runs_dir=runs_dir,
+        output_dir=tmp_path / "processed",
+        dry_run=True,
+        max_workers=1,
+    )
+
+    with pytest.raises(RuntimeError) as excinfo:
+        run_process(options)
+
+    message = str(excinfo.value)
+    assert "missing results.jsonl files" in message
+    assert "run-missing-results" in message
+
+
+def test_process_gate_ignores_excluded_record(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-excluded-bad",
+        updated_at="2024-01-02T00:00:00Z",
+        reward=1.0,
+        env_id="skip-env",
+        row_count=90,
+        num_examples=100,
+        rollouts_per_example=1,
+    )
+    options = ProcessOptions(
+        runs_dir=runs_dir,
+        output_dir=tmp_path / "processed",
+        exclude_datasets=("skip-env",),
+        max_results_missing_pct=2.5,
+        dry_run=True,
+        max_workers=1,
+    )
+
+    result = run_process(options)
+
+    assert result.records_processed == 0
+
+
+def test_process_stale_delta_output_does_not_mask_newer_incomplete_run(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-initial",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=1.0,
+        row_count=100,
+        num_examples=100,
+        rollouts_per_example=1,
+    )
+    output_dir = tmp_path / "processed"
+    initial = run_process(ProcessOptions(runs_dir=runs_dir, output_dir=output_dir, dry_run=False, max_workers=1))
+    assert initial.records_processed == 1
+
+    results_text = "".join(json.dumps({"example_id": f"ex-{index}", "reward": 0.0}) + "\n" for index in range(90))
+    _write_run(
+        tmp_path,
+        run_id="run-newer-bad",
+        updated_at="2024-01-02T00:00:00Z",
+        reward=0.0,
+        row_count=90,
+        num_examples=100,
+        rollouts_per_example=1,
+        results_text=results_text,
+    )
+
+    with pytest.raises(RuntimeError) as excinfo:
+        run_process(
+            ProcessOptions(
+                runs_dir=runs_dir,
+                output_dir=output_dir,
+                max_results_missing_pct=2.5,
+                dry_run=False,
+                max_workers=1,
+            )
+        )
+
+    message = str(excinfo.value)
+    assert "run-newer-bad" in message
+    assert "missing_pct=10.00" in message
+
+
+def test_process_emits_single_warning_for_ungateable_selected_records(
+    tmp_path: Path,
+    caplog: pytest.LogCaptureFixture,
+) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-unknown-observed",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=1.0,
+        row_count=None,
+        num_examples=100,
+        rollouts_per_example=1,
+    )
+    caplog.set_level("WARNING")
+
+    result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=tmp_path / "processed",
+            dry_run=True,
+            max_workers=1,
+        )
+    )
+
+    assert result.records_processed == 1
+    warnings = [
+        record for record in caplog.records if "Results row completeness gate could not be applied" in record.msg
+    ]
+    assert len(warnings) == 1
+
+
+def test_process_uses_actual_results_rows_when_manifest_row_count_is_stale(
+    tmp_path: Path,
+    caplog: pytest.LogCaptureFixture,
+) -> None:
+    results_text = "".join(json.dumps({"example_id": f"ex-{index}", "reward": 1.0}) + "\n" for index in range(100))
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-stale-row-count",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=1.0,
+        row_count=90,
+        num_examples=100,
+        rollouts_per_example=1,
+        results_text=results_text,
+    )
+    caplog.set_level("WARNING")
+
+    result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=tmp_path / "processed",
+            dry_run=True,
+            max_workers=1,
+        )
+    )
+
+    assert result.records_processed == 1
+    assert "Manifest row_count mismatch for process input" in caplog.text
+    assert "manifest row_count=90 actual_rows=100" in caplog.text
+
+
+def test_select_work_items_rollout_gate_error_includes_output_and_manifest_ids(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-rollout-bad",
+        updated_at="2024-01-02T00:00:00Z",
+        reward=1.0,
+        env_id="demo-env-rollout3",
+        row_count=90,
+        num_examples=100,
+        rollouts_per_example=1,
+    )
+    discovered = discover_run_records(runs_dir, filter_status=("completed",))
+    options = ProcessOptions(
+        runs_dir=runs_dir,
+        output_dir=tmp_path / "processed",
+        max_results_missing_pct=2.5,
+        dry_run=True,
+        max_workers=1,
+    )
+
+    with pytest.raises(RuntimeError) as excinfo:
+        select_work_items(discovered, options=options, env_export_map={}, index_files={})
+
+    message = str(excinfo.value)
+    assert "output_env_id=demo-env" in message
+    assert "manifest_env_id=demo-env-rollout3" in message
+    assert "job_id=demo-job" in message
+
+
 def test_run_process_excludes_models(tmp_path: Path) -> None:
     _write_run(
         tmp_path,
@@ -516,7 +965,7 @@ def test_run_process_empty_runs_returns_result(tmp_path: Path) -> None:
     assert result.hf_summary is None
 
 
-def test_process_latest_only_selects_latest_and_delta_skips(tmp_path: Path) -> None:
+def test_process_latest_only_selects_latest_and_skips_existing_outputs(tmp_path: Path) -> None:
     runs_dir = _write_run(tmp_path, run_id="run-1", updated_at="2024-01-01T00:00:00Z", reward=0.1)
     _write_run(tmp_path, run_id="run-2", updated_at="2024-01-02T00:00:00Z", reward=0.9)
     output_dir = tmp_path / "processed"
@@ -540,6 +989,385 @@ def test_process_latest_only_selects_latest_and_delta_skips(tmp_path: Path) -> N
     assert result_repeat.env_summaries == []
     assert result_repeat.rows_processed == 0
 
+    _write_run(tmp_path, run_id="run-3", updated_at="2024-01-04T00:00:00Z", reward=0.4)
+    result_newer_raw = run_process(options)
+    assert result_newer_raw.env_summaries
+    newer_table = pq.read_table(result_newer_raw.env_summaries[0].output_path)
+    assert set(newer_table.column("job_run_id").to_pylist()) == {"run-3"}
+    assert newer_table.column("reward").to_pylist() == [0.4]
+
+
+def test_run_process_continue_upload_syncs_pending_parquets_without_new_deltas(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    runs_dir = _write_run(tmp_path, run_id="run-1", updated_at="2024-01-01T00:00:00Z", reward=0.1, env_id="demo-env")
+    output_dir = tmp_path / "processed"
+
+    first_result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=output_dir,
+            dry_run=False,
+            max_workers=1,
+        )
+    )
+    pending_path = first_result.env_summaries[0].output_path.relative_to(output_dir).as_posix()
+    captured: dict[str, object] = {}
+
+    def fake_prepare_output_workspace(**_kwargs: object) -> workspace.WorkspacePreparationResult:
+        return workspace.WorkspacePreparationResult(
+            baseline_result=workspace.BaselineResult(
+                policy="continue-upload",
+                pending_parquet_uploads={pending_path},
+            )
+        )
+
+    def fake_sync_to_hub(
+        env_summaries,
+        config,
+        *,
+        output_dir,
+        metadata_paths=None,
+        files=None,
+        **_kwargs,
+    ):
+        captured["env_summaries"] = list(env_summaries)
+        captured["files"] = list(files or [])
+        return None
+
+    monkeypatch.setattr("medarc_verifiers.cli.process.pipeline.workspace.prepare_output_workspace", fake_prepare_output_workspace)
+    monkeypatch.setattr("medarc_verifiers.cli.process.pipeline.hf_sync.sync_to_hub", fake_sync_to_hub)
+
+    result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=output_dir,
+            dry_run=False,
+            max_workers=1,
+            hf_config=HFSyncConfig(repo_id="demo/repo"),
+            hf_pull_policy="continue-upload",
+        )
+    )
+
+    assert result.env_summaries == []
+    assert captured["env_summaries"] == []
+    assert captured["files"] == [pending_path]
+
+
+def test_run_process_continue_upload_unions_pending_and_current_touched_files(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    runs_dir = _write_run(tmp_path, run_id="run-1", updated_at="2024-01-01T00:00:00Z", reward=0.1, env_id="demo-env")
+    output_dir = tmp_path / "processed"
+    first_result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=output_dir,
+            dry_run=False,
+            max_workers=1,
+        )
+    )
+    current_path = first_result.env_summaries[0].output_path.relative_to(output_dir).as_posix()
+    pending_path = "stale-model/stale-env.parquet"
+    stale_path = output_dir / pending_path
+    stale_path.parent.mkdir(parents=True, exist_ok=True)
+    stale_path.write_text("stale", encoding="utf-8")
+    _write_run(tmp_path, run_id="run-2", updated_at="2024-01-02T00:00:00Z", reward=0.9, env_id="demo-env")
+
+    captured: dict[str, object] = {}
+
+    def fake_prepare_output_workspace(**_kwargs: object) -> workspace.WorkspacePreparationResult:
+        return workspace.WorkspacePreparationResult(
+            baseline_result=workspace.BaselineResult(
+                policy="continue-upload",
+                pending_parquet_uploads={pending_path},
+            )
+        )
+
+    def fake_sync_to_hub(
+        env_summaries,
+        config,
+        *,
+        output_dir,
+        metadata_paths=None,
+        files=None,
+        **_kwargs,
+    ):
+        captured["files"] = list(files or [])
+        return None
+
+    monkeypatch.setattr("medarc_verifiers.cli.process.pipeline.workspace.prepare_output_workspace", fake_prepare_output_workspace)
+    monkeypatch.setattr("medarc_verifiers.cli.process.pipeline.hf_sync.sync_to_hub", fake_sync_to_hub)
+
+    result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=output_dir,
+            dry_run=False,
+            max_workers=1,
+            hf_config=HFSyncConfig(repo_id="demo/repo"),
+            hf_pull_policy="continue-upload",
+        )
+    )
+
+    assert result.env_summaries
+    assert set(captured["files"]) == {pending_path, current_path, "dataset_infos.json", "env_index.json"}
+
+
+def test_process_replace_model_rebuilds_existing_output(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-1",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.1,
+        env_id="demo-env",
+        model_id="model-a",
+    )
+    _write_run(
+        tmp_path,
+        run_id="run-2",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.2,
+        env_id="demo-env",
+        model_id="model-b",
+    )
+    output_dir = tmp_path / "processed"
+
+    run_process(ProcessOptions(runs_dir=runs_dir, output_dir=output_dir, dry_run=False, max_workers=1))
+    _write_run(
+        tmp_path,
+        run_id="run-3",
+        updated_at="2024-01-03T00:00:00Z",
+        reward=0.9,
+        env_id="demo-env",
+        model_id="model-a",
+    )
+    _write_run(
+        tmp_path,
+        run_id="run-4",
+        updated_at="2024-01-03T00:00:00Z",
+        reward=0.8,
+        env_id="demo-env",
+        model_id="model-b",
+    )
+
+    result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=output_dir,
+            replace_models=("model-a",),
+            dry_run=False,
+            max_workers=1,
+        )
+    )
+
+    rebuilt = {summary.model_id for summary in result.env_summaries}
+    assert rebuilt == {"model-a", "model-b"}
+    model_a_table = pq.read_table(output_dir / "model-a" / "demo-env.parquet")
+    model_b_table = pq.read_table(output_dir / "model-b" / "demo-env.parquet")
+    assert model_a_table.column("reward").to_pylist() == [0.9]
+    assert model_b_table.column("reward").to_pylist() == [0.8]
+
+
+def test_process_replace_model_and_env_rebuild_only_intersection(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-1",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.1,
+        env_id="env-a",
+        model_id="model-a",
+    )
+    _write_run(
+        tmp_path,
+        run_id="run-2",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.2,
+        env_id="env-b",
+        model_id="model-a",
+    )
+    _write_run(
+        tmp_path,
+        run_id="run-3",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.3,
+        env_id="env-a",
+        model_id="model-b",
+    )
+    output_dir = tmp_path / "processed"
+    run_process(ProcessOptions(runs_dir=runs_dir, output_dir=output_dir, dry_run=False, max_workers=1))
+
+    _write_run(
+        tmp_path,
+        run_id="run-4",
+        updated_at="2024-01-03T00:00:00Z",
+        reward=0.7,
+        env_id="env-a",
+        model_id="model-a",
+    )
+    _write_run(
+        tmp_path,
+        run_id="run-5",
+        updated_at="2024-01-03T00:00:00Z",
+        reward=0.8,
+        env_id="env-b",
+        model_id="model-a",
+    )
+    _write_run(
+        tmp_path,
+        run_id="run-6",
+        updated_at="2024-01-03T00:00:00Z",
+        reward=0.9,
+        env_id="env-a",
+        model_id="model-b",
+    )
+
+    result = run_process(
+        ProcessOptions(
+            runs_dir=runs_dir,
+            output_dir=output_dir,
+            replace_models=("model-a",),
+            replace_envs=("env-a",),
+            dry_run=False,
+            max_workers=1,
+        )
+    )
+
+    assert {(summary.model_id, summary.env_id) for summary in result.env_summaries} == {
+        ("model-a", "env-a"),
+        ("model-a", "env-b"),
+        ("model-b", "env-a"),
+    }
+    assert pq.read_table(output_dir / "model-a" / "env-a.parquet").column("reward").to_pylist() == [0.7]
+    assert pq.read_table(output_dir / "model-a" / "env-b.parquet").column("reward").to_pylist() == [0.8]
+    assert pq.read_table(output_dir / "model-b" / "env-a.parquet").column("reward").to_pylist() == [0.9]
+
+
+def test_process_fails_fast_on_existing_row_count_mismatch(tmp_path: Path) -> None:
+    runs_dir = _setup_run(tmp_path)
+    output_dir = tmp_path / "processed"
+    result = run_process(ProcessOptions(runs_dir=runs_dir, output_dir=output_dir, dry_run=False, max_workers=1))
+    summary = result.env_summaries[0]
+    rel_path = summary.output_path.relative_to(output_dir).as_posix()
+    payload = json.loads((output_dir / "env_index.json").read_text(encoding="utf-8"))
+    payload["files"][rel_path]["row_count"] = summary.row_count + 1
+    (output_dir / "env_index.json").write_text(json.dumps(payload), encoding="utf-8")
+
+    with pytest.raises(RuntimeError, match="env_index.json records"):
+        run_process(ProcessOptions(runs_dir=runs_dir, output_dir=output_dir, dry_run=False, max_workers=1))
+
+
+def test_process_ignores_invalid_superseded_run(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-1",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.1,
+        results_text='{"example_id": ',
+    )
+    _write_run(tmp_path, run_id="run-2", updated_at="2024-01-02T00:00:00Z", reward=0.9)
+    output_dir = tmp_path / "processed"
+
+    result = run_process(ProcessOptions(runs_dir=runs_dir, output_dir=output_dir, dry_run=False, max_workers=1))
+
+    assert result.env_summaries
+    table = pq.read_table(result.env_summaries[0].output_path)
+    assert table.column("reward").to_pylist() == [0.9]
+
+
+def test_process_ignores_superseded_run_missing_model_id(tmp_path: Path) -> None:
+    runs_dir = _write_run(tmp_path, run_id="run-1", updated_at="2024-01-01T00:00:00Z", reward=0.1)
+    _remove_model_id(tmp_path, "run-1")
+    _write_run(tmp_path, run_id="run-2", updated_at="2024-01-02T00:00:00Z", reward=0.9)
+
+    result = run_process(
+        ProcessOptions(runs_dir=runs_dir, output_dir=tmp_path / "processed", dry_run=False, max_workers=1)
+    )
+
+    table = pq.read_table(result.env_summaries[0].output_path)
+    assert table.column("reward").to_pylist() == [0.9]
+
+
+def test_process_latest_missing_model_id_fails_clearly(tmp_path: Path) -> None:
+    runs_dir = _write_run(tmp_path, run_id="run-1", updated_at="2024-01-01T00:00:00Z", reward=0.1)
+    _write_run(tmp_path, run_id="run-2", updated_at="2024-01-02T00:00:00Z", reward=0.9)
+    _remove_model_id(tmp_path, "run-2")
+
+    with pytest.raises(RuntimeError, match=r"Missing model_id for run \(job_run_id=run-2, job_id=demo-job,"):
+        run_process(ProcessOptions(runs_dir=runs_dir, output_dir=tmp_path / "processed", dry_run=False, max_workers=1))
+
+
+def test_process_latest_missing_model_id_not_masked_by_newer_other_job(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-model-a-old",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.1,
+        model_id="model-a",
+        job_id="job-model-a",
+    )
+    _write_run(
+        tmp_path,
+        run_id="run-model-a-bad",
+        updated_at="2024-01-02T00:00:00Z",
+        reward=0.2,
+        model_id="model-a",
+        job_id="job-model-a",
+    )
+    _remove_model_id(tmp_path, "run-model-a-bad")
+    _write_run(
+        tmp_path,
+        run_id="run-model-b-good",
+        updated_at="2024-01-03T00:00:00Z",
+        reward=0.9,
+        model_id="model-b",
+        job_id="job-model-b",
+    )
+
+    with pytest.raises(RuntimeError, match=r"Missing model_id for run \(job_run_id=run-model-a-bad, job_id=job-model-a,"):
+        run_process(ProcessOptions(runs_dir=runs_dir, output_dir=tmp_path / "processed", dry_run=False, max_workers=1))
+
+
+def test_process_ignores_invalid_incomplete_run_by_default(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-1",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.1,
+        status="running",
+        results_text='{"example_id": ',
+    )
+    _write_run(tmp_path, run_id="run-2", updated_at="2024-01-02T00:00:00Z", reward=0.9, env_id="other-env")
+    output_dir = tmp_path / "processed"
+
+    result = run_process(ProcessOptions(runs_dir=runs_dir, output_dir=output_dir, dry_run=False, max_workers=1))
+
+    assert {summary.env_id for summary in result.env_summaries} == {"other-env"}
+
+
+def test_process_selected_invalid_results_still_fail(tmp_path: Path) -> None:
+    runs_dir = _write_run(
+        tmp_path,
+        run_id="run-1",
+        updated_at="2024-01-01T00:00:00Z",
+        reward=0.1,
+        results_text='{"example_id": ',
+    )
+
+    with pytest.raises(ValueError, match="Failed to parse JSONL line 1"):
+        run_process(ProcessOptions(runs_dir=runs_dir, output_dir=tmp_path / "processed", dry_run=False, max_workers=1))
+
+
+def test_process_selected_missing_results_still_fail(tmp_path: Path) -> None:
+    runs_dir = _setup_run(tmp_path)
+    missing_results = runs_dir / "run-1" / "demo-job" / "results.jsonl"
+    missing_results.unlink()
+
+    with pytest.raises(RuntimeError, match="Selected records are missing results.jsonl files:"):
+        run_process(ProcessOptions(runs_dir=runs_dir, output_dir=tmp_path / "processed", dry_run=False, max_workers=1))
+
 
 def test_process_clean_clears_outputs(tmp_path: Path) -> None:
     runs_dir = _setup_run(tmp_path)
@@ -563,6 +1391,63 @@ def test_process_clean_clears_outputs(tmp_path: Path) -> None:
     assert (output_dir / "env_index.json").exists()
 
 
+def test_run_process_reads_local_index_after_workspace_prep(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    runs_dir = _setup_run(tmp_path)
+    output_dir = tmp_path / "processed"
+    observed: list[str] = []
+
+    def fake_prepare_output_workspace(**kwargs):
+        observed.append("workspace")
+        model_dir = kwargs["output_dir"] / "gpt-mini"
+        model_dir.mkdir(parents=True, exist_ok=True)
+        existing_path = model_dir / "demo-env.parquet"
+        existing_path.write_text("placeholder", encoding="utf-8")
+        (kwargs["output_dir"] / "env_index.json").write_text(
+            json.dumps(
+                {
+                    "version": 2,
+                    "files": {
+                        "gpt-mini/demo-env.parquet": {
+                            "env_id": "demo-env",
+                            "model_id": "gpt-mini",
+                        }
+                    },
+                }
+            ),
+            encoding="utf-8",
+        )
+
+    def fake_read_env_index_files(processed_dir: Path):
+        observed.append("index")
+        assert observed == ["workspace", "index"]
+        return {"gpt-mini/demo-env.parquet": {"env_id": "demo-env", "model_id": "gpt-mini"}}
+
+    monkeypatch.setattr(
+        "medarc_verifiers.cli.process.workspace.prepare_output_workspace", fake_prepare_output_workspace
+    )
+    monkeypatch.setattr("medarc_verifiers.cli.process.env_index.read_env_index_files", fake_read_env_index_files)
+    monkeypatch.setattr(
+        "medarc_verifiers.cli.process.pipeline._read_existing_output_metadata",
+        lambda *_args, **_kwargs: object(),
+    )
+    monkeypatch.setattr(
+        "medarc_verifiers.cli.process.pipeline._validate_existing_output_integrity",
+        lambda *_args, **_kwargs: None,
+    )
+    monkeypatch.setattr(
+        "medarc_verifiers.cli.process.pipeline._existing_output_matches_selected_runs",
+        lambda *_args, **_kwargs: True,
+    )
+
+    result = run_process(ProcessOptions(runs_dir=runs_dir, output_dir=output_dir, dry_run=False, max_workers=1))
+
+    assert observed == ["workspace", "index"]
+    assert result.env_summaries == []
+
+
 def test_run_process_ignores_legacy_run_output_path(tmp_path: Path) -> None:
     runs_dir = _setup_run(tmp_path)
     run_dir = runs_dir / "run-1"
diff --git a/tests/test_cli/test_process_rows.py b/tests/test_cli/test_process_rows.py
index 1a5c8efc..11cfdb86 100644
--- a/tests/test_cli/test_process_rows.py
+++ b/tests/test_cli/test_process_rows.py
@@ -52,8 +52,10 @@ def _build_record(tmp_path: Path, *, status: str = "completed", reason: str | No
         reason=reason,
         started_at="2024-05-01T00:00:30Z",
         ended_at="2024-05-01T00:00:42Z",
+        avg_reward=0.5,
         num_examples=10,
         rollouts_per_example=1,
+        row_count=1,
         env_args={"split": "dev", "extra_body": {}},
         sampling_args={"temperature": 0.2},
         env_config={},
diff --git a/tests/test_cli/test_process_winrate.py b/tests/test_cli/test_process_winrate.py
index 3b885e69..a29e8bed 100644
--- a/tests/test_cli/test_process_winrate.py
+++ b/tests/test_cli/test_process_winrate.py
@@ -360,6 +360,48 @@ def test_partial_datasets_include_uses_consistent_canonical_labels(tmp_path: Pat
     assert payload["models"]["Model_A"]["vs"]["Model_B"]["n_datasets"] == 2
 
 
+def test_compute_dataset_missingness_counts_null_rewards() -> None:
+    df_avg = pl.DataFrame(
+        {
+            "example_id": ["q1", "q2", "q3", "q1", "q2", "q3"],
+            "model_id": ["model_a", "model_a", "model_a", "model_b", "model_b", "model_b"],
+            "reward_mean": [1.0, 0.5, 0.0, 0.8, None, 0.2],
+        }
+    )
+
+    rows = winrate_api._compute_dataset_missingness("dataset", df_avg, ["model_a", "model_b"])
+    by_model = {row.model: row for row in rows}
+
+    assert by_model["model_a"].expected_n == 3
+    assert by_model["model_a"].present_nonnull_n == 3
+    assert by_model["model_a"].missing_count == 0
+    assert by_model["model_a"].missing_pct == pytest.approx(0.0)
+    assert by_model["model_b"].expected_n == 3
+    assert by_model["model_b"].present_nonnull_n == 2
+    assert by_model["model_b"].missing_count == 1
+    assert by_model["model_b"].missing_pct == pytest.approx(100 / 3)
+
+
+def test_compute_dataset_missingness_marks_absent_included_model_fully_missing() -> None:
+    df_avg = pl.DataFrame(
+        {
+            "example_id": ["q1", "q2"],
+            "model_id": ["model_a", "model_a"],
+            "reward_mean": [1.0, 0.5],
+        }
+    )
+
+    rows = winrate_api._compute_dataset_missingness("dataset", df_avg, ["model_a", "model_b"])
+    by_model = {row.model: row for row in rows}
+
+    assert by_model["model_a"].missing_count == 0
+    assert by_model["model_a"].missing_pct == pytest.approx(0.0)
+    assert by_model["model_b"].expected_n == 2
+    assert by_model["model_b"].present_nonnull_n == 0
+    assert by_model["model_b"].missing_count == 2
+    assert by_model["model_b"].missing_pct == pytest.approx(100.0)
+
+
 def test_filter_models_is_case_insensitive() -> None:
     filtered = winrate_api._filter_models(
         ["Model_A", "Model_B", "Model_C"],
diff --git a/tests/test_cli/test_process_workspace.py b/tests/test_cli/test_process_workspace.py
index fa1444a7..d8df1af5 100644
--- a/tests/test_cli/test_process_workspace.py
+++ b/tests/test_cli/test_process_workspace.py
@@ -50,6 +50,69 @@ def _fake_download_hf_repo(**_kwargs) -> Path:
     assert copied in result.files_copied
 
 
+def test_prepare_output_workspace_clean_skips_hf_baseline(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    output_dir = tmp_path / "output"
+    output_dir.mkdir()
+    sentinel = output_dir / "stale.txt"
+    sentinel.write_text("stale", encoding="utf-8")
+
+    def _fail_prepare_hf_baseline(**_kwargs) -> workspace.BaselineResult:
+        raise AssertionError("prepare_hf_baseline should not run when clean=True")
+
+    monkeypatch.setattr(workspace, "prepare_hf_baseline", _fail_prepare_hf_baseline)
+
+    result = workspace.prepare_output_workspace(
+        output_dir=output_dir,
+        hf_config=HFSyncConfig(repo_id="demo/repo"),
+        pull_policy="pull",
+        clean=True,
+        assume_yes=True,
+        is_tty=False,
+        prompt_func=None,
+    )
+
+    assert result.cleaned is True
+    assert result.baseline_result is None
+    assert not sentinel.exists()
+
+
+def test_prepare_output_workspace_runs_hf_baseline_before_local_reads(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    output_dir = tmp_path / "output"
+    snapshot_dir = tmp_path / "snapshot"
+    snapshot_dir.mkdir()
+    parquet_path = _write_snapshot(snapshot_dir)
+
+    def _fake_prepare_hf_baseline(**_kwargs) -> workspace.BaselineResult:
+        copied = output_dir / parquet_path.relative_to(snapshot_dir)
+        copied.parent.mkdir(parents=True, exist_ok=True)
+        copied.write_text(parquet_path.read_text(encoding="utf-8"), encoding="utf-8")
+        (output_dir / "env_index.json").write_text((snapshot_dir / "env_index.json").read_text(encoding="utf-8"))
+        return workspace.BaselineResult(policy="pull", files_copied=[copied], snapshot_dir=snapshot_dir)
+
+    monkeypatch.setattr(workspace, "prepare_hf_baseline", _fake_prepare_hf_baseline)
+
+    result = workspace.prepare_output_workspace(
+        output_dir=output_dir,
+        hf_config=HFSyncConfig(repo_id="demo/repo"),
+        pull_policy="pull",
+        clean=False,
+        assume_yes=False,
+        is_tty=False,
+        prompt_func=None,
+    )
+
+    assert result.cleaned is False
+    assert result.baseline_result is not None
+    assert (output_dir / "env_index.json").exists()
+    assert (output_dir / "model-a" / "env-a.parquet").exists()
+
+
 def test_prepare_hf_baseline_pull_keeps_unrelated_local(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
     snapshot_dir = tmp_path / "snapshot"
     snapshot_dir.mkdir()
@@ -113,6 +176,7 @@ def _fake_download_hf_repo(**_kwargs) -> Path:
         return snapshot_dir
 
     monkeypatch.setattr(workspace, "download_hf_repo", _fake_download_hf_repo)
+    monkeypatch.setattr(workspace, "compute_pending_parquet_uploads", lambda **_kwargs: set())
     hf_config = HFSyncConfig(repo_id="demo/repo")
     output_dir = tmp_path / "output"
     output_dir.mkdir()
@@ -136,6 +200,190 @@ def _prompt(_message: str) -> str:
     assert local_path.read_text(encoding="utf-8") == "remote"
 
 
+def test_prepare_hf_baseline_prompt_offers_upload_when_pending_exists(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    output_dir = tmp_path / "output"
+    output_dir.mkdir()
+    _write_snapshot(output_dir, content="local")
+
+    prompts: list[str] = []
+
+    def _prompt(message: str) -> str:
+        prompts.append(message)
+        return "upload"
+
+    def _fail_download(**_kwargs) -> Path:
+        raise AssertionError("download_hf_repo should not be called for upload recovery")
+
+    monkeypatch.setattr(workspace, "download_hf_repo", _fail_download)
+    monkeypatch.setattr(
+        workspace,
+        "compute_pending_parquet_uploads",
+        lambda **_kwargs: {"model-a/env-a.parquet"},
+    )
+
+    result = workspace.prepare_hf_baseline(
+        output_dir=output_dir,
+        hf_config=HFSyncConfig(repo_id="demo/repo"),
+        pull_policy="prompt",
+        is_tty=True,
+        prompt_func=_prompt,
+    )
+
+    assert result.policy == "continue-upload"
+    assert result.pending_parquet_uploads == {"model-a/env-a.parquet"}
+    assert prompts and "upload" in prompts[0]
+
+
+def test_prepare_hf_baseline_continue_upload_skips_download(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    output_dir = tmp_path / "output"
+    output_dir.mkdir()
+    _write_snapshot(output_dir, content="local")
+
+    def _fail_download(**_kwargs) -> Path:
+        raise AssertionError("download_hf_repo should not be called for continue-upload")
+
+    monkeypatch.setattr(workspace, "download_hf_repo", _fail_download)
+    monkeypatch.setattr(
+        workspace,
+        "compute_pending_parquet_uploads",
+        lambda **_kwargs: {"model-a/env-a.parquet"},
+    )
+
+    result = workspace.prepare_hf_baseline(
+        output_dir=output_dir,
+        hf_config=HFSyncConfig(repo_id="demo/repo"),
+        pull_policy="continue-upload",
+        is_tty=False,
+        prompt_func=None,
+    )
+
+    assert result.policy == "continue-upload"
+    assert result.pending_parquet_uploads == {"model-a/env-a.parquet"}
+
+
+def test_prepare_hf_baseline_prompt_hides_upload_when_recovery_check_fails(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+    caplog: pytest.LogCaptureFixture,
+) -> None:
+    output_dir = tmp_path / "output"
+    output_dir.mkdir()
+    _write_snapshot(output_dir, content="local")
+
+    prompts: list[str] = []
+
+    def _prompt(message: str) -> str:
+        prompts.append(message)
+        return "pull"
+
+    monkeypatch.setattr(
+        workspace,
+        "compute_pending_parquet_uploads",
+        lambda **_kwargs: (_ for _ in ()).throw(RuntimeError("hf down")),
+    )
+    monkeypatch.setattr(workspace, "download_hf_repo", lambda **_kwargs: tmp_path / "snapshot")
+
+    with caplog.at_level("WARNING"):
+        result = workspace.prepare_hf_baseline(
+            output_dir=output_dir,
+            hf_config=HFSyncConfig(repo_id="demo/repo"),
+            pull_policy="prompt",
+            is_tty=True,
+            prompt_func=_prompt,
+        )
+
+    assert result.policy == "pull"
+    assert result.pending_parquet_uploads == set()
+    assert prompts and "upload" not in prompts[0]
+    assert "HF upload recovery check failed before prompt" in caplog.text
+
+
+def test_prepare_hf_baseline_continue_upload_empty_dir_warns_and_pulls(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+    caplog: pytest.LogCaptureFixture,
+) -> None:
+    snapshot_dir = tmp_path / "snapshot"
+    snapshot_dir.mkdir()
+    _write_snapshot(snapshot_dir, content="remote")
+    monkeypatch.setattr(workspace, "download_hf_repo", lambda **_kwargs: snapshot_dir)
+
+    with caplog.at_level("WARNING"):
+        result = workspace.prepare_hf_baseline(
+            output_dir=tmp_path / "output",
+            hf_config=HFSyncConfig(repo_id="demo/repo"),
+            pull_policy="continue-upload",
+            is_tty=False,
+            prompt_func=None,
+        )
+
+    assert result.policy == "pull"
+    assert "falling back to pull" in caplog.text
+
+
+def test_prepare_hf_baseline_continue_upload_degrades_when_recovery_check_fails(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+    caplog: pytest.LogCaptureFixture,
+) -> None:
+    output_dir = tmp_path / "output"
+    output_dir.mkdir()
+    _write_snapshot(output_dir, content="local")
+
+    monkeypatch.setattr(
+        workspace,
+        "compute_pending_parquet_uploads",
+        lambda **_kwargs: (_ for _ in ()).throw(RuntimeError("hf down")),
+    )
+
+    with caplog.at_level("WARNING"):
+        result = workspace.prepare_hf_baseline(
+            output_dir=output_dir,
+            hf_config=HFSyncConfig(repo_id="demo/repo"),
+            pull_policy="continue-upload",
+            is_tty=False,
+            prompt_func=None,
+        )
+
+    assert result.policy == "continue-upload"
+    assert result.pending_parquet_uploads == set()
+    assert "uploading only current touched files" in caplog.text
+
+
+@pytest.mark.parametrize("exc_type", [EOFError, KeyboardInterrupt])
+def test_prepare_hf_baseline_prompt_aborts_cleanly(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+    exc_type: type[BaseException],
+) -> None:
+    output_dir = tmp_path / "output"
+    output_dir.mkdir()
+    _write_snapshot(output_dir, content="local")
+    monkeypatch.setattr(
+        workspace,
+        "compute_pending_parquet_uploads",
+        lambda **_kwargs: {"model-a/env-a.parquet"},
+    )
+
+    def _prompt(_message: str) -> str:
+        raise exc_type
+
+    with pytest.raises(RuntimeError, match="Aborted HF baseline selection."):
+        workspace.prepare_hf_baseline(
+            output_dir=output_dir,
+            hf_config=HFSyncConfig(repo_id="demo/repo"),
+            pull_policy="prompt",
+            is_tty=True,
+            prompt_func=_prompt,
+        )
+
+
 def test_prepare_hf_baseline_pull_skips_when_local_baseline_present(
     monkeypatch: pytest.MonkeyPatch,
     tmp_path: Path,
diff --git a/tests/test_mcq_accuracy.py b/tests/test_mcq_accuracy.py
index 209e7ece..5365ddc4 100644
--- a/tests/test_mcq_accuracy.py
+++ b/tests/test_mcq_accuracy.py
@@ -1,8 +1,14 @@
 """Tests for the simplified MCQ accuracy grader."""
 
+import time
+
 import pytest
 
-from medarc_verifiers.rewards.multiple_choice_accuracy import MCQAccuracyResult, multiple_choice_accuracy
+from medarc_verifiers.rewards.multiple_choice_accuracy import (
+    MCQAccuracyResult,
+    _contains_multiple_option_led_sentences,
+    multiple_choice_accuracy,
+)
 
 
 def test_anchored_final_answer_colon():
@@ -35,6 +41,18 @@ def test_anchored_numeric():
     assert multiple_choice_accuracy("The answer is 3", answer_letter="3", answer_text="Third option")
 
 
+@pytest.mark.parametrize(
+    ("response", "answer_letter"),
+    [
+        ("The answer is option A", "A"),
+        ("Final answer: choice (B)", "B"),
+        ("Option 2", "2"),
+    ],
+)
+def test_option_word_forms_are_parsed(response: str, answer_letter: str):
+    assert multiple_choice_accuracy(response, answer_letter=answer_letter, answer_text="Option")
+
+
 def test_last_token_single_letter_at_end():
     assert multiple_choice_accuracy("I think it's C", answer_letter="C", answer_text="Correct option")
 
@@ -43,6 +61,23 @@ def test_last_token_with_period():
     assert multiple_choice_accuracy("My selection is B.", answer_letter="B", answer_text="Some text")
 
 
+@pytest.mark.parametrize(
+    ("response", "answer_letter"),
+    [
+        ("My selection is [C]", "C"),
+        ("My selection is C]", "C"),
+        ("My selection is C)", "C"),
+        ("My selection is (C)", "C"),
+        ("My selection is [2]", "2"),
+        ("My selection is 2]", "2"),
+        ("My selection is 2)", "2"),
+        ("My selection is (2)", "2"),
+    ],
+)
+def test_last_token_bracket_like_variants(response: str, answer_letter: str):
+    assert multiple_choice_accuracy(response, answer_letter=answer_letter, answer_text="Option")
+
+
 def test_last_token_multiple_letters_takes_last():
     # A and B appear in reasoning, D is the final answer
     assert multiple_choice_accuracy("A is wrong. B seems unlikely. D", answer_letter="D", answer_text="Final option")
@@ -60,6 +95,35 @@ def test_last_token_wrong():
     assert not multiple_choice_accuracy("My answer is A", answer_letter="B", answer_text="Correct")
 
 
+@pytest.mark.parametrize(
+    "response",
+    [
+        "A, C",
+        "A; C",
+        "A: C",
+        "A. C",
+        "A) C",
+        r"A,\ C,\ D",
+        "B, C, E",
+        "D / G / J",
+        "(A), (C)",
+        "[B], [C], [E]",
+        "A or C",
+        "B and E",
+        "B, and E",
+        "B, & D",
+        "both A and C",
+        "A & C",
+        "A + C",
+        "A y C",
+        "A e D",
+        "A ou C",
+    ],
+)
+def test_last_token_rejects_compact_multi_option_lists(response: str):
+    assert not multiple_choice_accuracy(response, answer_letter="C", answer_text="Option", accept_answer_text=False)
+
+
 def test_last_token_disabled_when_explicit_anchor_exists_even_if_wrong():
     # Regression: do NOT allow last_token to override an explicit (wrong) anchored choice.
     response = (
@@ -82,6 +146,14 @@ def test_answer_text_exact_match():
     )
 
 
+def test_answer_text_exact_match_allows_numeric_boxed_content():
+    assert multiple_choice_accuracy(
+        r"\boxed{4}",
+        answer_letter="A",
+        answer_text="4.",
+    )
+
+
 def test_answer_text_in_sentence():
     assert multiple_choice_accuracy(
         "Based on the symptoms, acute myocardial infarction is most likely.",
@@ -90,6 +162,24 @@ def test_answer_text_in_sentence():
     )
 
 
+@pytest.mark.parametrize("response", ["All of the above", "The answer is all of the above."])
+def test_answer_text_all_of_the_above_is_not_rejected(response: str):
+    assert multiple_choice_accuracy(response, answer_letter="D", answer_text="All of the above")
+
+
+@pytest.mark.parametrize("response", ["None of the above", "The answer is none of the above."])
+def test_answer_text_none_of_the_above_is_not_rejected(response: str):
+    assert multiple_choice_accuracy(response, answer_letter="E", answer_text="None of the above")
+
+
+def test_multi_answer_tail_does_not_count_as_all_of_the_above():
+    assert not multiple_choice_accuracy("A and B", answer_letter="D", answer_text="All of the above")
+
+
+def test_all_of_the_above_does_not_match_plain_option_text():
+    assert not multiple_choice_accuracy("All of the above", answer_letter="C", answer_text="acute appendicitis")
+
+
 def test_answer_text_case_insensitive():
     assert multiple_choice_accuracy(
         "The diagnosis is DIABETES MELLITUS TYPE 2", answer_letter="D", answer_text="Diabetes Mellitus Type 2"
@@ -107,6 +197,31 @@ def test_answer_text_substring_not_matched():
     assert not multiple_choice_accuracy("Patient has tension headaches", answer_letter="A", answer_text="hypertension")
 
 
+def test_answer_text_fallback_rejects_bulleted_option_elimination_lines():
+    response = (
+        "The most likely diagnosis is Kawasaki Disease (D).\n"
+        "Elimination of other options:\n"
+        "   - Measles: Measles typically presents differently.\n"
+        "   - Scarlet fever: also less likely."
+    )
+    result = multiple_choice_accuracy(response, answer_letter="A", answer_text="Measles.", return_details=True)
+    assert result.is_correct is False
+    assert result.method == "none"
+
+
+def test_answer_text_fallback_rejects_bulleted_other_options_lines():
+    response = (
+        "The safest and fastest airway is cricothyrotomy (Choice A).\n"
+        "Other options:\n"
+        "   - Emergency tracheostomy - more time-consuming in an unstable patient."
+    )
+    result = multiple_choice_accuracy(
+        response, answer_letter="D", answer_text="Emergency tracheostomy", return_details=True
+    )
+    assert result.is_correct is False
+    assert result.method == "none"
+
+
 def test_normalization_extra_whitespace():
     assert multiple_choice_accuracy("Final   answer:    C  ", answer_letter="C", answer_text="Option C")
 
@@ -154,17 +269,6 @@ def test_return_details_last_token():
     assert result.correct_answer == "B"
 
 
-def test_return_details_answer_text():
-    result = multiple_choice_accuracy(
-        "The patient has acute appendicitis", answer_letter="D", answer_text="acute appendicitis", return_details=True
-    )
-    assert isinstance(result, MCQAccuracyResult)
-    assert result.is_correct is True
-    assert result.method == "answer_text"
-    assert result.matched_answer == "the patient has acute appendicitis"
-    assert result.correct_answer == "acute appendicitis"
-
-
 def test_return_details_no_match():
     result = multiple_choice_accuracy("I don't know", answer_letter="C", answer_text="Option C", return_details=True)
     assert isinstance(result, MCQAccuracyResult)
@@ -221,6 +325,16 @@ def test_unpaired_think_close_with_spurious_match():
     assert multiple_choice_accuracy(response, answer_letter="A", answer_text="Option A")
 
 
+def test_multiple_think_blocks_use_last_close():
+    response = "<think>first</think> draft <think>second</think>\n\nFinal answer: B"
+    assert multiple_choice_accuracy(response, answer_letter="B", answer_text="Option B")
+
+
+def test_unclosed_think_open_returns_empty():
+    response = "<think>reasoning only Final answer: C"
+    assert not multiple_choice_accuracy(response, answer_letter="C", answer_text="Option C")
+
+
 def test_cot_prevents_early_letter_matching():
     # Should not match A or B from the reasoning
     cot_response = """
@@ -248,6 +362,15 @@ def test_edge_case_letter_in_medical_term():
     )
 
 
+def test_answer_text_match_ignores_terminal_period_difference():
+    response = "Final answer: Furosemide-responsive cardiogenic pulmonary edema"
+    assert multiple_choice_accuracy(
+        response,
+        answer_letter="B",
+        answer_text="Furosemide-responsive cardiogenic pulmonary edema.",
+    )
+
+
 def test_edge_case_hemoglobin_a1c():
     # "A" in "A1c" should not match
     assert multiple_choice_accuracy("HbA1c is elevated. The answer is B", answer_letter="B", answer_text="Option B")
@@ -353,10 +476,19 @@ def test_leading_option_with_no_and_punctuation_should_pass():
     assert multiple_choice_accuracy("B) No.", answer_letter="B", answer_text="No")
 
 
-def test_last_token_negation_same_sentence_blocks():
-    # No anchor phrase, so it falls to last_token.
-    # Because "Not" is in the same sentence, the final "C" should be blocked.
-    assert not multiple_choice_accuracy("Not C, wait, C", answer_letter="C", answer_text="Option C")
+@pytest.mark.parametrize(
+    ("response", "answer_text"),
+    [
+        ("A (Nadolol)", "Nadolol"),
+        ("A - Nadolol", "Nadolol"),
+        ("A – Nadolol", "Nadolol"),
+    ],
+)
+def test_leading_option_with_parenthetical_or_dash_answer_text(response: str, answer_text: str):
+    result = multiple_choice_accuracy(response, answer_letter="A", answer_text=answer_text, return_details=True)
+    assert result.is_correct is True
+    assert result.method == "anchored_token"
+    assert result.matched_answer == "A"
 
 
 def test_last_token_negation_previous_sentence_does_not_block():
@@ -368,8 +500,143 @@ def test_last_token_isnt_previous_sentence_does_not_block():
     assert multiple_choice_accuracy("It isn't C. C", answer_letter="C", answer_text="Option C")
 
 
-def test_last_token_isnt_same_sentence_blocks():
-    assert not multiple_choice_accuracy("It isn't C, but maybe C", answer_letter="C", answer_text="Option C")
+def test_answer_text_rather_than_prefix_blocks():
+    response = "The diagnosis is viral rather than bacterial pneumonia."
+    assert not multiple_choice_accuracy(response, answer_letter="B", answer_text="bacterial pneumonia")
+
+
+def test_answer_text_wrong_prefix_blocks():
+    response = "The wrong diagnosis is bacterial pneumonia."
+    assert not multiple_choice_accuracy(response, answer_letter="B", answer_text="bacterial pneumonia")
+
+
+def test_anchored_token_contradicted_by_later_option_blocks():
+    response = "Answer: C, but D is correct."
+    assert not multiple_choice_accuracy(response, answer_letter="C", answer_text="Option C")
+
+
+def test_anchored_token_instead_correction_blocks():
+    response = "Answer: C, instead D is correct."
+    assert not multiple_choice_accuracy(response, answer_letter="C", answer_text="Option C")
+
+
+def test_anchored_token_instead_of_preference_does_not_block():
+    response = "Answer: C instead of D."
+    assert multiple_choice_accuracy(response, answer_letter="C", answer_text="Option C")
+    assert not multiple_choice_accuracy(response, answer_letter="D", answer_text="Option D")
+
+
+def test_multi_answer_anchors_elsewhere_do_not_poison_final_anchor():
+    response = "Option A and Option C were considered earlier. Final answer: B"
+    result = multiple_choice_accuracy(response, answer_letter="B", answer_text="Option B", return_details=True)
+    assert result.is_correct is True
+    assert result.method == "anchored_token"
+    assert result.matched_answer == "B"
+
+
+@pytest.mark.parametrize(
+    "response",
+    [
+        (
+            "Option (A) and Option (I) are both correct statements concerning feeding for this patient, but "
+            "since the prompt asks for a singular choice that is true, the most directly relevant and universally "
+            "accepted principle would be (A) Enteral nutrition may decrease infection due to the prevention of "
+            "bacterial translocation, highlighting a key benefit of enteral feeding in acute pancreatitis management."
+        ),
+        (
+            "Answer: option A and option I are both correct. If I must pick one, I would lean toward A because "
+            "enteral nutrition may decrease infection due to the prevention of bacterial translocation."
+        ),
+        (
+            "Option A as well as option I are valid here. The better-supported statement is A: "
+            "Enteral nutrition may decrease infection due to the prevention of bacterial translocation."
+        ),
+        (
+            "Choice A or choice I could both be defended. The most directly relevant principle would be A "
+            "(Enteral nutrition may decrease infection due to the prevention of bacterial translocation)."
+        ),
+        (
+            "Selected options: A and I. Since only one answer is requested, I would prefer A - "
+            "Enteral nutrition may decrease infection due to the prevention of bacterial translocation."
+        ),
+        (
+            "Option (A), together with option (I), is correct for feeding in severe acute pancreatitis; "
+            "among them, (A) Enteral nutrition may decrease infection due to the prevention of bacterial "
+            "translocation is the most important principle."
+        ),
+    ],
+)
+def test_answer_text_fallback_allows_disambiguated_multi_candidate_payloads(response: str):
+    result_a = multiple_choice_accuracy(
+        response,
+        answer_letter="A",
+        answer_text="Enteral nutrition may decrease infection due to the prevention of bacterial translocation.",
+        accept_answer_text=True,
+        return_details=True,
+    )
+    assert result_a.is_correct is True
+    assert result_a.method in {"answer_text", "anchored_token"}
+
+    result_i = multiple_choice_accuracy(
+        response,
+        answer_letter="I",
+        answer_text="Feeding should begin within 24-48 hours.",
+        accept_answer_text=True,
+        return_details=True,
+    )
+    assert result_i.is_correct is False
+    assert result_i.method in {"none", "anchored_token"}
+
+
+@pytest.mark.parametrize(
+    "response",
+    [
+        "(A) Naloxone is a synthetic N-allyl derivative of oxymorphone. (D) Naloxone is not rapidly absorbed after oral administration.",
+        "A. Naloxone is a synthetic N-allyl derivative of oxymorphone.\nD. Naloxone is not rapidly absorbed after oral administration.",
+        "(A) First statement. (C) Second statement.",
+    ],
+)
+def test_answer_text_fallback_rejects_multiple_option_led_sentences(response: str):
+    result_a = multiple_choice_accuracy(
+        response,
+        answer_letter="A",
+        answer_text="Naloxone is a synthetic N-allyl derivative of oxymorphone.",
+        accept_answer_text=True,
+        return_details=True,
+    )
+    assert result_a.is_correct is False
+    assert result_a.method == "none"
+
+
+def test_answer_text_fallback_allows_option_led_sentences_after_prose_preface():
+    response = (
+        "Let me compare the statements before picking one.\n"
+        "A. Naloxone is a synthetic N-allyl derivative of oxymorphone.\n"
+        "D. Naloxone is not rapidly absorbed after oral administration.\n"
+        "The correct statement is Naloxone is a synthetic N-allyl derivative of oxymorphone."
+    )
+    result = multiple_choice_accuracy(
+        response,
+        answer_letter="A",
+        answer_text="Naloxone is a synthetic N-allyl derivative of oxymorphone.",
+        accept_answer_text=True,
+        return_details=True,
+    )
+    assert result.is_correct is True
+    assert result.method == "answer_text"
+
+
+def test_multiple_option_led_sentence_scan_handles_large_payload_linearly():
+    response = ("Reasoning sentence with details. " * 12000) + "Final answer: C"
+    started = time.perf_counter()
+    assert _contains_multiple_option_led_sentences(response, answer_letter="C") is False
+    elapsed = time.perf_counter() - started
+    assert elapsed < 1.0
+
+
+def test_large_reasoning_payload_still_accepts_final_answer():
+    response = ("Reasoning sentence with details. " * 12000) + "Final answer: C"
+    assert multiple_choice_accuracy(response, answer_letter="C", answer_text="Option C")
 
 
 def test_answer_text_does_not_override_explicit_wrong_choice():
@@ -396,6 +663,15 @@ def test_answer_text_used_when_no_explicit_choice_letter_present():
     assert multiple_choice_accuracy(response, answer_letter="B", answer_text="poststreptocococcal glomerulonephritis")
 
 
+def test_answer_text_fallback_does_not_match_single_letter_article():
+    response = (
+        "The question asks which structure would most likely change with another infectious illness.\n"
+        "A is often the heart, B the diaphragm, C the aorta, and D a bony structure.\n"
+        "Thus the changing structure is E."
+    )
+    assert not multiple_choice_accuracy(response, answer_letter="A", answer_text="A", return_details=False)
+
+
 def test_negated_anchor_does_not_block_answer_text_fallback():
     response = "The answer is not C. The correct diagnosis is acute appendicitis."
     assert multiple_choice_accuracy(response, answer_letter="D", answer_text="acute appendicitis")
@@ -479,6 +755,18 @@ def test_block_prompt_then_option_on_next_line_parses_choice_letter():
     )
 
 
+def test_parenthesized_answer_text_does_not_fall_to_trailing_option_letter():
+    result = multiple_choice_accuracy(
+        "B (5-fluorouracil and mitomycin C)",
+        answer_letter="B",
+        answer_text="5-fluorouracil and mitomycin C",
+        return_details=True,
+    )
+    assert result.is_correct is True
+    assert result.method == "anchored_token"
+    assert result.matched_answer == "B"
+
+
 def test_anchor_phrase_with_markdown_wrapper_parses_choice_letter():
     response = "Answer: **(C)**"
     result = multiple_choice_accuracy(response, answer_letter="C", answer_text="Option C", return_details=True)
@@ -563,6 +851,18 @@ def test_answer_text_requires_exact_formatting_beyond_normalization(response, an
     assert not multiple_choice_accuracy(response, answer_letter="D", answer_text=answer_text, accept_answer_text=True)
 
 
+@pytest.mark.parametrize(
+    "response, answer_text",
+    [
+        ("<answer>Proliferation of surfactant‑secreting cells</answer>", "Proliferation of surfactant-secreting cells"),
+        ("<answer>Anti‑D IgG</answer>", "Anti-D IgG"),
+        ("<answer>Upslope of T‑wave</answer>", "Upslope of T-wave"),
+    ],
+)
+def test_answer_text_matches_unicode_dash_variants(response, answer_text):
+    assert multiple_choice_accuracy(response, answer_letter="D", answer_text=answer_text, accept_answer_text=True)
+
+
 def test_multiple_answers_last_explicit_anchor_wins():
     response = "Answer: B. After reconsideration, final answer: C"
     assert multiple_choice_accuracy(response, answer_letter="C", answer_text="Option C")
diff --git a/tests/test_process_writer_schema.py b/tests/test_process_writer_schema.py
index c38a18aa..b4607c5f 100644
--- a/tests/test_process_writer_schema.py
+++ b/tests/test_process_writer_schema.py
@@ -44,6 +44,7 @@ def test_process_writer_emits_stable_schema_with_all_null_values(tmp_path) -> No
     summaries = writer.write_env_groups([group], config, write_index=False)
     schema = pq.ParquetFile(summaries[0].output_path).schema_arrow
 
+    assert str(schema.field("example_id").type) == "large_string"
     assert str(schema.field("extras").type) == "large_string"
     assert str(schema.field("answer").type) == "large_string"
     assert str(schema.field("error").type) == "large_string"
@@ -64,6 +65,7 @@ def test_process_writer_emits_stable_schema_for_empty_groups(tmp_path) -> None:
     summaries = writer.write_env_groups([group], config, write_index=False)
     schema = pq.ParquetFile(summaries[0].output_path).schema_arrow
 
+    assert str(schema.field("example_id").type) == "large_string"
     assert str(schema.field("extras").type) == "large_string"
     assert str(schema.field("answer").type) == "large_string"
     assert str(schema.field("error").type) == "large_string"
diff --git a/tests/test_xml_parser.py b/tests/test_xml_parser.py
index 8bb8978a..dc74b036 100644
--- a/tests/test_xml_parser.py
+++ b/tests/test_xml_parser.py
@@ -28,6 +28,18 @@ def test_parse_string_handles_tags() -> None:
     assert parsed.think == "inner"
 
 
+def test_parse_answer_uses_last_tag_in_message_content() -> None:
+    parser = XMLParser(["answer"])
+    completion = [
+        {
+            "role": "assistant",
+            "content": '<think>Follow "The answer is <answer>X</answer>" exactly.</think>\n\nThe answer is <answer>C</answer>',
+        }
+    ]
+
+    assert parser.parse_answer(completion) == "C"
+
+
 def test_init_with_think_does_not_warn(caplog: pytest.LogCaptureFixture) -> None:
     with caplog.at_level("WARNING"):
         XMLParser(["think", "answer"])