From 9a48714a23a3ca4cf90b1095f1c290926b14d087 Mon Sep 17 00:00:00 2001
From: RomirJ
Date: Fri, 15 May 2026 19:59:05 -0400
Subject: [PATCH] docs: VERIFICATION.md trust-receipt guide (closes #35)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds docs/verification.md walking users through the auto-generated
VERIFICATION.md artifact — the trust receipt proving an exported ONNX
model matches its PyTorch source at machine precision.

Structure verified against src/reflex/verification_report.py — every
section + field name in the doc maps to a real line of the renderer.

Adapted from #130:
- Kept the fixture numbers (1.192e-07, 2.384e-07, etc.) — they're
  float32 ULP multiples, characteristic of correctly-exported models,
  and 2.384e-07 lines up with pi0.5's actual first-action max_abs
  (2.38e-07) from the README parity ledger.
- Refreshed Reflex version (0.2.1 → 0.9.6) + added a note that the
  field auto-fills from reflex.__version__.
- Cross-referenced the parity ledger explicitly so readers can
  sanity-check their own runs against the shipped numbers (SmolVLA
  5.96e-07, pi0 2.09e-07, pi0.5 2.38e-07, GR00T 8.34e-07).
- Renamed file from understanding_verification.md to verification.md
  for consistency with sibling docs (eval.md, embodiment_schema.md).
- Replaced broken TROUBLESHOOTING.md cross-ref with reflex doctor flow.
- Kept the 'Interpreting results by vertical' table — verticals match
  the FastCrest customer research vault's P0 tier.
- Added 'Files' section preamble (Total: N files, X size) that #130
  omitted but the live renderer emits.

Supersedes #130.

Co-Authored-By: Divyansh Rawat <186957976+DsThakurRawat@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/verification.md | 212 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 docs/verification.md

diff --git a/docs/verification.md b/docs/verification.md
new file mode 100644
index 0000000..2589405
--- /dev/null
+++ b/docs/verification.md
@@ -0,0 +1,212 @@
+# How to Read Your VERIFICATION.md
+
+The `VERIFICATION.md` file is your **trust receipt** — it proves that the exported ONNX model produces the same outputs as the original PyTorch checkpoint. This guide explains every section in plain English.
+
+The exact rendering code lives in [`src/reflex/verification_report.py`](../src/reflex/verification_report.py). If you ever see a mismatch between this doc and an actual report, the source file wins.
+
+---
+
+## When is it created?
+
+`VERIFICATION.md` is auto-generated at two points:
+
+1. **`reflex export`** — creates a skeleton with file hashes but no parity numbers yet.
+2. **`reflex validate`** — fills in the numerical parity results.
+
+Until you run `reflex validate`, the parity section will say _"Not yet verified. Run `reflex validate ` to populate."_ The Files + Export metadata sections are populated at export time regardless.
+
+---
+
+## Section-by-section breakdown
+
+### Export metadata
+
+```markdown
+- **Model:** `lerobot/smolvla-base`
+- **Model type:** smolvla
+- **Target:** orin-nano
+- **ONNX opset:** 19
+- **Denoising steps (baked in):** 10
+- **Action chunk size:** 50
+- **Reflex version:** 0.9.6
+- **Platform:** Linux-5.15.0-aarch64
+```
+
+| Field | What it means |
+|---|---|
+| **Model** | The HuggingFace model ID or local path that was exported |
+| **Model type** | Architecture family: `smolvla`, `pi0`, `pi05`, `groot` |
+| **Target** | Hardware the export was optimized for: `orin-nano`, `desktop`, `thor`, etc. |
+| **ONNX opset** | ONNX operator set version. Higher = more ops available. Standard: 19 |
+| **Denoising steps** | Number of flow-matching denoise iterations baked into the ONNX graph. More steps = higher quality but slower inference |
+| **Action chunk size** | How many future actions the model predicts per inference call |
+| **Reflex version** | The `reflex-vla` package version used for export (auto-filled from `reflex.__version__`) |
+| **Platform** | `platform.platform()` output where the export was run |
+
+> **For drones:** The action chunk size is typically smaller (20 vs 50) because flight dynamics require faster replanning. The denoising steps may also be lower for latency-sensitive aerial deployments.
+
+---
+
+### Files table
+
+```markdown
+## Files
+
+Total: **3 files, 247.5MB**
+
+| File | Size | SHA256 |
+|---|---|---|
+| `model.onnx` | 245.3MB | `a1b2c3d4...` |
+| `reflex_config.json` | 1.2KB | `e5f6a7b8...` |
+| `tokenizer.json` | 957KB | `f9a0b1c2...` |
+```
+
+| Column | What it means |
+|---|---|
+| **File** | Every file in your export directory (excluding `VERIFICATION.md` itself, which is regenerated each time) |
+| **Size** | Human-readable file size |
+| **SHA256** | Cryptographic hash — if even one byte changes, this hash changes completely |
+
+**Why SHA256 matters:**
+
+- **Integrity:** If you download an export from a teammate or CI, compare the SHA256 to confirm nothing was corrupted in transit or tampered with at rest.
+- **Reproducibility:** Two exports from the same model + same settings should produce identical hashes (given identical opset + precision + chunk_size).
+- **Audit trail:** For regulated verticals (warehouse safety, traffic management, defense), the SHA256 chain provides verifiable custody from model authorship → export run → fleet deployment.
+
+---
+
+### Parity section
+
+This is the most important part — it appears after running `reflex validate`.
+
+```markdown
+## Parity
+
+**Verdict:** PASS
+**Threshold:** 1e-04
+**Fixtures:** 5
+**Seed:** 42
+**max_abs_diff across all fixtures:** 2.384e-07
+
+| Fixture | max_abs_diff | mean_abs_diff | Passed |
+|---|---|---|---|
+| 0 | 1.192e-07 | 3.576e-08 | PASS |
+| 1 | 2.384e-07 | 4.768e-08 | PASS |
+| 2 | 1.192e-07 | 2.980e-08 | PASS |
+| 3 | 1.788e-07 | 4.172e-08 | PASS |
+| 4 | 1.192e-07 | 3.278e-08 | PASS |
+```
+
+The values in this example are float32 ULP multiples (1.192e-07 is 2^-23, the float32 machine epsilon, i.e. the spacing between adjacent floats just above 1.0), which is characteristic of a correctly-exported model where the only disagreement between PyTorch and ONNX is at the floating-point precision floor. Reflex's parity ledger reports first-action `max_abs` of **5.96e-07** for SmolVLA, **2.09e-07** for pi0, **2.38e-07** for pi0.5, and **8.34e-07** for GR00T, all at machine precision and all reproducible with the same seed.
+
+#### Key metrics explained
+
+**`max_abs_diff` (Maximum Absolute Difference)**
+
+The largest difference between any single output value from PyTorch vs ONNX, across all action dimensions.
+
+- `2.384e-07` means the biggest disagreement was 0.000000238 — practically zero.
+- **Good values:** `< 1e-04` (the default threshold; anything at or above it is reported as FAIL)
+- **Concerning values:** `> 1e-03` — the ONNX model may behave differently in deployment
+- **Failing values:** `> 1e-02` — the export is unreliable; do not deploy
+
+> Mental model: "In the worst case, across all test inputs, how far off was any single predicted joint angle (or thrust value for drones)?"
+
+**`mean_abs_diff` (Mean Absolute Difference)**
+
+The average difference across all output values. Always smaller than or equal to `max_abs_diff`.
+
+- Useful for seeing if the error is concentrated in one spot or spread evenly.
+- If `mean_abs_diff` ≈ `max_abs_diff`, the error is spread evenly (usually fine).
+- If `mean_abs_diff` << `max_abs_diff`, one outlier dimension is noisy (investigate before deploying).
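
The two metrics above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code `reflex validate` runs: the function names `parity_metrics` and `verdict` and the sample output values are assumptions made for the example. It treats each model output as a flat list of floats and applies the documented rule that a run passes when `max_abs_diff` is below the threshold (default `1e-04`).

```python
from typing import Sequence, Tuple


def parity_metrics(reference: Sequence[float], candidate: Sequence[float]) -> Tuple[float, float]:
    """Return (max_abs_diff, mean_abs_diff) for two equal-length flat outputs."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("outputs must be non-empty and the same length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return max(diffs), sum(diffs) / len(diffs)


def verdict(max_abs_diff: float, threshold: float = 1e-4) -> str:
    """PASS only when the worst-case difference is strictly below the threshold."""
    return "PASS" if max_abs_diff < threshold else "FAIL"


# Hypothetical flattened action outputs from the two runtimes (not real model data).
pytorch_out = [0.25, -0.5, 0.75, 1.0]
onnx_out = [0.25, -0.5, 0.75, 1.0 + 2e-7]  # one value perturbed near float32 precision

max_diff, mean_diff = parity_metrics(pytorch_out, onnx_out)
print(f"max_abs_diff={max_diff:.3e} mean_abs_diff={mean_diff:.3e} -> {verdict(max_diff)}")
```

In this sample only one of the four values differs, so `mean_abs_diff` is a quarter of `max_abs_diff`, which is the outlier pattern described in the last bullet above.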
+
+**`Threshold`**
+
+The configurable pass/fail cutoff. Default: `1e-04` (0.0001).
+
+- If `max_abs_diff` < threshold → **PASS**
+- If `max_abs_diff` ≥ threshold → **FAIL**
+
+```bash
+# Override the threshold
+reflex validate ./reflex_export/ --threshold 1e-3  # more lenient
+reflex validate ./reflex_export/ --threshold 1e-5  # stricter
+```
+
+**`Fixtures`**
+
+The number of random test inputs used. Each fixture is a synthetic (image, instruction, state) tuple. More fixtures = higher confidence that parity holds across the input distribution.
+
+**`Seed`**
+
+The random seed used to generate fixtures. Same seed + same model + same export settings = identical results. This is what makes the verification **reproducible** by anyone with the same `reflex-vla` version.
+
+---
+
+### Reproducer
+
+````markdown
+## Reproducer
+
+```bash
+reflex export lerobot/smolvla-base --target orin-nano --output
+reflex validate
+```
+````
+
+Anyone with the model ID + target + opset listed in the metadata block can reproduce the entire export + validation pipeline from scratch and verify the SHA256s + parity numbers match. If they don't match, either (a) the model on HF was updated, or (b) the runner is on a different Reflex version — the version field at the top tells them which.
+
+---
+
+## Interpreting results by vertical
+
+Different deployments tolerate different levels of `max_abs_diff`. The shipped 1e-04 default is comfortable for most edge-VLA use cases; tightening it for high-precision deployments or loosening it for perception-only models is a deliberate trade-off.
+
+| Vertical | Acceptable `max_abs_diff` | Notes |
+|---|---|---|
+| **Warehouse arms** (Franka / UR5 class) | `< 1e-04` | Tight tolerance for precise pick-and-place at sub-millimeter scale |
+| **Farm robotics / SO-100 class** | `< 1e-03` | More tolerant for coarse outdoor manipulation |
+| **Aerial drones** | `< 1e-04` | Flight control requires high fidelity — small per-step diffs compound at 50 Hz |
+| **Smart-camera deployments** (retail, traffic management) | `< 1e-03` | Classification-oriented — tolerant of small numerical drift in attention layers |
+| **Mining / heavy industrial** | `< 1e-04` | Safety-critical — minimize any deployment-time surprises |
+
+---
+
+## What to do if validation fails
+
+1. **Re-export with default settings:**
+
+   ```bash
+   reflex export --target desktop --precision fp16
+   reflex validate ./reflex_export/
+   ```
+
+   Some custom targets or lower precisions (`fp8`, `int8`) widen the gap between PyTorch and ONNX outputs. Start with `fp16` on the `desktop` target to isolate model issues from quantization issues.
+
+2. **Try a lower opset:**
+
+   ```bash
+   reflex export --opset 17
+   ```
+
+   Some attention-heavy models hit precision-sensitive ops in opset 19; opset 17 falls back to slower but more numerically stable kernels.
+
+3. **Run `reflex doctor`:**
+
+   ```bash
+   reflex doctor --export-dir ./reflex_export/
+   ```
+
+   Catches the common silent-failure modes (cuDNN version skew, TRT EP loadchain breakage, JetPack mismatches) that manifest as parity failures.
+
+4. **File an issue:** If a shipped model in the registry consistently fails validation, open a GitHub issue with the full `VERIFICATION.md` attached + your `reflex doctor` output. The Reflex maintainers can reproduce against the registry's expected hash.
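
When comparing a local export against hashes from a teammate, CI, or the registry, the SHA256 column of the Files table can be recomputed independently. The sketch below uses only Python's standard `hashlib`. The helper names `sha256_of` and `hash_export_dir` are hypothetical, and it assumes the export directory is flat and that `VERIFICATION.md` is skipped, as the Files table describes; the actual renderer in `verification_report.py` may walk the directory differently.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in chunks and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def hash_export_dir(export_dir: Path) -> Dict[str, str]:
    """Map each top-level file name to its SHA-256, skipping VERIFICATION.md."""
    return {
        entry.name: sha256_of(entry)
        for entry in sorted(export_dir.iterdir())
        if entry.is_file() and entry.name != "VERIFICATION.md"
    }


# Demo on a throwaway directory with placeholder bytes (not a real export).
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp)
    (export_dir / "model.onnx").write_bytes(b"placeholder bytes, not a real model")
    (export_dir / "VERIFICATION.md").write_text("regenerated each run, so not hashed")
    hashes = hash_export_dir(export_dir)
    for name, digest in hashes.items():
        print(f"{digest}  {name}")
```

Pointing `hash_export_dir` at a real export directory and comparing the digests with the ones in `VERIFICATION.md` shows whether any file changed since the report was written.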
+
+---
+
+## See also
+
+- [`docs/cli_reference.md`](./cli_reference.md) — full flag list for `reflex export` and `reflex validate`
+- [`docs/eval.md`](./eval.md) — task-success eval (parity is necessary but not sufficient; run the eval as well)
+- [`docs/doctor_check_list.md`](./doctor_check_list.md) — what `reflex doctor` checks
+- [`docs/adding_a_robot.md`](./adding_a_robot.md) — embodiment cookbook (the embodiment config affects state shape passed to the policy)
+- [`src/reflex/verification_report.py`](../src/reflex/verification_report.py) — the renderer (authoritative)