From 9a48714a23a3ca4cf90b1095f1c290926b14d087 Mon Sep 17 00:00:00 2001
From: RomirJ <playindus@gmail.com>
Date: Fri, 15 May 2026 19:59:05 -0400
Subject: [PATCH] docs: VERIFICATION.md trust-receipt guide (closes #35)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds docs/verification.md walking users through the auto-generated
VERIFICATION.md artifact — the trust receipt proving an exported ONNX
model matches its PyTorch source at machine precision.

Structure verified against src/reflex/verification_report.py — every
section + field name in the doc maps to a real line of the renderer.

Adapted from #130:
- Kept the fixture numbers (1.192e-07, 2.384e-07, etc.) — they're
  float32 ULP multiples, characteristic of correctly-exported models,
  and 2.384e-07 lines up with pi0.5's actual first-action max_abs
  (2.38e-07) from the README parity ledger.
- Refreshed Reflex version (0.2.1 → 0.9.6) + added a note that the
  field auto-fills from reflex.__version__.
- Cross-referenced the parity ledger explicitly so readers can sanity-
  check their own runs against the shipped numbers (SmolVLA 5.96e-07,
  pi0 2.09e-07, pi0.5 2.38e-07, GR00T 8.34e-07).
- Renamed file from understanding_verification.md to verification.md
  for consistency with sibling docs (eval.md, embodiment_schema.md).
- Replaced broken TROUBLESHOOTING.md cross-ref with reflex doctor flow.
- Kept the 'Interpreting results by vertical' table — verticals match
  the FastCrest customer research vault's P0 tier.
- Added 'Files' section preamble (Total: N files, X size) that #130
  omitted but the live renderer emits.

Supersedes #130.

Co-Authored-By: Divyansh Rawat <186957976+DsThakurRawat@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/verification.md | 212 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 docs/verification.md

diff --git a/docs/verification.md b/docs/verification.md
new file mode 100644
index 0000000..2589405
--- /dev/null
+++ b/docs/verification.md
@@ -0,0 +1,212 @@
+# How to Read Your VERIFICATION.md
+
+The `VERIFICATION.md` file is your **trust receipt** — it proves that the exported ONNX model produces the same outputs as the original PyTorch checkpoint. This guide explains every section in plain English.
+
+The exact rendering code lives in [`src/reflex/verification_report.py`](../src/reflex/verification_report.py). If you ever see a mismatch between this doc and an actual report, the source file wins.
+
+---
+
+## When is it created?
+
+`VERIFICATION.md` is auto-generated at two points:
+
+1. **`reflex export`** — creates a skeleton with file hashes but no parity numbers yet.
+2. **`reflex validate`** — fills in the numerical parity results.
+
+Until you run `reflex validate`, the parity section will say _"Not yet verified. Run `reflex validate <export_dir>` to populate."_ The Files + Export metadata sections are populated at export time regardless.
+
+---
+
+## Section-by-section breakdown
+
+### Export metadata
+
+```markdown
+- **Model:** `lerobot/smolvla-base`
+- **Model type:** smolvla
+- **Target:** orin-nano
+- **ONNX opset:** 19
+- **Denoising steps (baked in):** 10
+- **Action chunk size:** 50
+- **Reflex version:** 0.9.6
+- **Platform:** Linux-5.15.0-aarch64
+```
+
+| Field | What it means |
+|---|---|
+| **Model** | The HuggingFace model ID or local path that was exported |
+| **Model type** | Architecture family: `smolvla`, `pi0`, `pi05`, `groot` |
+| **Target** | Hardware the export was optimized for: `orin-nano`, `desktop`, `thor`, etc. |
+| **ONNX opset** | ONNX operator set version. Higher = more ops available. Standard: 19 |
+| **Denoising steps** | Number of flow-matching denoise iterations baked into the ONNX graph. More steps = higher quality but slower inference |
+| **Action chunk size** | How many future actions the model predicts per inference call |
+| **Reflex version** | The `reflex-vla` package version used for export (auto-filled from `reflex.__version__`) |
+| **Platform** | `platform.platform()` output where the export was run |
+
+> **For drones:** The action chunk size is typically smaller (20 vs 50) because flight dynamics require faster replanning. The denoising steps may also be lower for latency-sensitive aerial deployments.
+
+---
+
+### Files table
+
+```markdown
+## Files
+
+Total: **3 files, 247.5MB**
+
+| File | Size | SHA256 |
+|---|---|---|
+| `model.onnx` | 245.3MB | `a1b2c3d4...` |
+| `reflex_config.json` | 1.2KB | `e5f6a7b8...` |
+| `tokenizer.json` | 957KB  | `f9a0b1c2...` |
+```
+
+| Column | What it means |
+|---|---|
+| **File** | Every file in your export directory (excluding `VERIFICATION.md` itself, which is regenerated each time) |
+| **Size** | Human-readable file size |
+| **SHA256** | Cryptographic hash — if even one byte changes, this hash changes completely |
+
+**Why SHA256 matters:**
+
+- **Integrity:** If you download an export from a teammate or CI, compare the SHA256 to confirm nothing was corrupted in transit or tampered with at rest.
+- **Reproducibility:** Two exports from the same model + same settings should produce identical hashes (given identical opset + precision + chunk_size).
+- **Audit trail:** For regulated verticals (warehouse safety, traffic management, defense), the SHA256 chain provides verifiable custody from model authorship → export run → fleet deployment.
+
+---
+
+### Parity section
+
+This is the most important part — it appears after running `reflex validate`.
+
+```markdown
+## Parity
+
+**Verdict:** PASS
+**Threshold:** 1e-04
+**Fixtures:** 5
+**Seed:** 42
+**max_abs_diff across all fixtures:** 2.384e-07
+
+| Fixture | max_abs_diff | mean_abs_diff | Passed |
+|---|---|---|---|
+| 0 | 1.192e-07 | 3.576e-08 | PASS |
+| 1 | 2.384e-07 | 4.768e-08 | PASS |
+| 2 | 1.192e-07 | 2.980e-08 | PASS |
+| 3 | 1.788e-07 | 4.172e-08 | PASS |
+| 4 | 1.192e-07 | 3.278e-08 | PASS |
+```
+
+The values in this example are float32 ULP multiples (1.192e-07 is the float32 machine epsilon at this magnitude) — characteristic of a correctly-exported model where the only disagreement between PyTorch and ONNX is at the floating-point precision floor. Reflex's parity ledger reports first-action `max_abs` of **5.96e-07** for SmolVLA, **2.09e-07** for pi0, **2.38e-07** for pi0.5, and **8.34e-07** for GR00T — all at machine precision, all reproducible with the same seed.
+
+#### Key metrics explained
+
+**`max_abs_diff` (Maximum Absolute Difference)**
+
+The largest difference between any single output value from PyTorch vs ONNX, across all action dimensions.
+
+- `2.384e-07` means the biggest disagreement was 0.000000238 — practically zero.
+- **Good values:** `< 1e-04` (the default threshold)
+- **Concerning values:** `> 1e-03` — the ONNX model may behave differently in deployment
+- **Failing values:** `> 1e-02` — the export is unreliable; do not deploy
+
+> Mental model: "In the worst case, across all test inputs, how far off was any single predicted joint angle (or thrust value for drones)?"
+
+**`mean_abs_diff` (Mean Absolute Difference)**
+
+The average difference across all output values. Always smaller than or equal to `max_abs_diff`.
+
+- Useful for seeing if the error is concentrated in one spot or spread evenly.
+- If `mean_abs_diff` ≈ `max_abs_diff`, the error is spread evenly (usually fine).
+- If `mean_abs_diff` << `max_abs_diff`, one outlier dimension is noisy (investigate before deploying).
+
+**`Threshold`**
+
+The configurable pass/fail cutoff. Default: `1e-04` (0.0001).
+
+- If `max_abs_diff` < threshold → **PASS**
+- If `max_abs_diff` ≥ threshold → **FAIL**
+
+```bash
+# Override the threshold
+reflex validate ./reflex_export/ --threshold 1e-3  # more lenient
+reflex validate ./reflex_export/ --threshold 1e-5  # stricter
+```
+
+**`Fixtures`**
+
+The number of random test inputs used. Each fixture is a synthetic (image, instruction, state) tuple. More fixtures = higher confidence the parity holds across input distribution.
+
+**`Seed`**
+
+The random seed used to generate fixtures. Same seed + same model + same export settings = identical results. This is what makes the verification **reproducible** by anyone with the same `reflex-vla` version.
+
+---
+
+### Reproducer
+
+```markdown
+## Reproducer
+
+```bash
+reflex export lerobot/smolvla-base --target orin-nano --output <dir>
+reflex validate <dir>
+```
+```
+
+Anyone with the model ID + target + opset listed in the metadata block can reproduce the entire export + validation pipeline from scratch and verify the SHA256s + parity numbers match. If they don't match, either (a) the model on HF was updated, or (b) the runner is on a different Reflex version — the version field at the top tells them which.
+
+---
+
+## Interpreting results by vertical
+
+Different deployments tolerate different levels of `max_abs_diff`. The shipped 1e-04 default is comfortable for most edge-VLA use cases; tightening it for high-precision deployments or loosening it for perception-only models is a deliberate trade-off.
+
+| Vertical | Acceptable `max_abs_diff` | Notes |
+|---|---|---|
+| **Warehouse arms** (Franka / UR5 class) | `< 1e-04` | Tight tolerance for precise pick-and-place at sub-millimeter scale |
+| **Farm robotics / SO-100 class** | `< 1e-03` | More tolerant for coarse outdoor manipulation |
+| **Aerial drones** | `< 1e-04` | Flight control requires high fidelity — small per-step diffs compound at 50 Hz |
+| **Smart-camera deployments** (retail, traffic management) | `< 1e-03` | Classification-oriented — tolerant of small numerical drift in attention layers |
+| **Mining / heavy industrial** | `< 1e-04` | Safety-critical — minimize any deployment-time surprises |
+
+---
+
+## What to do if validation fails
+
+1. **Re-export with default settings:**
+
+   ```bash
+   reflex export <model> --target desktop --precision fp16
+   reflex validate ./reflex_export/
+   ```
+
+   Some custom targets or precisions (`fp8`, `int8`) widen the tolerance gap. Start with `fp16` on the `desktop` target to isolate model issues from quantization issues.
+
+2. **Try a lower opset:**
+
+   ```bash
+   reflex export <model> --opset 17
+   ```
+
+   Some attention-heavy models hit precision-sensitive ops in opset 19; opset 17 falls back to slower but more numerically stable kernels.
+
+3. **Run `reflex doctor`:**
+
+   ```bash
+   reflex doctor --export-dir ./reflex_export/
+   ```
+
+   Catches the common silent-failure modes (cuDNN version skew, TRT EP loadchain breakage, JetPack mismatches) that manifest as parity failures.
+
+4. **File an issue:** If a shipped model in the registry consistently fails validation, open a GitHub issue with the full `VERIFICATION.md` attached + your `reflex doctor` output. The Reflex maintainers can reproduce against the registry's expected hash.
+
+---
+
+## See also
+
+- [`docs/cli_reference.md`](./cli_reference.md) — full flag list for `reflex export` and `reflex validate`
+- [`docs/eval.md`](./eval.md) — task-success eval (parity is a necessary but not sufficient condition; pass eval also)
+- [`docs/doctor_check_list.md`](./doctor_check_list.md) — what `reflex doctor` checks
+- [`docs/adding_a_robot.md`](./adding_a_robot.md) — embodiment cookbook (the embodiment config affects state shape passed to the policy)
+- [`src/reflex/verification_report.py`](../src/reflex/verification_report.py) — the renderer (authoritative)