From 6d83ba6b09b661528fc79d716f75b7afa58a49e2 Mon Sep 17 00:00:00 2001
From: DsThakurRawat <divyanshrawatofficial@gmail.com>
Date: Fri, 15 May 2026 01:11:02 +0530
Subject: [PATCH] docs: explain VERIFICATION.md metrics in plain English
 (closes #35)

---
 docs/understanding_verification.md | 186 +++++++++++++++++++++++++++++
 1 file changed, 186 insertions(+)
 create mode 100644 docs/understanding_verification.md

diff --git a/docs/understanding_verification.md b/docs/understanding_verification.md
new file mode 100644
index 0000000..7a06412
--- /dev/null
+++ b/docs/understanding_verification.md
@@ -0,0 +1,186 @@
+# How to Read Your VERIFICATION.md
+
+> The `VERIFICATION.md` file is your **trust receipt**: it records evidence that the exported ONNX model reproduces the outputs of the original PyTorch checkpoint to within a numerical tolerance. This guide explains every section in plain English.
+
+---
+
+## When Is It Created?
+
+`VERIFICATION.md` is auto-generated at two points:
+
+1. **`reflex export`** — creates a skeleton with file hashes but no parity numbers yet
+2. **`reflex validate`** — fills in the numerical parity results
+
+Until you run `reflex validate`, the parity section will say _"Not yet verified."_
+
+---
+
+## Section-by-Section Breakdown
+
+### Export Metadata
+
+```markdown
+- **Model:** `lerobot/smolvla-base`
+- **Model type:** smolvla
+- **Target:** orin-nano
+- **ONNX opset:** 19
+- **Denoising steps (baked in):** 10
+- **Action chunk size:** 50
+- **Reflex version:** 0.2.1
+- **Platform:** Linux-5.15.0-aarch64
+```
+
+| Field | What It Means |
+|---|---|
+| **Model** | The HuggingFace model ID or local path that was exported |
+| **Model type** | Architecture family: `smolvla`, `pi0`, `pi05`, `groot` |
+| **Target** | Hardware the export was optimized for: `orin-nano`, `desktop`, etc. |
+| **ONNX opset** | ONNX operator set version. Higher = more ops available. Standard: 19 |
+| **Denoising steps** | Number of flow-matching denoise iterations baked into the ONNX graph. More steps = higher quality but slower inference |
+| **Action chunk size** | How many future actions the model predicts per inference call |
+| **Reflex version** | The `reflex-vla` package version used for export |
+| **Platform** | OS and architecture where the export was run |
+
+> **For drones:** The action chunk size is typically smaller (20 vs 50) because flight dynamics require faster replanning. The denoising steps may also be lower for latency-sensitive aerial deployments.
+
+---
+
+### Files Table
+
+```markdown
+| File | Size | SHA256 |
+|---|---|---|
+| `model.onnx` | 245.3MB | `a1b2c3d4...` |
+| `reflex_config.json` | 1.2KB | `e5f6a7b8...` |
+```
+
+| Column | What It Means |
+|---|---|
+| **File** | Every file in your export directory (excluding VERIFICATION.md itself) |
+| **Size** | Human-readable file size |
+| **SHA256** | Cryptographic hash — if even one byte changes, this hash changes completely |
+
+**Why SHA256 matters:**
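+- **Integrity:** If you download an export from a teammate or CI, recompute each file's SHA256 and compare it to this table to confirm nothing was corrupted. The check detects tampering only if `VERIFICATION.md` itself comes from a trusted source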
+- **Reproducibility:** Two exports from the same model + settings should produce identical hashes
+- **Audit trail:** For regulated verticals (warehouse safety, traffic management), SHA256 provides a verifiable chain of custody
+
+---
+
+### Parity Section
+
+This is the most important part — it appears after running `reflex validate`.
+
+```markdown
+## Parity
+
+**Verdict:** PASS
+**Threshold:** 1e-04
+**Fixtures:** 5
+**Seed:** 42
+**max_abs_diff across all fixtures:** 2.384e-07
+
+| Fixture | max_abs_diff | mean_abs_diff | Passed |
+|---|---|---|---|
+| 0 | 1.192e-07 | 3.576e-08 | PASS |
+| 1 | 2.384e-07 | 4.768e-08 | PASS |
+| 2 | 1.192e-07 | 2.980e-08 | PASS |
+| 3 | 1.788e-07 | 4.172e-08 | PASS |
+| 4 | 1.192e-07 | 3.278e-08 | PASS |
+```
+
+#### Key Metrics Explained
+
+**`max_abs_diff` (Maximum Absolute Difference)**
+
+The largest absolute difference between corresponding PyTorch and ONNX output values, taken over every action dimension and every fixture.
+
+- `2.384e-07` means the biggest disagreement was 0.000000238 — practically zero
+- **Good values:** `< 1e-04` (passes the default threshold)
+- **Concerning values:** `> 1e-03` (fails even a lenient `1e-3` threshold; the ONNX model may behave differently)
+- **Unusable values:** `> 1e-02` (the export is unreliable; do not deploy)
+
+> Think of it as: "In the worst case, across all test inputs, how far off was any single predicted joint angle (or thrust value for drones)?"
+
+**`mean_abs_diff` (Mean Absolute Difference)**
+
+The average absolute difference across all output values. It is never larger than `max_abs_diff`, and the two are equal only when every value differs by the same amount.
+
+- Useful for seeing if the error is concentrated in one spot or spread evenly
+- If `mean_abs_diff` ≈ `max_abs_diff`, the error is spread evenly (usually fine)
+- If `mean_abs_diff` << `max_abs_diff`, a few outlier values dominate the error (investigate which action dimension they belong to)
+
+**`Threshold`**
+
+The configurable pass/fail cutoff. Default: `1e-04` (0.0001).
+
+- If `max_abs_diff` < threshold → **PASS**
+- If `max_abs_diff` ≥ threshold → **FAIL**
+
+```bash
+# Override the threshold
+reflex validate ./reflex_export/ --threshold 1e-3  # more lenient
+reflex validate ./reflex_export/ --threshold 1e-5  # stricter
+```
+
+**`Fixtures`**
+
+The number of random test inputs used. Each fixture is a synthetic (image, instruction, state) tuple. More fixtures = higher confidence.
+
+**`Seed`**
+
+The random seed used to generate fixtures. Same seed + same model = identical results. This is what makes the verification **reproducible**.
+
+---
+
+### Reproducer
+
+````markdown
+## Reproducer
+
+```bash
+reflex export lerobot/smolvla-base --target orin-nano --output <dir>
+reflex validate <dir>
+```
+````
+
+This section gives anyone the exact commands to reproduce the entire export + validation pipeline from scratch.
+
+---
+
+## Interpreting Results by Vertical
+
+| Vertical | Acceptable `max_abs_diff` | Notes |
+|---|---|---|
+| **Warehouse arms** | `< 1e-04` | Tight tolerance for precise pick-and-place |
+| **Farm robotics** | `< 1e-03` | Slightly more tolerant for coarse outdoor manipulation |
+| **Aerial drones** | `< 1e-04` | Flight control requires high fidelity — small diffs compound at 50 Hz |
+| **Retail cameras** | `< 1e-03` | Perception tasks are more tolerant of small numerical diffs |
+| **Traffic AI** | `< 1e-03` | Classification-oriented — tolerant of action-space drift |
+
+---
+
+## What To Do If Validation Fails
+
+1. **Re-export with explicit target and precision settings:**
+   ```bash
+   reflex export <model> --target desktop --precision fp16
+   reflex validate ./reflex_export/
+   ```
+
+2. **Try a lower opset:**
+   ```bash
+   reflex export <model> --opset 17
+   ```
+
+3. **Check for known issues:** Some models have precision-sensitive attention layers. See [TROUBLESHOOTING.md](./TROUBLESHOOTING.md).
+
+4. **File a bug:** If a shipped model consistently fails validation, open an issue with the full VERIFICATION.md attached.
+
+---
+
+## Further Reading
+
+- [CLI Command Reference](./cli_reference.md) — `reflex export` and `reflex validate` flags
+- [Troubleshooting](./TROUBLESHOOTING.md) — CUDA and export errors
+- [Adding a Robot](./adding_a_robot.md) — embodiment cookbook