Univariate skip for the batched AIR sumcheck (+ finite-difference kernel) by Barnadrot · Pull Request #251 · leanEthereum/leanVM

Barnadrot · 2026-06-11T09:22:35Z

Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)

Branch pw13-clean Two kept mechanisms, both targeting the AIR-constraint-evaluation cost center identified by profiling (the batched AIR sumcheck was 416 ms = 20.7 % of a 2.01 s prove on the reference workload).

Introduction — the mechanisms in one paragraph each

Mechanism 1 — univariate skip for the batched AIR sumcheck (h1, commits T1–T7). The AIR zerocheck paid extension-field constraint evaluations from round 1 onward (measured: rounds ≥ 1 = 83 % of the sumcheck's cost; EF mult ≈ 15× base). We replace the first K = 4 rounds with ONE univariate round over the 16-point integer window {0..15} (Gruen, eprint 2024/108 §5): the prover evaluates all tables' constraints at 31 nodes entirely in the base field with AVX-512 packing, sends a single combined coefficient vector, and binds 4 variables with one challenge. Table batching switches from back-loaded to front-loaded (static weights w_t = 2^(n_max−n_t); no challenge products), and the WHIR opening claims become tensor-weight statements (Lagrange₁₆(r₀) ⊗ eq) via a new optional tail on SparseStatement. The extension-field round tail shrinks 16×.

Mechanism 2 — finite-difference extension kernel (h4, commit U1). Inside the skip round, extending each column's 16 window values to the 135 extended integer nodes was done with per-node Lagrange dots (16 packed Montgomery muls per column per node). Newton forward-difference cascades replace this with 15 packed adds per column per node (exact field arithmetic — proofs are bit-identical, this is kernel-only). Applied at all three interpolation sites (columns, degree-split cached state, low-evals accumulator).

Results vs `main` (`191e1a5`)

All rows from ONE self-consistent paired run of main vs branch tip by eval_paired.sh (current harness method: arena-enabled prove_loop, env preflight, A-vs-A noise floor 0.43 %, 3 counterbalanced rounds × 4 warm proofs each side, Welch's t-test), 2026-06-11.

Metric	main (`191e1a5`)	branch (`ff7eeb2`)	Δ
Native prover (prove_loop warm avg, 1550 sigs)	2.0905 s	1.8435 s	−11.82 % (t = −75.1, p ≈ 0)
— steady-state confirmation (250 proofs/side, last 50)	2.1454 s	1.8993 s	−11.47 % ("improved")
XMSS throughput (1550 / warm avg)	741 XMSS/s	841 XMSS/s	+13.4 %
Recursion (`recursion --n 2 --log-inv-rate 2`, total)	3.521 s	3.298 s	−6.33 %
Recursion aggregation-node cycles	219,433	218,819	−614 (in-circuit verification got cheaper)
Proof size (warm, KiB)	492	492	+0.0 % (h1's +2.1 KiB transcript is below integer-KiB resolution; h4 is bit-identical)
Recursion bytecode	209,762 instr (log_size 18)	213,227 instr (log_size 18)	+1.65 %, no power-of-2 crossing

Per-iteration gate verdicts when each mechanism landed (different intermediate baselines): h1 −6.66 % (p ≈ 0, steady-state −5.68 %, recursion −4.08 %; pre-fix harness timing method), h4 −3.10 % (p = 2.9e-05, recursion −3.25 %, proofs bit-identical; current method). The table above is the direct A–B comparison and is the defensible headline.

Measurements on M4 Mac Mini 32 GB

Metric	main (`ec1f090`)	pw13-clean (`0bbc8cc`)	Δ
Native prover (lean-multisig xmss, 1550 sigs, 3-run avg)	2.210 s	1.910 s	−13.59 %
XMSS throughput (1550 / avg)	701 XMSS/s	812 XMSS/s	+15.7 %
Recursion (recursion --n 2 --log-inv-rate 2, total)	3.808 s	3.579 s	−6.01 %
Recursion aggregation-node cycles	213,207	219,433	+6,226 (+2.9%, in-circuit verification slightly more expensive)
Proof size (JSON format, KiB)	344	344	+0.0 %
Recursion bytecode	209,762 instr (log_size 18)	213,227 instr (log_size 18)	+1.65 %, no power-of-2 crossing

Validation methods used by the harness (for reviewers)

Correctness gate (harness/leanvm/correctness/correctness.sh): Layer 0 test-file integrity (SHA-pinned), 0.5 crypto-parameter guard, 1 clippy -D warnings, 2 field/backend tests, 3 structural invariants (column counts, bus widths, logup overflow, WHIR config Rust↔Python consistency), 4 WHIR roundtrips, 5 aggregation e2e (test_aggregation: real XMSS proofs + in-circuit verification), 6 free-variable soundness (soundness_check: every committed/virtual Poseidon column constrained), 7 transcript-mutation fuzzer (200 random mutations → all must be rejected; run at 3 seeds across the iterations, 0 accepted total). Layers 0.7/0.8 (protected-file diff hygiene + frozen-main differential verification) fail by design for any transcript-shape change — this PR is the human-review path those layers prescribe.
Python verifier as executable spec (crates/lean_prover/python-verifier/): re-dumped vectors (5 shapes spanning table-height mixes and rates 1–4) all verify; independent tamper sweeps (92 + 7 positions) all rejected, each at the expected assertion.
Performance gate (harness/leanvm/scripts/eval_paired.sh): env preflight (governor/load), A-vs-A noise floor (≤ 0.3 %), N=3 counterbalanced rounds × 4 warm proofs, Welch's t-test (keep iff Δ ≤ −1 %, p < 0.01), recursion-regression ceiling (+3 %), proof-size ceiling (+20 %) with 3× net penalty; auto-chained 250-proof steady-state confirmation.
Pre-wiring kill-gates: each mechanism was benchmarked against the exact code it replaces before any protocol wiring (h1: skip kernel 234 ms vs legacy rounds 403 ms; h4: FD 125.9 ms vs Lagrange 212.4 ms), with bit-identity or brute-force-reference equality tests pinning correctness independently of the protocol path.
Per-commit independent review: every commit was reviewed by an independent agent against the written plan spec (plan_spec.md), including full re-derivations of the front-loading bookkeeping, the FD anchor invariant, the closed-form next_mle_with_tail (numerically confirmed 144/144), and the leaf/tensor index conventions.

Mechanism 1 (h1) — soundness in detail

Security regime (unchanged): 124 bits, Johnson bound. |EF| = p⁵ ≈ 2^154.6 (KoalaBear quintic, X⁵+X²−1). WHIR parameters, RS domains, query counts, grinding bits, Merkle arity, WHIR_CONFIGS, and the Poseidon permutation are bit-identical to main. Full write-up: report/hypothesis_1/security_regime.md.

What changes, protocol-wise. The batched AIR sumcheck proves Σ_t Σ_{x∈H_{n_t}} eq(gkr_pt_t, x)·C_t(cols_t(x)) = Σ_t S_t, where S_t are the per-table bus claims the verifier computes itself from the LogUp-GKR outputs (verify_execution.rs:113–132, unchanged arithmetic).

Front-loaded batching. Every table joins at round 0 with static weight w_t = 2^(n_max−n_t); a finished table contributes claim/2 per remaining round, folded into the constant coefficient. The verifier's per-round identity h(0)+h(1) = target and its c₀-reconstruction are untouched; the final combined value is Σ_t s_t^final with no challenge products (replaces back_loaded_table_contribution). Soundness argument is identical to the back-loaded scheme — the per-table multiplier is a monomial in the unused variables; per-round Schwartz–Zippel error unchanged at ≤ 12/|EF|.
The skip round (Gruen 2024/108 §5–6, soundness A.4). The first K=4 variables of every table (the low row-index bits; their eq coordinates are the last K entries of gkr_point — identical across tables since all eq factors are suffixes of it) are bound by one univariate round. The prover sends the combined polynomial v′(X) = Σ_t w_t·v′t(X), deg ≤ (2^K−1)·d_max = 150, as 151 coefficients. The verifier checks Σ{z∈D} ê(z)·v′(z) = Σ_t w_t·S_t where ê is the degree-15 interpolation of eq(gkr_top, bits(·)) over D — an exact identity — then samples r₀ ∈ EF and continues with target ê(r₀)·v′(r₀). Cheating error for the round: ≤ deg(v′)/|EF| ≈ 150/2^154.6 ≈ 2^−147.4. Per-table claims stay α-separated exactly as on main (the air_alpha powers are disjoint per table); sending the combined polynomial is the same compression every other round already uses.
Remaining rounds (n_max−K, degree ≤ 12): protocol-identical to main, strictly fewer rounds.
Opening claims. Each table's columns are opened at the tensor point (Lagrange₁₆(r₀) on the last 4 coordinates) ⊗ eq(prefix challenges). WHIR's SparseStatement gains an optional tail: Vec<EF>; the weight is eq(point) ⊗ MLE(tail) — still a public, verifier-evaluable multilinear weight, i.e. exactly the statement class WHIR already proves (constrained-RS codes with arbitrary public weights; Arnon–Chiesa–Fenzi–Yogev 2024/1586). tail = None paths are bit-identical to main (enforced by behavioral-equivalence tests: a tailed statement with tail = eq-expansion of 4 challenges is byte-equivalent to the plain statement with those challenges appended). Shift("next") statements use the shift-by-one of the tail-seeded eq vector, validated against Σ_x tail[x]·next_mle identities.
Fiat–Shamir. Skip coefficients are absorbed before r₀ is squeezed (same absorb→sample discipline as every round, same duplex sponge). Transcript grows ≈ +107 EF (+0.6 %); measured proof-size delta at the gate: 0.0 KiB at integer-KiB resolution.
Recursion. The zkDSL circuit verifies the identical protocol (compiled from the same identities; Lagrange weights via compile-time-inverted denominators — no in-circuit inversions anywhere). Bytecode +1.65 %, log_size 18 unchanged; aggregation-node cycles went down (−614) because one grinding event replaces four and three sumcheck rounds disappear.

Added soundness error, total: < 2^−146 against a 2^−124 budget — absorbed without moving the regime. The mutation fuzzer and the python tamper sweeps exercise the new verifier checks directly (window-identity rejection observed at the exact assertion).

Mechanism 2 (h4) — soundness in detail

There is no protocol surface: the FD cascade computes the same field elements the Lagrange dots computed (Newton forward differences on unit-spaced nodes; exact arithmetic in odd characteristic; the division-free advance is add-only). Equality is enforced, not assumed: test_fd_extension_matches_lagrange pins bit-identical full coefficient vectors across the generic, degree-split, and unpacked kernels, the padding path, and K ∈ {3,4,5} (10 configurations), and the entire iter-1 test battery (window-sum identities vs brute-force cube sums, post-skip state vs naive references, e2e roundtrips, production prove+verify) runs unmodified on the FD path. The independent review re-derived the anchor invariant (row_i = Δ^{n−1−i}p(i), ascending advance) and the degree-split lockstep node accounting. Proofs produced before/after the commit are byte-identical; the verifier is untouched.

What needs human audit (and what has already been independently checked)

An independent reviewer (Opus-class agent, read-level + hand-derivation) performed a soundness double-check of this branch. Verdict: no exploitable break found; diligence review, not a proof. The calibration below is theirs and we adopt it: the existing test suite demonstrates completeness (honest proofs verify) and single-tamper rejection (200/200 fuzzer, 5/5 python vectors, per-coefficient corruption tests) — neither implies soundness against a prover genuinely trying to cheat. The L0.7/L0.8 gates are tripwires ("protected files were touched"), not soundness checks. This change is not production-safe until a human cryptographer signs off.

Already independently verified

The core forgery reduction, re-derived from scratch: a cheating P = P_honest + Q must satisfy Σ_{z∈D} ê(z)·Q(z) = 0 (round-0 identity, exact) yet needs ê(r₀)·Q(r₀) = 0 to survive the continuation; deg(ê·Q) ≤ 165 ⇒ success ≤ 165/|EF| ≈ 2^−147.
Three-verifier agreement: Rust, the Python spec, and the recursion circuit enforce the same checks with the same constants — critically, the in-circuit (on-chain) verifier carries both load-bearing assertions (round-0 window identity at recursion.py:353; final-value equality at :423; copy_ef = write-once equality). A tighter-Rust/looser-circuit hole was the top concern and is ruled out.
Hard-constraint compliance, byte-level: poseidon1_koalabear_16.rs and lean_prover/src/lib.rs identical to baseline (SECURITY_BITS=124, GRINDING_BITS=16, WHIR folding 7/5, RS reduction 5, all round data); Python WHIR_CONFIGS unchanged. No silent parameter weakening — the single most damaging failure mode for an auto-optimizer, checked first.
The tensor-tail weight class: eq(point,·)·MLE(tail)(·) is degree-1 per variable, in WHIR's existing constraint class; the shift variant's reduction to shift-by-one of eval_eq_with_tail with wraparound was identity-checked.

Where the audit should aim (residual risk is narrow and localized):

The next-with-tail (shift-column) weight across the skip block — matrix_next_mle_folded_with_tail (backend/poly/src/next_mle.rs) and especially the recursion circuit's hand-factored closed form in zkdsl_implem/utils.py, corner term prod_corner·(prod_y_low − eq_low[0]). Status: verified by algebraic identity + honest-input agreement (test_aggregation) + 144/144 numerical cross-check against the naive sum — which establishes the honest value is right, not that no wrong shift value can satisfy the check. That property needs independent re-derivation.
Adversarial soundness tests are missing — recommended pre-audit addition: (a) an invalid trace (AIR-violating witness) plus a best-effort forged combined polynomial P, confirming rejection through the full pipeline; (b) a wrong-shift-value rejection test targeting the tail corner specifically. These convert "no break found by reading" into "these specific break attempts are mechanically rejected."
The front-loading bookkeeping (w_t weights, finished-table halving) and the FD kernel have been re-derived twice each in per-commit reviews; lower priority but the derivations are one-page checks against plan_spec.md if desired.

Discarded mechanisms (investigated, killed or reverted — no residue on the branch)

h8 — univariate skip for WHIR's initial folding sumcheck (implemented fully, discarded at the gate, reverted cleanly)

Applied the same skip to the first 4 of WHIR's 7 initial folding rounds, with code-domain semantics as a 16-ary shard fold (same Merkle leaves) and soundness via BCIKS 2020/654 Thm 1.5/7.2 — correlated agreement for degree-15 parameterized curves, the same conjecture class the deployed gamma-power statement batching already assumes at degree ≈ 190 (security write-up: report/hypothesis_8/security_regime.md). All 5 commits passed independent review; recursion cycles below baseline; mutation fuzzer clean. Discarded on measurement: −1.23 % throughput (vs −5.5 predicted) and +4 KiB proof → net −1.21 % after the 3× size penalty. Root cause (concept, not implementation): compute savings in the memory-bandwidth-bound WHIR region convert to wall-clock at only ~0.4× (the AIR region converts ~1:1), and transcript deltas multiply per WHIR instance in aggregate bundles. Reverted as 00032320..ff7eeb2d; full post-mortem report/iter3_summary.md.

h2 — GKR offload of the Poseidon permutation (killed at planning)

Removing the 84 intermediate Poseidon columns in favor of a 28-layer data-parallel GKR would add +29.5–40 KiB of transcript (28 constant-width 2^22 layers × ~71 EF each) against a +5 KiB budget — structurally irreducible (fusing layers grows bytes as 3^k; round-stacking would require committing the very intermediates being removed). Commit cost would not drop (ν stays 26: 60.2 M cells → 38 M, boundary at 33.5 M). report/hypothesis_2/gkr-poseidon-offload.md.

h3 — vectorized (block-4) memory bus for Poseidon I/O (killed at audit)

Block lookups require 4-aligned operand pointers; a trace audit of the production workload measured 99 % / 67 % / 60 % / 91 % of the four Poseidon operand classes unaligned (the guest allocator is a raw bump pointer; the preamble alone is ≡ 1 mod 4). An alignment retrofit is an allocation-policy change (out of scope) with a 2^22 memory-boundary downside. The protocol side (block lane coexistence, tensor fingerprints, domain separation) was verified feasible and is documented for any future ISA that guarantees alignment. report/hypothesis_3/vectorized-memory-bus.md.

h9 — power-circuits rewiring of the LogUp fraction tree (killed at in-pipeline measurement, zero protocol surface)

Soukhanov 2023/1611's multiplication-via-squares wiring for the GKR quotient tree. Killed in ~3 hours by the in-pipeline kill-gate: the squaring kernel measured 2.55× slower than production inside the real prove (packed quintic squaring costs ~1.7–1.8× of a mul on Zen 4 — ILP-bound; the 0.6× literature assumption does not hold), plus three pre-registered code-fact refutations (the GKR transcript already sends Gruen-elided 2 EF/round; the eq machinery already has Gruen + split-eq; the 7/4 state growth crosses the logup 2^25 boundary). Permanent diagnostic: per-layer GKR cost attribution. report/hypothesis_9/killgate.md.

Triage-killed without implementation

h5 (bottom-GKR-layer skip): bottom-layer sumcheck measured 70–78 ms, but denominators are EF at every round, and the 0.4× region conversion prices it ≈ −0.6 % — below the gate floor.
h6 (Poseidon degree-10→9 output regating): the pre-wiring soundness sketch showed the "small regating" actually requires mode-dependent pull multiplicities (logup-wide change) for ≈ −0.8 % post-h1/h4 — below the gate floor. Mechanism documented as sound for future workload scales; report/hypothesis_6/soundness_sketch.md.
h7 (parallelize acc builds): −0.8 % predicted < 1 % gate threshold by construction.
Memory-argument replacements (Twist/Shout, Lasso, Spice): economics assume MSM commitments (zeros free); under hash-based WHIR the sparse machinery costs orders of magnitude more hashed cells than the entire current bus. report/papers/iter_4/notes.md.

Reviewer reproduction quick-start

# correctness battery (expect L0.7/L0.8 to flag this PR's protected-file diffs — that's this review)
bash harness/leanvm/correctness/correctness.sh

# performance, main vs branch tip
bash harness/leanvm/scripts/eval_paired.sh --baseline 191e1a5b --candidate ff7eeb2d --n 3

# python verifier (executable spec) on freshly dumped vectors
cargo test --release -p lean_prover --lib -- test_zkvm::dump_test_vectors_for_python_verifier --include-ignored
python3 crates/lean_prover/python-verifier/test_verify.py

# the pre-wiring kill-gate benches
cargo test -p sub_protocols --release --test air_sumcheck_skip_kernel -- --ignored --nocapture

Protected-file diff summary (the human-review surface)

File	Δ	Nature
`crates/lean_prover/src/verify_execution.rs`	+44/−40	uniskip verifier call, pinned final-check formula, tailed claims; every original check preserved (bounds, statement-count assert, full-consumption)
`crates/whir/src/open.rs`	+15/−8	tail dispatch in combine_statement; fast paths excluded for tailed statements
`crates/whir/src/verify.rs`	+14/−4	tail factor in weight evaluation; `tail=None` arms expression-identical to main
`crates/sub_protocols/src/air_sumcheck.rs`	13 lines	visibility-only (`pub(crate)`); zero logic change; legacy entry points preserved verbatim

Everything else (fiat-shamir, koala-bear/Poseidon, WHIR configs, stack_polynomials_and_commit, structural AIR methods) is zero-diff — enforced by the Layer 0.5/0.7 guards and verifiable by git diff main...ff7eeb2d --stat.

Limitations & follow-ups

Measured on two machine class (Hetzner AX42U, Zen 4 AVX-512, 16c + M4 Mac Mini 32GB Scaleway); the skip kernel's win is arithmetic-bound transfers, but the FD/Lagrange ratio is ILP-sensitive.
The K = 4 window is compile-time (UNIVARIATE_SKIP_K, single source Rust→Python→zkDSL); a K ∈ {3,5} sweep was analyzed (K = 4 near-optimal with FD; K = 5 is a net loss) but not gate-measured.
The two harness timing methods differ between iteration gates (documented above); the table at the top is one self-consistent paired run.
Region conversion factors (AIR ≈ 1:1, WHIR/GKR ≈ 0.4) and per-layer GKR attribution are recorded in the experiment logs for future optimization pricing.

…r-tail eq/next expansion)

…no-go timing gate) Skip-round prover kernel (Gruen 2024/108 §5) on AirSumcheckSession: compute_skip_poly (generic + degree-split packed kernels, scalar fallback for tables with n-K <= packing width) and process_skip_challenge (Lagrange 2^K->1 block fold + legacy-state fast-forward). Window nodes are read directly from the chunk-bit-reversed storage (block of cube point x = bitreverse_K(x)); extended integer targets via base-field Lagrange dots. Poseidon's low_degree_air path generalizes the existing degree-split: post-block state interpolated from the 2^K window states, low part from the first low_degree*(2^K-1)+1 nodes. GO/NO-GO timing gate (2^18x110 poseidon + 2^20x22 execution, K=4): PASS poseidon16: skip 183.09 ms vs legacy rounds 0..3 311.62 ms execution : skip 51.30 ms vs legacy rounds 0..3 91.17 ms combined : 234.39 ms vs 402.80 ms air_sumcheck.rs: visibility-only changes (protected file, see plan_spec): 10 fields + pivot/in_phase_1/permuted_alphas to pub(crate). No logic. 7 new correctness tests (per-node identities vs brute force, extended-node spot checks, post-skip sum/bare-poly/final-eval equivalence, padding, shift columns, unpacked fallback, block map); existing suites untouched and passing; clippy -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…rove+verify) Protocol (plan_spec.md): all tables join at round 0 with static weights w_t = 2^(n_max - n_t); round 0 sends ONE combined skip polynomial P = sum_t w_t * v'_t IN FULL (the round-0 identity is the weighted window sum, not h(0)+h(1), so no coefficient is elided); finished tables contribute constant c_t * 2^(m-1) folded into coeffs[0] so the verifier's c0 reconstruction is untouched; final combined value = sum_t s_t^final with NO challenge products. Verifier half reuses sumcheck_verify verbatim for the linear rounds. Final identity (pinned by the e2e test, installed in verify_execution by T5): final_target == sum_t e_hat(r0) * eq(eq_factor_t[..n_t-K], natural_prefix_t) * C_t(col_evals_t), natural_prefix_t = reverse(linear_challenges[..n_t-K]). e2e test: 3 real tables (execution n=13 with shift cols + analytic padding, extension_op n=9, poseidon16 n=10 degree-split), brute-forced claims, full prove->verify roundtrip at K in {3,4}; adversarial: tampered skip coefficient rejected at the round-0 window identity, tampered linear-round coefficient rejected at the final identity (c0 elision absorbs per-round wire tampering by construction). No protected files touched; legacy prove_batched_air_sumcheck verbatim.

…skip openings) SparseStatement gains optional tail: Option<Vec<EF>> — inner weight becomes eq(point, hi) * MLE(tail)(lo) on the lowest log2(tail.len()) inner variables (is_next: its shift-by-one). Constructors new_with_tail/new_next_with_tail; inner_num_variables() includes the tail width; tail=None behavior unchanged (enforced by point-append equivalence tests + all existing suites). whir/open.rs, whir/verify.rs: protected files — surgical diffs (fast-path guard + general-branch dispatch in combine_statement; weight-eval tail factor in eval_constraints_poly + defensive debug_assert in verify_constraint_coeffs), see plan_spec invariant 6. open.rs +15 lines, verify.rs +14 lines. Tests (crates/whir/tests/tensor_tail.rs, n=18 production-shaped config): tail==point-append equivalence (eq + next), random-tail naive-cube-sum roundtrip + corruption rejection, mixed statements (dense fast path + plain sparse + tailed eq k=4 + tailed next k=3) in one prove.

CommittedStatements entries become CommittedClaim {point, tail, eq_values, next_values}; AIR opening claims carry the Lagrange-window tensor tail (weight = eq(natural_prefix) x L(r0)); logup claims are tail-less. Prover calls prove_batched_air_sumcheck_uniskip (K=4); verifier checks the round-0 window identity then the pinned final formula target == sum_t e_hat(r0) * eq(bus_point[..n_t-K], natural_prefix_t) * C_t(col_evals_t) and back_loaded_table_contribution is removed (front-loading has no challenge-product k_t). stacked_pcs threads tails into SparseStatement::new_with_tail / new_next_with_tail; statement COUNT unchanged (total_whir_statements assert still passes). verify_execution.rs: protected file - surgical protocol-spec diff (uniskip verifier call + pinned final-check formula + tailed claims), see plan_spec invariant 6 and report/hypothesis_1/security_regime.md. Tests: lean_prover 4/4 (incl. min-size tables), sub_protocols/whir/lean_vm suites green, workspace clippy -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Mirrors the Rust uniskip AIR sumcheck (T1-T5) in the executable spec: - SKIP_K = 4 (must equal sub_protocols::UNIVARIATE_SKIP_K) - verify_air_sumcheck_with_skip: combined coefficient vector (length (2^K-1)*d_max+1) read before sampling r0; round-0 identity sum_{z in D} e_hat(z)*P(z) == sum_t 2^(n_max-n_t)*s_t over the integer window D = {0..2^K-1}; target = e_hat(r0)*P(r0); remaining n_max-K rounds via the unchanged verify_sumcheck - SparseStatements.tail (eq(point) tensor MLE(tail) on the lowest log2(len(tail)) inner variables) + tailed weight evaluation in verify_whir (next_mle_with_tail for shifted statements) - AIR final identity: e_hat(r0) * eq(gkr suffix prefix, natural_prefix) * C_t(col_evals); AIR claims carry the Lagrange-weight tail Validation: all 5 re-dumped vectors verify OK; transcript tamper sweep rejects at every position (skip-region tamper fires 'AIR skip round: weighted window identity failed').

…pilation) In-circuit mirror of the uniskip AIR sumcheck (executable spec: python-verifier verify_air_sumcheck_with_skip, T5 Rust verifier): - front-loaded per-table weights w_t = 2^(n_max-n_t) on the claimed sum - round 0: combined coefficient vector (151 EF) read before r0; window evals P(z) via compile-time z-power tables + dot_product_be; weighted window identity vs Sum_t w_t*s_t; Lagrange weights L_x(r0) with build-time-inverted constant denominators (no in-circuit inversion); target = e_hat(r0)*P(r0) - linear rounds: sumcheck_verify_reversed(n_max - SKIP_K); natural prefix = contiguous slice all_challenges[n_linear-(log_h-K)..] (reversed-storage) - final identity: e_hat(r0) * eq(bus_point[..log_h-K], natural_prefix) * C_t; k_t challenge products removed - WHIR AIR-claim weights: eq(prefix) x MLE(Lagrange tail) on the lowest SKIP_K inner coords; shifted claims via next_mle_with_tail (closed form = scalar evaluation of poly::matrix_next_mle_folded_with_tail, documented in utils.py) - compilation.rs placeholders: SKIP_K, SKIP_Z_POWERS, SKIP_LAGRANGE_C (sourced from sub_protocols::UNIVARIATE_SKIP_K + build-time KoalaBear arithmetic) Bytecode: 213,227 instructions (was 209,762, +1.65%), log_size 18 unchanged, self-referential compile converged on first guess. Recursion bench (--n 2, rate 2): total 3.344s vs 3.503s baseline (-4.5%); aggregation node 219,433 cycles (was 213,207, +2.9% = ~3.1k cycles per verified proof); leaf proofs -4.3%/-6.1% prove time; proof sizes +2-3 KiB (the 151-EF skip vector). Validation: test_recursive_aggregation_extension_table_bigger_than_execution passes; all package suites green; python verifier 5/5 vectors.

Replaces the Lagrange-coefficient dot products that extend each column's 2^K window values to the extended consecutive-integer nodes with Newton forward-difference cascades, at all three interpolation sites of the skip kernels: column values (in place on the gathered window buffer), the degree-split cached state (lockstep cascade), and the low-part accumulator (width-1 cascade over its n_low values). (2^K-1) packed adds per column per node instead of 2^K packed Montgomery muls; exact field arithmetic, outputs bit-identical to the Lagrange path (pinned by test_fd_extension_matches_lagrange across all kernels, padding, and K in {3,4,5}, plus the unchanged iter-1 suites). Lagrange kernels kept as private _lagrange reference functions behind the doc(hidden) compute_skip_poly_forced test hook. Prover-only; zero protocol/transcript change; no protected files touched. h4 kill-gate bench (this machine, --ignored skip_kernel_timing): poseidon16 2^18x110: skip-poly FD 103.4 ms vs Lagrange 170.3 ms execution 2^20x22 : skip-poly FD 22.5 ms vs Lagrange 42.1 ms h4 gate: FD 125.9 ms vs Lagrange 212.4 ms -> PASS (-40.7%) full skip path 150.2 ms vs legacy rounds 408.2 ms -> PASS

Barnadrot and others added 9 commits June 11, 2026 11:06

pw13-1 T1: backend primitives for univariate skip (skip domain, tenso…

b8282cb

…r-tail eq/next expansion)

style: cargo fmt

eaa5d3d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)#251

Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)#251
Barnadrot wants to merge 9 commits into
leanEthereum:mainfrom
Barnadrot:pw13-clean

Barnadrot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Barnadrot commented Jun 11, 2026

Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)

Introduction — the mechanisms in one paragraph each

Results vs main (191e1a5)

Validation methods used by the harness (for reviewers)

Mechanism 1 (h1) — soundness in detail

Mechanism 2 (h4) — soundness in detail

What needs human audit (and what has already been independently checked)

Discarded mechanisms (investigated, killed or reverted — no residue on the branch)

Reviewer reproduction quick-start

Protected-file diff summary (the human-review surface)

Limitations & follow-ups

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Results vs `main` (`191e1a5`)