Skip to content

Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)#251

Open
Barnadrot wants to merge 9 commits into
leanEthereum:mainfrom
Barnadrot:pw13-clean
Open

Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)#251
Barnadrot wants to merge 9 commits into
leanEthereum:mainfrom
Barnadrot:pw13-clean

Conversation

@Barnadrot

Copy link
Copy Markdown
Contributor

Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)

Branch pw13-clean Two kept mechanisms, both targeting the AIR-constraint-evaluation cost center identified by profiling (the batched AIR sumcheck was 416 ms = 20.7 % of a 2.01 s prove on the reference workload).

Introduction — the mechanisms in one paragraph each

Mechanism 1 — univariate skip for the batched AIR sumcheck (h1, commits T1–T7). The AIR zerocheck paid extension-field constraint evaluations from round 1 onward (measured: rounds ≥ 1 = 83 % of the sumcheck's cost; EF mult ≈ 15× base). We replace the first K = 4 rounds with ONE univariate round over the 16-point integer window {0..15} (Gruen, eprint 2024/108 §5): the prover evaluates all tables' constraints at 31 nodes entirely in the base field with AVX-512 packing, sends a single combined coefficient vector, and binds 4 variables with one challenge. Table batching switches from back-loaded to front-loaded (static weights w_t = 2^(n_max−n_t); no challenge products), and the WHIR opening claims become tensor-weight statements (Lagrange₁₆(r₀) ⊗ eq) via a new optional tail on SparseStatement. The extension-field round tail shrinks 16×.

Mechanism 2 — finite-difference extension kernel (h4, commit U1). Inside the skip round, extending each column's 16 window values to the 135 extended integer nodes was done with per-node Lagrange dots (16 packed Montgomery muls per column per node). Newton forward-difference cascades replace this with 15 packed adds per column per node (exact field arithmetic — proofs are bit-identical, this is kernel-only). Applied at all three interpolation sites (columns, degree-split cached state, low-evals accumulator).

Results vs main (191e1a5)

All rows from ONE self-consistent paired run of main vs branch tip by eval_paired.sh (current harness method: arena-enabled prove_loop, env preflight, A-vs-A noise floor 0.43 %, 3 counterbalanced rounds × 4 warm proofs each side, Welch's t-test), 2026-06-11.

Metric main (191e1a5) branch (ff7eeb2) Δ
Native prover (prove_loop warm avg, 1550 sigs) 2.0905 s 1.8435 s −11.82 % (t = −75.1, p ≈ 0)
— steady-state confirmation (250 proofs/side, last 50) 2.1454 s 1.8993 s −11.47 % ("improved")
XMSS throughput (1550 / warm avg) 741 XMSS/s 841 XMSS/s +13.4 %
Recursion (recursion --n 2 --log-inv-rate 2, total) 3.521 s 3.298 s −6.33 %
Recursion aggregation-node cycles 219,433 218,819 −614 (in-circuit verification got cheaper)
Proof size (warm, KiB) 492 492 +0.0 % (h1's +2.1 KiB transcript is below integer-KiB resolution; h4 is bit-identical)
Recursion bytecode 209,762 instr (log_size 18) 213,227 instr (log_size 18) +1.65 %, no power-of-2 crossing

Per-iteration gate verdicts when each mechanism landed (different intermediate baselines): h1 −6.66 % (p ≈ 0, steady-state −5.68 %, recursion −4.08 %; pre-fix harness timing method), h4 −3.10 % (p = 2.9e-05, recursion −3.25 %, proofs bit-identical; current method). The table above is the direct A–B comparison and is the defensible headline.

Measurements on M4 Mac Mini 32 GB

Metric main (ec1f090) pw13-clean (0bbc8cc) Δ
Native prover (lean-multisig xmss, 1550 sigs, 3-run avg) 2.210 s 1.910 s −13.59 %
XMSS throughput (1550 / avg) 701 XMSS/s 812 XMSS/s +15.7 %
Recursion (recursion --n 2 --log-inv-rate 2, total) 3.808 s 3.579 s −6.01 %
Recursion aggregation-node cycles 213,207 219,433 +6,226 (+2.9%, in-circuit verification slightly more expensive)
Proof size (JSON format, KiB) 344 344 +0.0 %
Recursion bytecode 209,762 instr (log_size 18) 213,227 instr (log_size 18) +1.65 %, no power-of-2 crossing

Validation methods used by the harness (for reviewers)

  1. Correctness gate (harness/leanvm/correctness/correctness.sh): Layer 0 test-file integrity (SHA-pinned), 0.5 crypto-parameter guard, 1 clippy -D warnings, 2 field/backend tests, 3 structural invariants (column counts, bus widths, logup overflow, WHIR config Rust↔Python consistency), 4 WHIR roundtrips, 5 aggregation e2e (test_aggregation: real XMSS proofs + in-circuit verification), 6 free-variable soundness (soundness_check: every committed/virtual Poseidon column constrained), 7 transcript-mutation fuzzer (200 random mutations → all must be rejected; run at 3 seeds across the iterations, 0 accepted total). Layers 0.7/0.8 (protected-file diff hygiene + frozen-main differential verification) fail by design for any transcript-shape change — this PR is the human-review path those layers prescribe.
  2. Python verifier as executable spec (crates/lean_prover/python-verifier/): re-dumped vectors (5 shapes spanning table-height mixes and rates 1–4) all verify; independent tamper sweeps (92 + 7 positions) all rejected, each at the expected assertion.
  3. Performance gate (harness/leanvm/scripts/eval_paired.sh): env preflight (governor/load), A-vs-A noise floor (≤ 0.3 %), N=3 counterbalanced rounds × 4 warm proofs, Welch's t-test (keep iff Δ ≤ −1 %, p < 0.01), recursion-regression ceiling (+3 %), proof-size ceiling (+20 %) with 3× net penalty; auto-chained 250-proof steady-state confirmation.
  4. Pre-wiring kill-gates: each mechanism was benchmarked against the exact code it replaces before any protocol wiring (h1: skip kernel 234 ms vs legacy rounds 403 ms; h4: FD 125.9 ms vs Lagrange 212.4 ms), with bit-identity or brute-force-reference equality tests pinning correctness independently of the protocol path.
  5. Per-commit independent review: every commit was reviewed by an independent agent against the written plan spec (plan_spec.md), including full re-derivations of the front-loading bookkeeping, the FD anchor invariant, the closed-form next_mle_with_tail (numerically confirmed 144/144), and the leaf/tensor index conventions.

Mechanism 1 (h1) — soundness in detail

Security regime (unchanged): 124 bits, Johnson bound. |EF| = p⁵ ≈ 2^154.6 (KoalaBear quintic, X⁵+X²−1). WHIR parameters, RS domains, query counts, grinding bits, Merkle arity, WHIR_CONFIGS, and the Poseidon permutation are bit-identical to main. Full write-up: report/hypothesis_1/security_regime.md.

What changes, protocol-wise. The batched AIR sumcheck proves Σ_t Σ_{x∈H_{n_t}} eq(gkr_pt_t, x)·C_t(cols_t(x)) = Σ_t S_t, where S_t are the per-table bus claims the verifier computes itself from the LogUp-GKR outputs (verify_execution.rs:113–132, unchanged arithmetic).

  1. Front-loaded batching. Every table joins at round 0 with static weight w_t = 2^(n_max−n_t); a finished table contributes claim/2 per remaining round, folded into the constant coefficient. The verifier's per-round identity h(0)+h(1) = target and its c₀-reconstruction are untouched; the final combined value is Σ_t s_t^final with no challenge products (replaces back_loaded_table_contribution). Soundness argument is identical to the back-loaded scheme — the per-table multiplier is a monomial in the unused variables; per-round Schwartz–Zippel error unchanged at ≤ 12/|EF|.
  2. The skip round (Gruen 2024/108 §5–6, soundness A.4). The first K=4 variables of every table (the low row-index bits; their eq coordinates are the last K entries of gkr_point — identical across tables since all eq factors are suffixes of it) are bound by one univariate round. The prover sends the combined polynomial v′(X) = Σ_t w_t·v′t(X), deg ≤ (2^K−1)·d_max = 150, as 151 coefficients. The verifier checks Σ{z∈D} ê(z)·v′(z) = Σ_t w_t·S_t where ê is the degree-15 interpolation of eq(gkr_top, bits(·)) over D — an exact identity — then samples r₀ ∈ EF and continues with target ê(r₀)·v′(r₀). Cheating error for the round: ≤ deg(v′)/|EF| ≈ 150/2^154.6 ≈ 2^−147.4. Per-table claims stay α-separated exactly as on main (the air_alpha powers are disjoint per table); sending the combined polynomial is the same compression every other round already uses.
  3. Remaining rounds (n_max−K, degree ≤ 12): protocol-identical to main, strictly fewer rounds.
  4. Opening claims. Each table's columns are opened at the tensor point (Lagrange₁₆(r₀) on the last 4 coordinates) ⊗ eq(prefix challenges). WHIR's SparseStatement gains an optional tail: Vec<EF>; the weight is eq(point) ⊗ MLE(tail) — still a public, verifier-evaluable multilinear weight, i.e. exactly the statement class WHIR already proves (constrained-RS codes with arbitrary public weights; Arnon–Chiesa–Fenzi–Yogev 2024/1586). tail = None paths are bit-identical to main (enforced by behavioral-equivalence tests: a tailed statement with tail = eq-expansion of 4 challenges is byte-equivalent to the plain statement with those challenges appended). Shift("next") statements use the shift-by-one of the tail-seeded eq vector, validated against Σ_x tail[x]·next_mle identities.
  5. Fiat–Shamir. Skip coefficients are absorbed before r₀ is squeezed (same absorb→sample discipline as every round, same duplex sponge). Transcript grows ≈ +107 EF (+0.6 %); measured proof-size delta at the gate: 0.0 KiB at integer-KiB resolution.
  6. Recursion. The zkDSL circuit verifies the identical protocol (compiled from the same identities; Lagrange weights via compile-time-inverted denominators — no in-circuit inversions anywhere). Bytecode +1.65 %, log_size 18 unchanged; aggregation-node cycles went down (−614) because one grinding event replaces four and three sumcheck rounds disappear.

Added soundness error, total: < 2^−146 against a 2^−124 budget — absorbed without moving the regime. The mutation fuzzer and the python tamper sweeps exercise the new verifier checks directly (window-identity rejection observed at the exact assertion).

Mechanism 2 (h4) — soundness in detail

There is no protocol surface: the FD cascade computes the same field elements the Lagrange dots computed (Newton forward differences on unit-spaced nodes; exact arithmetic in odd characteristic; the division-free advance is add-only). Equality is enforced, not assumed: test_fd_extension_matches_lagrange pins bit-identical full coefficient vectors across the generic, degree-split, and unpacked kernels, the padding path, and K ∈ {3,4,5} (10 configurations), and the entire iter-1 test battery (window-sum identities vs brute-force cube sums, post-skip state vs naive references, e2e roundtrips, production prove+verify) runs unmodified on the FD path. The independent review re-derived the anchor invariant (row_i = Δ^{n−1−i}p(i), ascending advance) and the degree-split lockstep node accounting. Proofs produced before/after the commit are byte-identical; the verifier is untouched.

What needs human audit (and what has already been independently checked)

An independent reviewer (Opus-class agent, read-level + hand-derivation) performed a soundness double-check of this branch. Verdict: no exploitable break found; diligence review, not a proof. The calibration below is theirs and we adopt it: the existing test suite demonstrates completeness (honest proofs verify) and single-tamper rejection (200/200 fuzzer, 5/5 python vectors, per-coefficient corruption tests) — neither implies soundness against a prover genuinely trying to cheat. The L0.7/L0.8 gates are tripwires ("protected files were touched"), not soundness checks. This change is not production-safe until a human cryptographer signs off.

Already independently verified

  • The core forgery reduction, re-derived from scratch: a cheating P = P_honest + Q must satisfy Σ_{z∈D} ê(z)·Q(z) = 0 (round-0 identity, exact) yet needs ê(r₀)·Q(r₀) = 0 to survive the continuation; deg(ê·Q) ≤ 165 ⇒ success ≤ 165/|EF| ≈ 2^−147.
  • Three-verifier agreement: Rust, the Python spec, and the recursion circuit enforce the same checks with the same constants — critically, the in-circuit (on-chain) verifier carries both load-bearing assertions (round-0 window identity at recursion.py:353; final-value equality at :423; copy_ef = write-once equality). A tighter-Rust/looser-circuit hole was the top concern and is ruled out.
  • Hard-constraint compliance, byte-level: poseidon1_koalabear_16.rs and lean_prover/src/lib.rs identical to baseline (SECURITY_BITS=124, GRINDING_BITS=16, WHIR folding 7/5, RS reduction 5, all round data); Python WHIR_CONFIGS unchanged. No silent parameter weakening — the single most damaging failure mode for an auto-optimizer, checked first.
  • The tensor-tail weight class: eq(point,·)·MLE(tail)(·) is degree-1 per variable, in WHIR's existing constraint class; the shift variant's reduction to shift-by-one of eval_eq_with_tail with wraparound was identity-checked.

Where the audit should aim (residual risk is narrow and localized):

  1. The next-with-tail (shift-column) weight across the skip blockmatrix_next_mle_folded_with_tail (backend/poly/src/next_mle.rs) and especially the recursion circuit's hand-factored closed form in zkdsl_implem/utils.py, corner term prod_corner·(prod_y_low − eq_low[0]). Status: verified by algebraic identity + honest-input agreement (test_aggregation) + 144/144 numerical cross-check against the naive sum — which establishes the honest value is right, not that no wrong shift value can satisfy the check. That property needs independent re-derivation.
  2. Adversarial soundness tests are missing — recommended pre-audit addition: (a) an invalid trace (AIR-violating witness) plus a best-effort forged combined polynomial P, confirming rejection through the full pipeline; (b) a wrong-shift-value rejection test targeting the tail corner specifically. These convert "no break found by reading" into "these specific break attempts are mechanically rejected."
  3. The front-loading bookkeeping (w_t weights, finished-table halving) and the FD kernel have been re-derived twice each in per-commit reviews; lower priority but the derivations are one-page checks against plan_spec.md if desired.

Discarded mechanisms (investigated, killed or reverted — no residue on the branch)

h8 — univariate skip for WHIR's initial folding sumcheck (implemented fully, discarded at the gate, reverted cleanly)

Applied the same skip to the first 4 of WHIR's 7 initial folding rounds, with code-domain semantics as a 16-ary shard fold (same Merkle leaves) and soundness via BCIKS 2020/654 Thm 1.5/7.2 — correlated agreement for degree-15 parameterized curves, the same conjecture class the deployed gamma-power statement batching already assumes at degree ≈ 190 (security write-up: report/hypothesis_8/security_regime.md). All 5 commits passed independent review; recursion cycles below baseline; mutation fuzzer clean. Discarded on measurement: −1.23 % throughput (vs −5.5 predicted) and +4 KiB proof → net −1.21 % after the 3× size penalty. Root cause (concept, not implementation): compute savings in the memory-bandwidth-bound WHIR region convert to wall-clock at only ~0.4× (the AIR region converts ~1:1), and transcript deltas multiply per WHIR instance in aggregate bundles. Reverted as 00032320..ff7eeb2d; full post-mortem report/iter3_summary.md.

h2 — GKR offload of the Poseidon permutation (killed at planning)

Removing the 84 intermediate Poseidon columns in favor of a 28-layer data-parallel GKR would add +29.5–40 KiB of transcript (28 constant-width 2^22 layers × ~71 EF each) against a +5 KiB budget — structurally irreducible (fusing layers grows bytes as 3^k; round-stacking would require committing the very intermediates being removed). Commit cost would not drop (ν stays 26: 60.2 M cells → 38 M, boundary at 33.5 M). report/hypothesis_2/gkr-poseidon-offload.md.

h3 — vectorized (block-4) memory bus for Poseidon I/O (killed at audit)

Block lookups require 4-aligned operand pointers; a trace audit of the production workload measured 99 % / 67 % / 60 % / 91 % of the four Poseidon operand classes unaligned (the guest allocator is a raw bump pointer; the preamble alone is ≡ 1 mod 4). An alignment retrofit is an allocation-policy change (out of scope) with a 2^22 memory-boundary downside. The protocol side (block lane coexistence, tensor fingerprints, domain separation) was verified feasible and is documented for any future ISA that guarantees alignment. report/hypothesis_3/vectorized-memory-bus.md.

h9 — power-circuits rewiring of the LogUp fraction tree (killed at in-pipeline measurement, zero protocol surface)

Soukhanov 2023/1611's multiplication-via-squares wiring for the GKR quotient tree. Killed in ~3 hours by the in-pipeline kill-gate: the squaring kernel measured 2.55× slower than production inside the real prove (packed quintic squaring costs ~1.7–1.8× of a mul on Zen 4 — ILP-bound; the 0.6× literature assumption does not hold), plus three pre-registered code-fact refutations (the GKR transcript already sends Gruen-elided 2 EF/round; the eq machinery already has Gruen + split-eq; the 7/4 state growth crosses the logup 2^25 boundary). Permanent diagnostic: per-layer GKR cost attribution. report/hypothesis_9/killgate.md.

Triage-killed without implementation
  • h5 (bottom-GKR-layer skip): bottom-layer sumcheck measured 70–78 ms, but denominators are EF at every round, and the 0.4× region conversion prices it ≈ −0.6 % — below the gate floor.
  • h6 (Poseidon degree-10→9 output regating): the pre-wiring soundness sketch showed the "small regating" actually requires mode-dependent pull multiplicities (logup-wide change) for ≈ −0.8 % post-h1/h4 — below the gate floor. Mechanism documented as sound for future workload scales; report/hypothesis_6/soundness_sketch.md.
  • h7 (parallelize acc builds): −0.8 % predicted < 1 % gate threshold by construction.
  • Memory-argument replacements (Twist/Shout, Lasso, Spice): economics assume MSM commitments (zeros free); under hash-based WHIR the sparse machinery costs orders of magnitude more hashed cells than the entire current bus. report/papers/iter_4/notes.md.

Reviewer reproduction quick-start

# correctness battery (expect L0.7/L0.8 to flag this PR's protected-file diffs — that's this review)
bash harness/leanvm/correctness/correctness.sh

# performance, main vs branch tip
bash harness/leanvm/scripts/eval_paired.sh --baseline 191e1a5b --candidate ff7eeb2d --n 3

# python verifier (executable spec) on freshly dumped vectors
cargo test --release -p lean_prover --lib -- test_zkvm::dump_test_vectors_for_python_verifier --include-ignored
python3 crates/lean_prover/python-verifier/test_verify.py

# the pre-wiring kill-gate benches
cargo test -p sub_protocols --release --test air_sumcheck_skip_kernel -- --ignored --nocapture

Protected-file diff summary (the human-review surface)

File Δ Nature
crates/lean_prover/src/verify_execution.rs +44/−40 uniskip verifier call, pinned final-check formula, tailed claims; every original check preserved (bounds, statement-count assert, full-consumption)
crates/whir/src/open.rs +15/−8 tail dispatch in combine_statement; fast paths excluded for tailed statements
crates/whir/src/verify.rs +14/−4 tail factor in weight evaluation; tail=None arms expression-identical to main
crates/sub_protocols/src/air_sumcheck.rs 13 lines visibility-only (pub(crate)); zero logic change; legacy entry points preserved verbatim

Everything else (fiat-shamir, koala-bear/Poseidon, WHIR configs, stack_polynomials_and_commit, structural AIR methods) is zero-diff — enforced by the Layer 0.5/0.7 guards and verifiable by git diff main...ff7eeb2d --stat.

Limitations & follow-ups

  • Measured on two machine class (Hetzner AX42U, Zen 4 AVX-512, 16c + M4 Mac Mini 32GB Scaleway); the skip kernel's win is arithmetic-bound transfers, but the FD/Lagrange ratio is ILP-sensitive.
  • The K = 4 window is compile-time (UNIVARIATE_SKIP_K, single source Rust→Python→zkDSL); a K ∈ {3,5} sweep was analyzed (K = 4 near-optimal with FD; K = 5 is a net loss) but not gate-measured.
  • The two harness timing methods differ between iteration gates (documented above); the table at the top is one self-consistent paired run.
  • Region conversion factors (AIR ≈ 1:1, WHIR/GKR ≈ 0.4) and per-layer GKR attribution are recorded in the experiment logs for future optimization pricing.

Barnadrot and others added 9 commits June 11, 2026 11:06
…no-go timing gate)

Skip-round prover kernel (Gruen 2024/108 §5) on AirSumcheckSession:
compute_skip_poly (generic + degree-split packed kernels, scalar fallback
for tables with n-K <= packing width) and process_skip_challenge
(Lagrange 2^K->1 block fold + legacy-state fast-forward). Window nodes
are read directly from the chunk-bit-reversed storage (block of cube
point x = bitreverse_K(x)); extended integer targets via base-field
Lagrange dots. Poseidon's low_degree_air path generalizes the existing
degree-split: post-block state interpolated from the 2^K window states,
low part from the first low_degree*(2^K-1)+1 nodes.

GO/NO-GO timing gate (2^18x110 poseidon + 2^20x22 execution, K=4): PASS
  poseidon16: skip 183.09 ms vs legacy rounds 0..3 311.62 ms
  execution : skip  51.30 ms vs legacy rounds 0..3  91.17 ms
  combined  : 234.39 ms vs 402.80 ms

air_sumcheck.rs: visibility-only changes (protected file, see plan_spec):
10 fields + pivot/in_phase_1/permuted_alphas to pub(crate). No logic.

7 new correctness tests (per-node identities vs brute force, extended-node
spot checks, post-skip sum/bare-poly/final-eval equivalence, padding,
shift columns, unpacked fallback, block map); existing suites untouched
and passing; clippy -D warnings clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rove+verify)

Protocol (plan_spec.md): all tables join at round 0 with static weights
w_t = 2^(n_max - n_t); round 0 sends ONE combined skip polynomial
P = sum_t w_t * v'_t IN FULL (the round-0 identity is the weighted window
sum, not h(0)+h(1), so no coefficient is elided); finished tables
contribute constant c_t * 2^(m-1) folded into coeffs[0] so the verifier's
c0 reconstruction is untouched; final combined value = sum_t s_t^final
with NO challenge products.

Verifier half reuses sumcheck_verify verbatim for the linear rounds.
Final identity (pinned by the e2e test, installed in verify_execution by T5):
  final_target == sum_t e_hat(r0) * eq(eq_factor_t[..n_t-K], natural_prefix_t)
                        * C_t(col_evals_t),
  natural_prefix_t = reverse(linear_challenges[..n_t-K]).

e2e test: 3 real tables (execution n=13 with shift cols + analytic padding,
extension_op n=9, poseidon16 n=10 degree-split), brute-forced claims,
full prove->verify roundtrip at K in {3,4}; adversarial: tampered skip
coefficient rejected at the round-0 window identity, tampered linear-round
coefficient rejected at the final identity (c0 elision absorbs per-round
wire tampering by construction).

No protected files touched; legacy prove_batched_air_sumcheck verbatim.
…skip openings)

SparseStatement gains optional tail: Option<Vec<EF>> — inner weight becomes
eq(point, hi) * MLE(tail)(lo) on the lowest log2(tail.len()) inner variables
(is_next: its shift-by-one). Constructors new_with_tail/new_next_with_tail;
inner_num_variables() includes the tail width; tail=None behavior unchanged
(enforced by point-append equivalence tests + all existing suites).

whir/open.rs, whir/verify.rs: protected files — surgical diffs (fast-path
guard + general-branch dispatch in combine_statement; weight-eval tail factor
in eval_constraints_poly + defensive debug_assert in verify_constraint_coeffs),
see plan_spec invariant 6. open.rs +15 lines, verify.rs +14 lines.

Tests (crates/whir/tests/tensor_tail.rs, n=18 production-shaped config):
tail==point-append equivalence (eq + next), random-tail naive-cube-sum
roundtrip + corruption rejection, mixed statements (dense fast path + plain
sparse + tailed eq k=4 + tailed next k=3) in one prove.
CommittedStatements entries become CommittedClaim {point, tail, eq_values,
next_values}; AIR opening claims carry the Lagrange-window tensor tail
(weight = eq(natural_prefix) x L(r0)); logup claims are tail-less.
Prover calls prove_batched_air_sumcheck_uniskip (K=4); verifier checks the
round-0 window identity then the pinned final formula
  target == sum_t e_hat(r0) * eq(bus_point[..n_t-K], natural_prefix_t) * C_t(col_evals_t)
and back_loaded_table_contribution is removed (front-loading has no
challenge-product k_t). stacked_pcs threads tails into
SparseStatement::new_with_tail / new_next_with_tail; statement COUNT
unchanged (total_whir_statements assert still passes).

verify_execution.rs: protected file - surgical protocol-spec diff (uniskip
verifier call + pinned final-check formula + tailed claims), see plan_spec
invariant 6 and report/hypothesis_1/security_regime.md.

Tests: lean_prover 4/4 (incl. min-size tables), sub_protocols/whir/lean_vm
suites green, workspace clippy -D warnings clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mirrors the Rust uniskip AIR sumcheck (T1-T5) in the executable spec:
- SKIP_K = 4 (must equal sub_protocols::UNIVARIATE_SKIP_K)
- verify_air_sumcheck_with_skip: combined coefficient vector (length
  (2^K-1)*d_max+1) read before sampling r0; round-0 identity
  sum_{z in D} e_hat(z)*P(z) == sum_t 2^(n_max-n_t)*s_t over the integer
  window D = {0..2^K-1}; target = e_hat(r0)*P(r0); remaining n_max-K
  rounds via the unchanged verify_sumcheck
- SparseStatements.tail (eq(point) tensor MLE(tail) on the lowest
  log2(len(tail)) inner variables) + tailed weight evaluation in
  verify_whir (next_mle_with_tail for shifted statements)
- AIR final identity: e_hat(r0) * eq(gkr suffix prefix, natural_prefix)
  * C_t(col_evals); AIR claims carry the Lagrange-weight tail

Validation: all 5 re-dumped vectors verify OK; transcript tamper sweep
rejects at every position (skip-region tamper fires 'AIR skip round:
weighted window identity failed').
…pilation)

In-circuit mirror of the uniskip AIR sumcheck (executable spec: python-verifier
verify_air_sumcheck_with_skip, T5 Rust verifier):
- front-loaded per-table weights w_t = 2^(n_max-n_t) on the claimed sum
- round 0: combined coefficient vector (151 EF) read before r0; window evals
  P(z) via compile-time z-power tables + dot_product_be; weighted window
  identity vs Sum_t w_t*s_t; Lagrange weights L_x(r0) with build-time-inverted
  constant denominators (no in-circuit inversion); target = e_hat(r0)*P(r0)
- linear rounds: sumcheck_verify_reversed(n_max - SKIP_K); natural prefix =
  contiguous slice all_challenges[n_linear-(log_h-K)..] (reversed-storage)
- final identity: e_hat(r0) * eq(bus_point[..log_h-K], natural_prefix) * C_t;
  k_t challenge products removed
- WHIR AIR-claim weights: eq(prefix) x MLE(Lagrange tail) on the lowest SKIP_K
  inner coords; shifted claims via next_mle_with_tail (closed form = scalar
  evaluation of poly::matrix_next_mle_folded_with_tail, documented in utils.py)
- compilation.rs placeholders: SKIP_K, SKIP_Z_POWERS, SKIP_LAGRANGE_C (sourced
  from sub_protocols::UNIVARIATE_SKIP_K + build-time KoalaBear arithmetic)

Bytecode: 213,227 instructions (was 209,762, +1.65%), log_size 18 unchanged,
self-referential compile converged on first guess. Recursion bench (--n 2,
rate 2): total 3.344s vs 3.503s baseline (-4.5%); aggregation node 219,433
cycles (was 213,207, +2.9% = ~3.1k cycles per verified proof); leaf proofs
-4.3%/-6.1% prove time; proof sizes +2-3 KiB (the 151-EF skip vector).
Validation: test_recursive_aggregation_extension_table_bigger_than_execution
passes; all package suites green; python verifier 5/5 vectors.
Replaces the Lagrange-coefficient dot products that extend each column's
2^K window values to the extended consecutive-integer nodes with Newton
forward-difference cascades, at all three interpolation sites of the
skip kernels: column values (in place on the gathered window buffer),
the degree-split cached state (lockstep cascade), and the low-part
accumulator (width-1 cascade over its n_low values). (2^K-1) packed adds
per column per node instead of 2^K packed Montgomery muls; exact field
arithmetic, outputs bit-identical to the Lagrange path (pinned by
test_fd_extension_matches_lagrange across all kernels, padding, and
K in {3,4,5}, plus the unchanged iter-1 suites).

Lagrange kernels kept as private _lagrange reference functions behind
the doc(hidden) compute_skip_poly_forced test hook. Prover-only; zero
protocol/transcript change; no protected files touched.

h4 kill-gate bench (this machine, --ignored skip_kernel_timing):
  poseidon16 2^18x110: skip-poly FD 103.4 ms vs Lagrange 170.3 ms
  execution  2^20x22 : skip-poly FD  22.5 ms vs Lagrange  42.1 ms
  h4 gate: FD 125.9 ms vs Lagrange 212.4 ms -> PASS (-40.7%)
  full skip path 150.2 ms vs legacy rounds 408.2 ms -> PASS
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant