Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)#251
Open
Barnadrot wants to merge 9 commits into
Open
Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)#251Barnadrot wants to merge 9 commits into
Barnadrot wants to merge 9 commits into
Conversation
…r-tail eq/next expansion)
…no-go timing gate) Skip-round prover kernel (Gruen 2024/108 §5) on AirSumcheckSession: compute_skip_poly (generic + degree-split packed kernels, scalar fallback for tables with n-K <= packing width) and process_skip_challenge (Lagrange 2^K->1 block fold + legacy-state fast-forward). Window nodes are read directly from the chunk-bit-reversed storage (block of cube point x = bitreverse_K(x)); extended integer targets via base-field Lagrange dots. Poseidon's low_degree_air path generalizes the existing degree-split: post-block state interpolated from the 2^K window states, low part from the first low_degree*(2^K-1)+1 nodes. GO/NO-GO timing gate (2^18x110 poseidon + 2^20x22 execution, K=4): PASS poseidon16: skip 183.09 ms vs legacy rounds 0..3 311.62 ms execution : skip 51.30 ms vs legacy rounds 0..3 91.17 ms combined : 234.39 ms vs 402.80 ms air_sumcheck.rs: visibility-only changes (protected file, see plan_spec): 10 fields + pivot/in_phase_1/permuted_alphas to pub(crate). No logic. 7 new correctness tests (per-node identities vs brute force, extended-node spot checks, post-skip sum/bare-poly/final-eval equivalence, padding, shift columns, unpacked fallback, block map); existing suites untouched and passing; clippy -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rove+verify)
Protocol (plan_spec.md): all tables join at round 0 with static weights
w_t = 2^(n_max - n_t); round 0 sends ONE combined skip polynomial
P = sum_t w_t * v'_t IN FULL (the round-0 identity is the weighted window
sum, not h(0)+h(1), so no coefficient is elided); finished tables
contribute constant c_t * 2^(m-1) folded into coeffs[0] so the verifier's
c0 reconstruction is untouched; final combined value = sum_t s_t^final
with NO challenge products.
Verifier half reuses sumcheck_verify verbatim for the linear rounds.
Final identity (pinned by the e2e test, installed in verify_execution by T5):
final_target == sum_t e_hat(r0) * eq(eq_factor_t[..n_t-K], natural_prefix_t)
* C_t(col_evals_t),
natural_prefix_t = reverse(linear_challenges[..n_t-K]).
e2e test: 3 real tables (execution n=13 with shift cols + analytic padding,
extension_op n=9, poseidon16 n=10 degree-split), brute-forced claims,
full prove->verify roundtrip at K in {3,4}; adversarial: tampered skip
coefficient rejected at the round-0 window identity, tampered linear-round
coefficient rejected at the final identity (c0 elision absorbs per-round
wire tampering by construction).
No protected files touched; legacy prove_batched_air_sumcheck verbatim.
…skip openings) SparseStatement gains optional tail: Option<Vec<EF>> — inner weight becomes eq(point, hi) * MLE(tail)(lo) on the lowest log2(tail.len()) inner variables (is_next: its shift-by-one). Constructors new_with_tail/new_next_with_tail; inner_num_variables() includes the tail width; tail=None behavior unchanged (enforced by point-append equivalence tests + all existing suites). whir/open.rs, whir/verify.rs: protected files — surgical diffs (fast-path guard + general-branch dispatch in combine_statement; weight-eval tail factor in eval_constraints_poly + defensive debug_assert in verify_constraint_coeffs), see plan_spec invariant 6. open.rs +15 lines, verify.rs +14 lines. Tests (crates/whir/tests/tensor_tail.rs, n=18 production-shaped config): tail==point-append equivalence (eq + next), random-tail naive-cube-sum roundtrip + corruption rejection, mixed statements (dense fast path + plain sparse + tailed eq k=4 + tailed next k=3) in one prove.
CommittedStatements entries become CommittedClaim {point, tail, eq_values,
next_values}; AIR opening claims carry the Lagrange-window tensor tail
(weight = eq(natural_prefix) x L(r0)); logup claims are tail-less.
Prover calls prove_batched_air_sumcheck_uniskip (K=4); verifier checks the
round-0 window identity then the pinned final formula
target == sum_t e_hat(r0) * eq(bus_point[..n_t-K], natural_prefix_t) * C_t(col_evals_t)
and back_loaded_table_contribution is removed (front-loading has no
challenge-product k_t). stacked_pcs threads tails into
SparseStatement::new_with_tail / new_next_with_tail; statement COUNT
unchanged (total_whir_statements assert still passes).
verify_execution.rs: protected file - surgical protocol-spec diff (uniskip
verifier call + pinned final-check formula + tailed claims), see plan_spec
invariant 6 and report/hypothesis_1/security_regime.md.
Tests: lean_prover 4/4 (incl. min-size tables), sub_protocols/whir/lean_vm
suites green, workspace clippy -D warnings clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mirrors the Rust uniskip AIR sumcheck (T1-T5) in the executable spec:
- SKIP_K = 4 (must equal sub_protocols::UNIVARIATE_SKIP_K)
- verify_air_sumcheck_with_skip: combined coefficient vector (length
(2^K-1)*d_max+1) read before sampling r0; round-0 identity
sum_{z in D} e_hat(z)*P(z) == sum_t 2^(n_max-n_t)*s_t over the integer
window D = {0..2^K-1}; target = e_hat(r0)*P(r0); remaining n_max-K
rounds via the unchanged verify_sumcheck
- SparseStatements.tail (eq(point) tensor MLE(tail) on the lowest
log2(len(tail)) inner variables) + tailed weight evaluation in
verify_whir (next_mle_with_tail for shifted statements)
- AIR final identity: e_hat(r0) * eq(gkr suffix prefix, natural_prefix)
* C_t(col_evals); AIR claims carry the Lagrange-weight tail
Validation: all 5 re-dumped vectors verify OK; transcript tamper sweep
rejects at every position (skip-region tamper fires 'AIR skip round:
weighted window identity failed').
…pilation) In-circuit mirror of the uniskip AIR sumcheck (executable spec: python-verifier verify_air_sumcheck_with_skip, T5 Rust verifier): - front-loaded per-table weights w_t = 2^(n_max-n_t) on the claimed sum - round 0: combined coefficient vector (151 EF) read before r0; window evals P(z) via compile-time z-power tables + dot_product_be; weighted window identity vs Sum_t w_t*s_t; Lagrange weights L_x(r0) with build-time-inverted constant denominators (no in-circuit inversion); target = e_hat(r0)*P(r0) - linear rounds: sumcheck_verify_reversed(n_max - SKIP_K); natural prefix = contiguous slice all_challenges[n_linear-(log_h-K)..] (reversed-storage) - final identity: e_hat(r0) * eq(bus_point[..log_h-K], natural_prefix) * C_t; k_t challenge products removed - WHIR AIR-claim weights: eq(prefix) x MLE(Lagrange tail) on the lowest SKIP_K inner coords; shifted claims via next_mle_with_tail (closed form = scalar evaluation of poly::matrix_next_mle_folded_with_tail, documented in utils.py) - compilation.rs placeholders: SKIP_K, SKIP_Z_POWERS, SKIP_LAGRANGE_C (sourced from sub_protocols::UNIVARIATE_SKIP_K + build-time KoalaBear arithmetic) Bytecode: 213,227 instructions (was 209,762, +1.65%), log_size 18 unchanged, self-referential compile converged on first guess. Recursion bench (--n 2, rate 2): total 3.344s vs 3.503s baseline (-4.5%); aggregation node 219,433 cycles (was 213,207, +2.9% = ~3.1k cycles per verified proof); leaf proofs -4.3%/-6.1% prove time; proof sizes +2-3 KiB (the 151-EF skip vector). Validation: test_recursive_aggregation_extension_table_bigger_than_execution passes; all package suites green; python verifier 5/5 vectors.
Replaces the Lagrange-coefficient dot products that extend each column's
2^K window values to the extended consecutive-integer nodes with Newton
forward-difference cascades, at all three interpolation sites of the
skip kernels: column values (in place on the gathered window buffer),
the degree-split cached state (lockstep cascade), and the low-part
accumulator (width-1 cascade over its n_low values). (2^K-1) packed adds
per column per node instead of 2^K packed Montgomery muls; exact field
arithmetic, outputs bit-identical to the Lagrange path (pinned by
test_fd_extension_matches_lagrange across all kernels, padding, and
K in {3,4,5}, plus the unchanged iter-1 suites).
Lagrange kernels kept as private _lagrange reference functions behind
the doc(hidden) compute_skip_poly_forced test hook. Prover-only; zero
protocol/transcript change; no protected files touched.
h4 kill-gate bench (this machine, --ignored skip_kernel_timing):
poseidon16 2^18x110: skip-poly FD 103.4 ms vs Lagrange 170.3 ms
execution 2^20x22 : skip-poly FD 22.5 ms vs Lagrange 42.1 ms
h4 gate: FD 125.9 ms vs Lagrange 212.4 ms -> PASS (-40.7%)
full skip path 150.2 ms vs legacy rounds 408.2 ms -> PASS
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Univariate skip for the batched AIR sumcheck (+ finite-difference kernel)
Branch
pw13-cleanTwo kept mechanisms, both targeting the AIR-constraint-evaluation cost center identified by profiling (the batched AIR sumcheck was 416 ms = 20.7 % of a 2.01 s prove on the reference workload).Introduction — the mechanisms in one paragraph each
Mechanism 1 — univariate skip for the batched AIR sumcheck (h1, commits T1–T7). The AIR zerocheck paid extension-field constraint evaluations from round 1 onward (measured: rounds ≥ 1 = 83 % of the sumcheck's cost; EF mult ≈ 15× base). We replace the first K = 4 rounds with ONE univariate round over the 16-point integer window {0..15} (Gruen, eprint 2024/108 §5): the prover evaluates all tables' constraints at 31 nodes entirely in the base field with AVX-512 packing, sends a single combined coefficient vector, and binds 4 variables with one challenge. Table batching switches from back-loaded to front-loaded (static weights w_t = 2^(n_max−n_t); no challenge products), and the WHIR opening claims become tensor-weight statements (Lagrange₁₆(r₀) ⊗ eq) via a new optional
tailonSparseStatement. The extension-field round tail shrinks 16×.Mechanism 2 — finite-difference extension kernel (h4, commit U1). Inside the skip round, extending each column's 16 window values to the 135 extended integer nodes was done with per-node Lagrange dots (16 packed Montgomery muls per column per node). Newton forward-difference cascades replace this with 15 packed adds per column per node (exact field arithmetic — proofs are bit-identical, this is kernel-only). Applied at all three interpolation sites (columns, degree-split cached state, low-evals accumulator).
Results vs
main(191e1a5)recursion --n 2 --log-inv-rate 2, total)Per-iteration gate verdicts when each mechanism landed (different intermediate baselines): h1 −6.66 % (p ≈ 0, steady-state −5.68 %, recursion −4.08 %; pre-fix harness timing method), h4 −3.10 % (p = 2.9e-05, recursion −3.25 %, proofs bit-identical; current method). The table above is the direct A–B comparison and is the defensible headline.
Measurements on M4 Mac Mini 32 GB
Validation methods used by the harness (for reviewers)
harness/leanvm/correctness/correctness.sh): Layer 0 test-file integrity (SHA-pinned), 0.5 crypto-parameter guard, 1 clippy-D warnings, 2 field/backend tests, 3 structural invariants (column counts, bus widths, logup overflow, WHIR config Rust↔Python consistency), 4 WHIR roundtrips, 5 aggregation e2e (test_aggregation: real XMSS proofs + in-circuit verification), 6 free-variable soundness (soundness_check: every committed/virtual Poseidon column constrained), 7 transcript-mutation fuzzer (200 random mutations → all must be rejected; run at 3 seeds across the iterations, 0 accepted total). Layers 0.7/0.8 (protected-file diff hygiene + frozen-main differential verification) fail by design for any transcript-shape change — this PR is the human-review path those layers prescribe.crates/lean_prover/python-verifier/): re-dumped vectors (5 shapes spanning table-height mixes and rates 1–4) all verify; independent tamper sweeps (92 + 7 positions) all rejected, each at the expected assertion.harness/leanvm/scripts/eval_paired.sh): env preflight (governor/load), A-vs-A noise floor (≤ 0.3 %), N=3 counterbalanced rounds × 4 warm proofs, Welch's t-test (keep iff Δ ≤ −1 %, p < 0.01), recursion-regression ceiling (+3 %), proof-size ceiling (+20 %) with 3× net penalty; auto-chained 250-proof steady-state confirmation.plan_spec.md), including full re-derivations of the front-loading bookkeeping, the FD anchor invariant, the closed-formnext_mle_with_tail(numerically confirmed 144/144), and the leaf/tensor index conventions.Mechanism 1 (h1) — soundness in detail
Security regime (unchanged): 124 bits, Johnson bound. |EF| = p⁵ ≈ 2^154.6 (KoalaBear quintic, X⁵+X²−1). WHIR parameters, RS domains, query counts, grinding bits, Merkle arity,
WHIR_CONFIGS, and the Poseidon permutation are bit-identical to main. Full write-up:report/hypothesis_1/security_regime.md.What changes, protocol-wise. The batched AIR sumcheck proves Σ_t Σ_{x∈H_{n_t}} eq(gkr_pt_t, x)·C_t(cols_t(x)) = Σ_t S_t, where S_t are the per-table bus claims the verifier computes itself from the LogUp-GKR outputs (verify_execution.rs:113–132, unchanged arithmetic).
back_loaded_table_contribution). Soundness argument is identical to the back-loaded scheme — the per-table multiplier is a monomial in the unused variables; per-round Schwartz–Zippel error unchanged at ≤ 12/|EF|.gkr_point— identical across tables since all eq factors are suffixes of it) are bound by one univariate round. The prover sends the combined polynomial v′(X) = Σ_t w_t·v′t(X), deg ≤ (2^K−1)·d_max = 150, as 151 coefficients. The verifier checks Σ{z∈D} ê(z)·v′(z) = Σ_t w_t·S_t where ê is the degree-15 interpolation of eq(gkr_top, bits(·)) over D — an exact identity — then samples r₀ ∈ EF and continues with target ê(r₀)·v′(r₀). Cheating error for the round: ≤ deg(v′)/|EF| ≈ 150/2^154.6 ≈ 2^−147.4. Per-table claims stay α-separated exactly as on main (the air_alpha powers are disjoint per table); sending the combined polynomial is the same compression every other round already uses.SparseStatementgains an optionaltail: Vec<EF>; the weight is eq(point) ⊗ MLE(tail) — still a public, verifier-evaluable multilinear weight, i.e. exactly the statement class WHIR already proves (constrained-RS codes with arbitrary public weights; Arnon–Chiesa–Fenzi–Yogev 2024/1586).tail = Nonepaths are bit-identical to main (enforced by behavioral-equivalence tests: a tailed statement with tail = eq-expansion of 4 challenges is byte-equivalent to the plain statement with those challenges appended). Shift("next") statements use the shift-by-one of the tail-seeded eq vector, validated against Σ_x tail[x]·next_mle identities.log_size18 unchanged; aggregation-node cycles went down (−614) because one grinding event replaces four and three sumcheck rounds disappear.Added soundness error, total: < 2^−146 against a 2^−124 budget — absorbed without moving the regime. The mutation fuzzer and the python tamper sweeps exercise the new verifier checks directly (window-identity rejection observed at the exact assertion).
Mechanism 2 (h4) — soundness in detail
There is no protocol surface: the FD cascade computes the same field elements the Lagrange dots computed (Newton forward differences on unit-spaced nodes; exact arithmetic in odd characteristic; the division-free advance is add-only). Equality is enforced, not assumed:
test_fd_extension_matches_lagrangepins bit-identical full coefficient vectors across the generic, degree-split, and unpacked kernels, the padding path, and K ∈ {3,4,5} (10 configurations), and the entire iter-1 test battery (window-sum identities vs brute-force cube sums, post-skip state vs naive references, e2e roundtrips, production prove+verify) runs unmodified on the FD path. The independent review re-derived the anchor invariant (row_i = Δ^{n−1−i}p(i), ascending advance) and the degree-split lockstep node accounting. Proofs produced before/after the commit are byte-identical; the verifier is untouched.What needs human audit (and what has already been independently checked)
An independent reviewer (Opus-class agent, read-level + hand-derivation) performed a soundness double-check of this branch. Verdict: no exploitable break found; diligence review, not a proof. The calibration below is theirs and we adopt it: the existing test suite demonstrates completeness (honest proofs verify) and single-tamper rejection (200/200 fuzzer, 5/5 python vectors, per-coefficient corruption tests) — neither implies soundness against a prover genuinely trying to cheat. The L0.7/L0.8 gates are tripwires ("protected files were touched"), not soundness checks. This change is not production-safe until a human cryptographer signs off.
Already independently verified
copy_ef= write-once equality). A tighter-Rust/looser-circuit hole was the top concern and is ruled out.poseidon1_koalabear_16.rsandlean_prover/src/lib.rsidentical to baseline (SECURITY_BITS=124, GRINDING_BITS=16, WHIR folding 7/5, RS reduction 5, all round data); PythonWHIR_CONFIGSunchanged. No silent parameter weakening — the single most damaging failure mode for an auto-optimizer, checked first.eval_eq_with_tailwith wraparound was identity-checked.Where the audit should aim (residual risk is narrow and localized):
matrix_next_mle_folded_with_tail(backend/poly/src/next_mle.rs) and especially the recursion circuit's hand-factored closed form inzkdsl_implem/utils.py, corner termprod_corner·(prod_y_low − eq_low[0]). Status: verified by algebraic identity + honest-input agreement (test_aggregation) + 144/144 numerical cross-check against the naive sum — which establishes the honest value is right, not that no wrong shift value can satisfy the check. That property needs independent re-derivation.plan_spec.mdif desired.Discarded mechanisms (investigated, killed or reverted — no residue on the branch)
h8 — univariate skip for WHIR's initial folding sumcheck (implemented fully, discarded at the gate, reverted cleanly)
Applied the same skip to the first 4 of WHIR's 7 initial folding rounds, with code-domain semantics as a 16-ary shard fold (same Merkle leaves) and soundness via BCIKS 2020/654 Thm 1.5/7.2 — correlated agreement for degree-15 parameterized curves, the same conjecture class the deployed gamma-power statement batching already assumes at degree ≈ 190 (security write-up:
report/hypothesis_8/security_regime.md). All 5 commits passed independent review; recursion cycles below baseline; mutation fuzzer clean. Discarded on measurement: −1.23 % throughput (vs −5.5 predicted) and +4 KiB proof → net −1.21 % after the 3× size penalty. Root cause (concept, not implementation): compute savings in the memory-bandwidth-bound WHIR region convert to wall-clock at only ~0.4× (the AIR region converts ~1:1), and transcript deltas multiply per WHIR instance in aggregate bundles. Reverted as00032320..ff7eeb2d; full post-mortemreport/iter3_summary.md.h2 — GKR offload of the Poseidon permutation (killed at planning)
Removing the 84 intermediate Poseidon columns in favor of a 28-layer data-parallel GKR would add +29.5–40 KiB of transcript (28 constant-width 2^22 layers × ~71 EF each) against a +5 KiB budget — structurally irreducible (fusing layers grows bytes as 3^k; round-stacking would require committing the very intermediates being removed). Commit cost would not drop (ν stays 26: 60.2 M cells → 38 M, boundary at 33.5 M).
report/hypothesis_2/gkr-poseidon-offload.md.h3 — vectorized (block-4) memory bus for Poseidon I/O (killed at audit)
Block lookups require 4-aligned operand pointers; a trace audit of the production workload measured 99 % / 67 % / 60 % / 91 % of the four Poseidon operand classes unaligned (the guest allocator is a raw bump pointer; the preamble alone is ≡ 1 mod 4). An alignment retrofit is an allocation-policy change (out of scope) with a 2^22 memory-boundary downside. The protocol side (block lane coexistence, tensor fingerprints, domain separation) was verified feasible and is documented for any future ISA that guarantees alignment.
report/hypothesis_3/vectorized-memory-bus.md.h9 — power-circuits rewiring of the LogUp fraction tree (killed at in-pipeline measurement, zero protocol surface)
Soukhanov 2023/1611's multiplication-via-squares wiring for the GKR quotient tree. Killed in ~3 hours by the in-pipeline kill-gate: the squaring kernel measured 2.55× slower than production inside the real prove (packed quintic squaring costs ~1.7–1.8× of a mul on Zen 4 — ILP-bound; the 0.6× literature assumption does not hold), plus three pre-registered code-fact refutations (the GKR transcript already sends Gruen-elided 2 EF/round; the eq machinery already has Gruen + split-eq; the 7/4 state growth crosses the logup 2^25 boundary). Permanent diagnostic: per-layer GKR cost attribution.
report/hypothesis_9/killgate.md.Triage-killed without implementation
report/hypothesis_6/soundness_sketch.md.report/papers/iter_4/notes.md.Reviewer reproduction quick-start
Protected-file diff summary (the human-review surface)
crates/lean_prover/src/verify_execution.rscrates/whir/src/open.rscrates/whir/src/verify.rstail=Nonearms expression-identical to maincrates/sub_protocols/src/air_sumcheck.rspub(crate)); zero logic change; legacy entry points preserved verbatimEverything else (fiat-shamir, koala-bear/Poseidon, WHIR configs,
stack_polynomials_and_commit, structural AIR methods) is zero-diff — enforced by the Layer 0.5/0.7 guards and verifiable bygit diff main...ff7eeb2d --stat.Limitations & follow-ups
UNIVARIATE_SKIP_K, single source Rust→Python→zkDSL); a K ∈ {3,5} sweep was analyzed (K = 4 near-optimal with FD; K = 5 is a net loss) but not gate-measured.