GPU Witness Generation #1265

@Velaciela

GPU Witness Generation

Overview

Accelerate witness generation by offloading computation from CPU to GPU.
The work is organized along two dimensions: chip category and functionality.

                    ┌─────────────────┐
                    │   StepRecord    │  (from emulator trace)
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
     ┌────────────┐  ┌──────────────┐  ┌───────────────┐
     │ F-1 Witness│  │ F-2 LK Mult. │  │ F-3 Shard Ctx │
     │   Matrix   │  │  (per-chip)  │  │   Records     │
     └─────┬──────┘  └──────┬───────┘  └────────┬──────┘
           │                │                   │
           │         ┌──────▼───────┐    ┌──────▼──────────┐
           │         │ finalize lk  │    │ ShardRamCircuit │
           │         │multiplicities│    │      (C-5)      │
           │         └──────┬───────┘    └──────┬──────────┘
           │                │                   │
           │         ┌──────▼───────┐           │
           │         │Table Circuits│           │
           │         │    (C-3)     │           │
           │         └──────┬───────┘           │
           │                │                   │
           └────────────────┼───────────────────┘
                            ▼
                    ┌────────────────┐
                    │   Proof Gen    │
                    └────────────────┘

Chip Categories

| ID  | Category                    | Count | Description                                                      |
|-----|-----------------------------|-------|------------------------------------------------------------------|
| C-1 | RV32IM Base Instructions    | 45    | Core integer/memory/branch/jump instructions                     |
| C-2 | ECALL / Precompile          | 17    | Keccak, SHA, elliptic curve, field ops, uint256                  |
| C-3 | Table Circuits              | 8     | Range, Ops (And/Or/Xor/Ltu/Pow), DoubleU8, Program               |
| C-4 | RAM Init / Final Circuits   | 7     | RegInit, StaticMemInit, PubIO, Hints/Stack/HeapInit, LocalFinal  |
| C-5 | ShardRamCircuit             | 1     | Cross-shard RAM consistency (374 witness cols, Poseidon2)        |

Functionality Categories

| ID  | Functionality         | Description                                                      |
|-----|-----------------------|------------------------------------------------------------------|
| F-1 | Fill Witness Matrix   | Populate RowMajorMatrix<BB31> per chip (the main proof matrix)   |
| F-2 | Lookup Multiplicity   | Accumulate per-table lookup counters                             |
| F-3 | Shard Context Records | Cross-shard read/write RAM records consumed by ShardRamCircuit   |

(Lookup multiplicity covers 8 tables: Dynamic, DoubleU8, And, Or, Xor, Ltu, Pow, Instruction.)
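To make F-2 concrete, here is a minimal CPU-side sketch of per-table lookup counting. The names `LookupTable`, `LkMultiplicity`, `record`, and `count` are hypothetical and chosen for illustration; they are not taken from the codebase, and the key encoding (a single `u64`) is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical table identifiers mirroring the eight tables listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum LookupTable {
    Dynamic,
    DoubleU8,
    And,
    Or,
    Xor,
    Ltu,
    Pow,
    Instruction,
}

/// Per-chip counters: (table, key) -> number of times that entry was looked up.
#[derive(Default, Debug)]
pub struct LkMultiplicity {
    counts: HashMap<(LookupTable, u64), u64>,
}

impl LkMultiplicity {
    /// Record one lookup of `key` in `table`.
    pub fn record(&mut self, table: LookupTable, key: u64) {
        *self.counts.entry((table, key)).or_insert(0) += 1;
    }

    /// Return how many times `key` was looked up in `table` (0 if never).
    pub fn count(&self, table: LookupTable, key: u64) -> u64 {
        self.counts.get(&(table, key)).copied().unwrap_or(0)
    }
}
```

A GPU version would presumably replace the hash map with a dense per-table counter array updated with atomic adds, but that is a guess about the design rather than a description of the actual kernels.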

Progress Matrix

F-1 Witness F-2 Lookup F-3 Shard
C-1 RV32IM
C-2 ECALL ➡️ ➡️ ➡️
C-3 Tables
C-4 RAM Init/Final
C-5 ShardRam ➡️

Key dependencies:

  • F-2 results from all chips are merged by finalize_lk_multiplicities() and consumed by C-3 (Table Circuits).
  • F-3 results are consumed by C-5 (ShardRamCircuit).
  • C-4 (RAM Init/Final) consumes MemFinalRecord from the trace directly (not from F-3).
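The first dependency (per-chip F-2 results merged before the table circuits consume them) can be sketched as a table-by-table sum. This is an illustrative stand-in, not the body of `finalize_lk_multiplicities()`: the `ChipMultiplicity` type and the `merge_multiplicities` function are hypothetical, and the real data layout may differ.

```rust
use std::collections::HashMap;

/// Hypothetical per-chip result: one `key -> count` map per lookup table,
/// indexed by table number.
pub type ChipMultiplicity = Vec<HashMap<u64, u64>>;

/// Sum the counters from every chip, table by table.
/// Tables beyond `num_tables` in a chip's result are ignored.
pub fn merge_multiplicities(num_tables: usize, chips: &[ChipMultiplicity]) -> ChipMultiplicity {
    let mut merged = vec![HashMap::new(); num_tables];
    for chip in chips {
        for (table_idx, table) in chip.iter().enumerate().take(num_tables) {
            for (&key, &count) in table {
                *merged[table_idx].entry(key).or_insert(0) += count;
            }
        }
    }
    merged
}
```

Because the merge is a plain sum, the order in which chips finish does not affect the result, which is what lets the per-chip F-2 work run independently before C-3.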

PRs

Current Status

  • C-1 x F-1 — (RV32IM) All 45 instructions have GPU kernels producing witness matrices
  • C-1 x F-2 — (RV32IM) GPU lookup multiplicity accumulation
  • C-1 x F-3 — (RV32IM) GPU + lightweight CPU shard context record collection
  • C-5 — ShardRamCircuit
  • C-2 — Ecall_Keccak

C-1: RV32IM Base Instructions (45)

Grouped by GPU kernel (shared witness column layout).

| Instructions             | Type                    | F-1: Witness                 | F-2: Lookup | F-3: Shard |
|--------------------------|-------------------------|------------------------------|-------------|------------|
| ADD, SUB (2)             | integer add/sub (R)     | witgen_add/sub (22 cols)     |             |            |
| AND, OR, XOR (3)         | bitwise logic (R)       | witgen_logic_r (28 cols)     |             |            |
| SLT, SLTU (2)            | set-less-than (R)       | witgen_slt (26 cols)         |             |            |
| SLL, SRL, SRA (3)        | shift (R)               | witgen_shift_r (47 cols)     |             |            |
| MUL (1)                  | multiply low (R)        | witgen_mul (22 cols)         |             |            |
| MULH, MULHU, MULHSU (3)  | multiply high (R)       | witgen_mul (26 cols)         |             |            |
| DIV, DIVU, REM, REMU (4) | div/rem (R)             | witgen_div (39 cols)         |             |            |
| ADDI (1)                 | immediate add (I)       | witgen_addi (18 cols)        |             |            |
| ANDI, ORI, XORI (3)      | immediate logic (I)     | witgen_logic_i (24 cols)     |             |            |
| SLTI, SLTIU (2)          | immediate compare (I)   | witgen_slti (22 cols)        |             |            |
| SLLI, SRLI, SRAI (3)     | immediate shift (I)     | witgen_shift_i (40 cols)     |             |            |
| LUI (1)                  | load upper imm (U)      | witgen_lui (16 cols)         |             |            |
| AUIPC (1)                | add upper imm to PC (U) | witgen_auipc (21 cols)       |             |            |
| BEQ, BNE (2)             | branch equal/neq (B)    | witgen_branch_eq (19 cols)   |             |            |
| BLT, BLTU, BGE, BGEU (4) | branch compare (B)      | witgen_branch_cmp (22 cols)  |             |            |
| JAL (1)                  | jump and link (J)       | witgen_jal (13 cols)         |             |            |
| JALR (1)                 | jump and link reg (I)   | witgen_jalr (22 cols)        |             |            |
| LW (1)                   | load word (I)           | witgen_lw (23 cols)          |             |            |
| LH, LHU, LB, LBU (4)     | load sub-word (I)       | witgen_load_sub (25-29 cols) |             |            |
| SW (1)                   | store word (S)          | witgen_sw (23 cols)          |             |            |
| SH (1)                   | store half (S)          | witgen_sh (24 cols)          |             |            |
| SB (1)                   | store byte (S)          | witgen_sb (29 cols)          |             |            |
| Total: 45                |                         | 22 GPU kernels               |             |            |
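As an illustration of what a per-kernel witness fill does, the sketch below writes one row per step into a row-major buffer with a fixed column count. It is a deliberately simplified CPU model: the 6-column layout, the `AddStep` struct, and the `fill_witness` function are hypothetical and do not reproduce the real 22-column ADD layout or the `witgen_*` kernels.

```rust
/// Hypothetical, simplified step record for an ADD-style instruction.
pub struct AddStep {
    pub rs1: u32,
    pub rs2: u32,
}

/// Columns in this illustrative layout: two 16-bit limbs each for rs1, rs2, rd.
pub const NUM_COLS: usize = 6;

/// Split a 32-bit value into two 16-bit limbs, low limb first.
fn limbs(v: u32) -> [u32; 2] {
    [v & 0xFFFF, v >> 16]
}

/// Fill a row-major matrix: one row per step, `NUM_COLS` values per row.
pub fn fill_witness(steps: &[AddStep]) -> Vec<u32> {
    let mut matrix = vec![0u32; steps.len() * NUM_COLS];
    for (row, step) in matrix.chunks_exact_mut(NUM_COLS).zip(steps) {
        // RV32 ADD wraps on overflow.
        let rd = step.rs1.wrapping_add(step.rs2);
        row[0..2].copy_from_slice(&limbs(step.rs1));
        row[2..4].copy_from_slice(&limbs(step.rs2));
        row[4..6].copy_from_slice(&limbs(rd));
    }
    matrix
}
```

Each row depends only on its own step record, so the loop body maps naturally onto one GPU thread per row; instructions that share a column layout can then share a kernel, as in the grouping above.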
