GPU Witness Generation
Overview
Accelerate witness generation by offloading computation from CPU to GPU.
The work is organized along two dimensions: chip category and functionality.

                      ┌─────────────────┐
                      │   StepRecord    │ (from emulator trace)
                      └────────┬────────┘
                               │
                ┌──────────────┼──────────────┐
                ▼              ▼              ▼
        ┌────────────┐  ┌──────────────┐  ┌───────────────┐
        │ F-1 Witness│  │ F-2 LK Mult. │  │ F-3 Shard Ctx │
        │   Matrix   │  │  (per-chip)  │  │    Records    │
        └─────┬──────┘  └──────┬───────┘  └────────┬──────┘
              │                │                   │
              │         ┌──────▼───────┐    ┌──────▼──────────┐
              │         │ finalize lk  │    │ ShardRamCircuit │
              │         │multiplicities│    │      (C-5)      │
              │         └──────┬───────┘    └──────┬──────────┘
              │                │                   │
              │         ┌──────▼───────┐           │
              │         │Table Circuits│           │
              │         │    (C-3)     │           │
              │         └──────┬───────┘           │
              │                │                   │
              └────────────────┼───────────────────┘
                               ▼
                       ┌────────────────┐
                       │   Proof Gen    │
                       └────────────────┘

Chip Categories
| ID | Category | Count | Description |
|---|---|---|---|
| C-1 | RV32IM Base Instructions | 45 | Core integer/memory/branch/jump instructions |
| C-2 | ECALL / Precompile | 17 | Keccak, SHA, elliptic curve, field ops, uint256 |
| C-3 | Table Circuits | 8 | Range, Ops (And/Or/Xor/Ltu/Pow), DoubleU8, Program |
| C-4 | RAM Init / Final Circuits | 7 | RegInit, StaticMemInit, PubIO, Hints/Stack/HeapInit, LocalFinal |
| C-5 | ShardRamCircuit | 1 | Cross-shard RAM consistency (374 witness cols, Poseidon2) |
Functionality Categories
| ID | Functionality | Description |
|---|---|---|
| F-1 | Fill Witness Matrix | Populate `RowMajorMatrix<BB31>` per chip (the main proof matrix) |
| F-2 | Lookup Multiplicity | Accumulate per-table lookup counters |
| F-3 | Shard Context Records | Cross-shard read/write RAM records consumed by ShardRamCircuit |
Lookup multiplicities are accumulated for 8 tables: Dynamic, DoubleU8, And, Or, Xor, Ltu, Pow, and Instruction.
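To make F-1 concrete, here is a minimal CPU-side sketch of what filling a row-major witness matrix involves: one row per step, one column per witness value. Everything in it is an assumption for illustration. `WitnessMatrix` is a hypothetical stand-in for `RowMajorMatrix<BB31>`, the cells are plain `u32` values rather than field elements, and the three-column ADD layout is far simpler than the 22-column layout used by `witgen_add/sub`.

```rust
/// Hypothetical stand-in for a row-major witness matrix. Cells are plain u32
/// values here, not real field elements.
pub struct WitnessMatrix {
    pub width: usize,
    /// Row-major storage: row r occupies values[r * width .. (r + 1) * width].
    pub values: Vec<u32>,
}

impl WitnessMatrix {
    pub fn new(rows: usize, width: usize) -> Self {
        Self { width, values: vec![0; rows * width] }
    }

    pub fn row(&self, r: usize) -> &[u32] {
        &self.values[r * self.width..(r + 1) * self.width]
    }
}

/// Hypothetical, simplified ADD step holding only the two operands.
pub struct AddStep {
    pub rs1: u32,
    pub rs2: u32,
}

/// Fill one row per step as [rs1, rs2, rs1 + rs2 (wrapping)].
/// Each row depends only on its own step, so rows can be computed
/// independently (for example, one GPU thread per row).
pub fn fill_add_witness(steps: &[AddStep]) -> WitnessMatrix {
    const WIDTH: usize = 3; // illustrative only, not the real column count
    let mut m = WitnessMatrix::new(steps.len(), WIDTH);
    for (row, step) in m.values.chunks_exact_mut(WIDTH).zip(steps) {
        row[0] = step.rs1;
        row[1] = step.rs2;
        row[2] = step.rs1.wrapping_add(step.rs2);
    }
    m
}
```

Because no row reads another row's data, the loop body maps naturally onto a kernel indexed by row number.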
Progress Matrix
| | F-1 Witness | F-2 Lookup | F-3 Shard |
|---|---|---|---|
| C-1 RV32IM | ✅ | ✅ | ✅ |
| C-2 ECALL | ➡️ | ➡️ | ➡️ |
| C-3 Tables | | | |
| C-4 RAM Init/Final | | | |
| C-5 ShardRam | ➡️ | | |
Key dependencies:
- F-2 results from all chips are merged by `finalize_lk_multiplicities()` and consumed by C-3 (Table Circuits).
- F-3 results are consumed by C-5 (ShardRamCircuit).
- C-4 (RAM Init/Final) consumes `MemFinalRecord` from the trace directly (not from F-3).
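The F-2 merge step can be pictured as summing per-chip counters table by table. The sketch below is a minimal illustration under stated assumptions: the `LkMultiplicity` alias (a sparse map from table row index to count) and the `merge_lk_multiplicities` function are hypothetical and do not reflect the signature or data layout of the real `finalize_lk_multiplicities()`.

```rust
use std::collections::HashMap;

/// The eight lookup tables named in this issue.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum LookupTable {
    Dynamic,
    DoubleU8,
    And,
    Or,
    Xor,
    Ltu,
    Pow,
    Instruction,
}

/// Hypothetical per-chip result: for each table, a sparse map from
/// table row index to the number of times that row was looked up.
pub type LkMultiplicity = HashMap<LookupTable, HashMap<u64, u64>>;

/// Sum the counters of all chips into one multiplicity map per table.
pub fn merge_lk_multiplicities(per_chip: &[LkMultiplicity]) -> LkMultiplicity {
    let mut merged = LkMultiplicity::new();
    for chip in per_chip {
        for (table, counts) in chip {
            let dst = merged.entry(*table).or_default();
            for (row, n) in counts {
                *dst.entry(*row).or_insert(0) += *n;
            }
        }
    }
    merged
}
```

Addition is commutative, so the merged result does not depend on the order in which chips finish, which is what allows the per-chip accumulation to run in parallel.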
PRs
- ceno-gpu: https://github.com/scroll-tech/ceno-gpu/pull/142
- ceno: feat: Make StepRecord use repr(C) layout #1260
- ceno: (draft) feat: GPU witness generation #1259
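As background for the `repr(C)` change: Rust's default layout lets the compiler reorder fields, so a struct copied byte-for-byte to the GPU may not match the offsets a kernel expects. The sketch below shows the general idea only. `StepRecordSketch` and its fields are hypothetical and do not reflect the real `StepRecord`.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical, trimmed-down record. With #[repr(C)], fields are laid out in
/// declaration order using C padding rules, so a C/CUDA struct declared with
/// the same field types in the same order has identical offsets.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct StepRecordSketch {
    pub cycle: u64,     // offset 0
    pub pc: u32,        // offset 8
    pub insn: u32,      // offset 12
    pub rs1_value: u32, // offset 16
    pub rs2_value: u32, // offset 20
}

// Compile-time check that the layout has the expected size (no padding).
const _: () = assert!(size_of::<StepRecordSketch>() == 24);

/// View a slice of records as raw bytes, e.g. as the source buffer of a
/// host-to-device copy.
pub fn as_bytes(records: &[StepRecordSketch]) -> &[u8] {
    // SAFETY: the pointer and length cover exactly the memory of `records`,
    // every field is a plain integer, and the struct has no padding bytes,
    // so every byte in the range is initialized.
    unsafe {
        std::slice::from_raw_parts(records.as_ptr() as *const u8, size_of_val(records))
    }
}
```

The fields are chosen so that no padding is needed. A record with padding would need a different approach, since padding bytes are not guaranteed to be initialized.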
Current Status
- C-1 x F-1 — (RV32IM) All 45 instructions have GPU kernels producing witness matrices
- C-1 x F-2 — (RV32IM) GPU lookup multiplicity accumulation
- C-1 x F-3 — (RV32IM) GPU + lightweight CPU shard context record collection
- C-5 — ShardRamCircuit
- C-2 — Ecall_Keccak
C-1: RV32IM Base Instructions (45)
Grouped by GPU kernel (shared witness column layout).
| Instructions | Type | F-1: Witness | F-2: Lookup | F-3: Shard |
|---|---|---|---|---|
| ADD, SUB (2) | integer add/sub (R) | witgen_add/sub (22 cols) | | |
| AND, OR, XOR (3) | bitwise logic (R) | witgen_logic_r (28 cols) | | |
| SLT, SLTU (2) | set-less-than (R) | witgen_slt (26 cols) | | |
| SLL, SRL, SRA (3) | shift (R) | witgen_shift_r (47 cols) | | |
| MUL (1) | multiply low (R) | witgen_mul (22 cols) | | |
| MULH, MULHU, MULHSU (3) | multiply high (R) | witgen_mul (26 cols) | | |
| DIV, DIVU, REM, REMU (4) | div/rem (R) | witgen_div (39 cols) | | |
| ADDI (1) | immediate add (I) | witgen_addi (18 cols) | | |
| ANDI, ORI, XORI (3) | immediate logic (I) | witgen_logic_i (24 cols) | | |
| SLTI, SLTIU (2) | immediate compare (I) | witgen_slti (22 cols) | | |
| SLLI, SRLI, SRAI (3) | immediate shift (I) | witgen_shift_i (40 cols) | | |
| LUI (1) | load upper imm (U) | witgen_lui (16 cols) | | |
| AUIPC (1) | add upper imm to PC (U) | witgen_auipc (21 cols) | | |
| BEQ, BNE (2) | branch equal/neq (B) | witgen_branch_eq (19 cols) | | |
| BLT, BLTU, BGE, BGEU (4) | branch compare (B) | witgen_branch_cmp (22 cols) | | |
| JAL (1) | jump and link (J) | witgen_jal (13 cols) | | |
| JALR (1) | jump and link reg (I) | witgen_jalr (22 cols) | | |
| LW (1) | load word (I) | witgen_lw (23 cols) | | |
| LH, LHU, LB, LBU (4) | load sub-word (I) | witgen_load_sub (25-29 cols) | | |
| SW (1) | store word (S) | witgen_sw (23 cols) | | |
| SH (1) | store half (S) | witgen_sh (24 cols) | | |
| SB (1) | store byte (S) | witgen_sb (29 cols) | | |
| Total: 45 | | 22 GPU kernels | | |
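The grouping in the table amounts to a lookup from instruction mnemonic to kernel. The sketch below covers only the R-type rows and copies the kernel names and column counts from the table. The `KernelInfo` type and the `kernel_for` function are hypothetical and are not part of ceno or ceno-gpu.

```rust
/// Hypothetical description of the GPU kernel that handles an instruction.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct KernelInfo {
    pub name: &'static str,
    pub witness_cols: usize,
}

/// Map an R-type mnemonic to its kernel, following the table above.
/// Returns None for anything this sketch does not cover.
pub fn kernel_for(mnemonic: &str) -> Option<KernelInfo> {
    let (name, witness_cols) = match mnemonic {
        "ADD" | "SUB" => ("witgen_add/sub", 22),
        "AND" | "OR" | "XOR" => ("witgen_logic_r", 28),
        "SLT" | "SLTU" => ("witgen_slt", 26),
        "SLL" | "SRL" | "SRA" => ("witgen_shift_r", 47),
        "MUL" => ("witgen_mul", 22),
        "MULH" | "MULHU" | "MULHSU" => ("witgen_mul", 26),
        "DIV" | "DIVU" | "REM" | "REMU" => ("witgen_div", 39),
        _ => return None,
    };
    Some(KernelInfo { name, witness_cols })
}
```

Instructions that share a row share a witness column layout, so steps can be bucketed by kernel before launch and each bucket sized as `rows * witness_cols`.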