feat: Make StepRecord use repr(C) layout #1260

Open
Velaciela wants to merge 2 commits into master from feat/reprC-StepRecord

@Velaciela Velaciela commented Mar 4, 2026

Motivation

A fixed-size, Copy, repr(C) StepRecord is a prerequisite for GPU-accelerated witness generation. With a deterministic layout, millions of step records can be bulk-copied (H2D) to the GPU without any serialization. The Copy trait also reduces per-step clone overhead on the CPU side by turning clones into cheap bitwise copies.

Summary

  • Refactor StepRecord to implement Copy and use #[repr(C)] for a deterministic memory layout, so that records can be bulk-copied host-to-device (H2D) for future CUDA witness generation
  • Narrow RegIdx type from usize to u8 for compactness (RISC-V has only 32 registers)
  • Extract SyscallWitness out of StepRecord into a separate store indexed by u32, since SyscallWitness contains Vec and cannot be Copy

  • before
pub struct StepRecord {
    cycle: Cycle,
    pc: Change<ByteAddr>,
    pub heap_maxtouch_addr: Change<ByteAddr>,
    pub hint_maxtouch_addr: Change<ByteAddr>,
    pub insn: Instruction,

    rs1: Option<ReadOp>,
    rs2: Option<ReadOp>,
    rd: Option<WriteOp>,

    memory_op: Option<WriteOp>,
    syscall: Option<SyscallWitness>,
}
  • after
#[repr(C)]
pub struct StepRecord {
    cycle: Cycle,
    pc: Change<ByteAddr>,
    pub heap_maxtouch_addr: Change<ByteAddr>,
    pub hint_maxtouch_addr: Change<ByteAddr>,
    pub insn: Instruction,

    has_rs1: bool,
    has_rs2: bool,
    has_rd: bool,
    has_memory_op: bool,

    rs1: ReadOp,
    rs2: ReadOp,
    rd: WriteOp,
    memory_op: WriteOp,

    /// Index into the separate syscall witness storage.
    /// `u32::MAX` means no syscall for this step.
    syscall_index: u32,
}
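
The Option-to-flag flattening above can be sketched on a single field. This is a minimal illustration, not the PR's code: `FlatStep` is a hypothetical name, and the fields of `ReadOp` here are simplified assumptions. The point is that a presence flag plus an always-present inline value keeps the struct Copy with a fixed repr(C) layout, while accessors can still hand back an Option.

```rust
/// Simplified stand-in for the real ReadOp; this field set is an assumption.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
#[repr(C)]
pub struct ReadOp {
    pub addr: u32,
    pub value: u32,
}

/// Hypothetical one-field version of the flattened record.
#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct FlatStep {
    has_rs1: bool,
    rs1: ReadOp,
}

impl FlatStep {
    /// Store the flag and a default-filled value when the operand is absent.
    pub fn new(rs1: Option<ReadOp>) -> Self {
        Self {
            has_rs1: rs1.is_some(),
            rs1: rs1.unwrap_or_default(),
        }
    }

    /// Rebuild the Option view from the flag and the inline value.
    pub fn rs1(&self) -> Option<ReadOp> {
        self.has_rs1.then_some(self.rs1)
    }
}
```

Callers that previously matched on `Option<ReadOp>` can keep doing so through the accessor; only the stored representation changes.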

Changes

Core data structure changes (ceno_emul)

  1. StepRecord is now Copy + repr(C) (136 bytes, 4-byte aligned):

    • Replaced Option<ReadOp> / Option<WriteOp> fields with inline values + has_rs1 / has_rs2 / has_rd / has_memory_op boolean flags
    • Replaced Option<SyscallWitness> with syscall_index: u32 (index into external store, u32::MAX = no syscall)
    • Added manual Default impl with sentinel values
  2. RegIdx narrowed from usize to u8 (addr.rs):

    • All register index casts updated across disassemble, rv32im, vm_state, platform
    • Bounds checks use original u32 width before narrowing to avoid silent truncation
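
As a rough sketch of why the bounds check has to happen at the original width: an `as` cast from u32 to u8 keeps only the low 8 bits, so an out-of-range value such as 256 would silently become register 0. The helper name `to_reg_idx` and the constant `NUM_REGS` below are hypothetical and not taken from the PR.

```rust
pub type RegIdx = u8;

/// RISC-V has 32 integer registers (x0..x31).
pub const NUM_REGS: u32 = 32;

/// Check the bound on the full u32 value first, then narrow.
/// Narrowing first would map 256 to 0 and pass a later `< 32` check.
pub fn to_reg_idx(raw: u32) -> Option<RegIdx> {
    if raw < NUM_REGS {
        Some(raw as RegIdx)
    } else {
        None
    }
}
```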
  3. #[repr(C)] / #[repr(u8)] added to supporting types:

    • ByteAddr, WordAddr, Instruction, InsnKind, MemOp<T>, Change<T>
    • Ensures field ordering matches CUDA struct definitions
  4. FullTracer gains syscall_witnesses: Vec<SyscallWitness>:

    • track_syscall() pushes to this vec and stores the index in StepRecord
    • New accessor syscall_witnesses() -> &[SyscallWitness]
    • reset_step_buffer() clears syscall_witnesses, keeping indices shard-local and avoiding cross-shard accumulation
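
The side-store pattern in item 4 might look roughly like the following. The `Tracer` type and the contents of `SyscallWitness` are simplified assumptions for illustration; only the method names mirror the ones listed above.

```rust
/// Stand-in for the real SyscallWitness; it owns a Vec, so it cannot be Copy.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct SyscallWitness {
    pub mem_ops: Vec<u32>,
}

/// Hypothetical, reduced tracer that holds only the side store.
#[derive(Default)]
pub struct Tracer {
    syscall_witnesses: Vec<SyscallWitness>,
}

impl Tracer {
    /// Push the witness and return the index that a step record would store.
    pub fn track_syscall(&mut self, witness: SyscallWitness) -> u32 {
        let len = self.syscall_witnesses.len();
        // u32::MAX is reserved as the "no syscall" sentinel.
        assert!(len < u32::MAX as usize, "syscall witness store is full");
        self.syscall_witnesses.push(witness);
        len as u32
    }

    pub fn syscall_witnesses(&self) -> &[SyscallWitness] {
        &self.syscall_witnesses
    }

    /// Clearing per shard keeps the indices shard-local.
    pub fn reset_step_buffer(&mut self) {
        self.syscall_witnesses.clear();
    }
}
```

Because the store is cleared on reset, an index is only meaningful together with the witness slice of the same shard.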
  5. Layout verification tests (tracer.rs):

    • test_step_record_is_copy_and_compact: asserts Copy trait and size <= 144 bytes
    • test_step_record_layout_for_gpu: asserts exact byte offsets of every field for CUDA header alignment
    • test_supporting_types_are_copy: asserts ReadOp, WriteOp, Change are Copy
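
For readers unfamiliar with this kind of check, here is a small sketch of layout assertions on a made-up `Record` struct (not the real StepRecord, whose offsets are not reproduced here). It assumes a toolchain where `std::mem::offset_of!` is available (stable since Rust 1.77) and evaluates the checks at compile time.

```rust
use std::mem::{align_of, offset_of, size_of};

/// Hypothetical record; with repr(C), fields are laid out in declaration order.
#[derive(Clone, Copy, Default)]
#[repr(C)]
pub struct Record {
    pub cycle: u64,         // offset 0
    pub pc: u32,            // offset 8
    pub has_rd: bool,       // offset 12
    pub rd_idx: u8,         // offset 13, followed by 2 padding bytes
    pub syscall_index: u32, // offset 16, followed by 4 bytes of tail padding
}

/// Only compiles when T is Copy.
const fn assert_copy<T: Copy>() {}

// Compile-time checks: the build fails if the layout drifts.
const _: () = assert_copy::<Record>();
const _: () = assert!(size_of::<Record>() == 24);
const _: () = assert!(align_of::<Record>() == 8);
const _: () = assert!(offset_of!(Record, pc) == 8);
const _: () = assert!(offset_of!(Record, has_rd) == 12);
const _: () = assert!(offset_of!(Record, rd_idx) == 13);
const _: () = assert!(offset_of!(Record, syscall_index) == 16);
```

The same expressions work inside ordinary `#[test]` functions with `assert_eq!`, which gives clearer failure messages at the cost of only running under `cargo test`.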

API signature changes

  • StepRecord::syscall() now takes &[SyscallWitness] parameter instead of returning from an internal Option
  • StepRecord::has_syscall() -> bool added
  • StepSource trait gains syscall_witnesses() -> &[SyscallWitness]
  • ShardContext gains syscall_witnesses: Arc<Vec<SyscallWitness>> field
  • keccak_step() test helper returns (StepRecord, Vec<Instruction>, Vec<SyscallWitness>)
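
A hedged sketch of what the new accessor shape implies for callers: the record holds only an index, and the witness slice is passed in at lookup time. `StepRef` and its constructors are hypothetical names, the `SyscallWitness` fields are assumptions, and the real method bodies may differ.

```rust
/// Stand-in for the real SyscallWitness.
#[derive(Clone, Debug, PartialEq)]
pub struct SyscallWitness {
    pub mem_ops: Vec<u32>,
}

/// Hypothetical record reduced to the syscall index.
#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct StepRef {
    syscall_index: u32,
}

impl StepRef {
    pub fn with_syscall(index: u32) -> Self {
        Self { syscall_index: index }
    }

    pub fn without_syscall() -> Self {
        // u32::MAX is the "no syscall" sentinel from the PR description.
        Self { syscall_index: u32::MAX }
    }

    pub fn has_syscall(&self) -> bool {
        self.syscall_index != u32::MAX
    }

    /// Resolve the index against the caller-supplied witness slice.
    /// Returns None for the sentinel or for an out-of-range index.
    pub fn syscall<'a>(&self, witnesses: &'a [SyscallWitness]) -> Option<&'a SyscallWitness> {
        if !self.has_syscall() {
            return None;
        }
        witnesses.get(self.syscall_index as usize)
    }
}
```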

Callers updated (ceno_zkvm, ceno_host)

  • All ecall instruction assign_instances methods updated to read syscall witnesses from shard_ctx.syscall_witnesses
  • OpFixedRS const generic changed from usize to u8
  • ceno_host/tests/test_elf.rs: run() returns (Vec<StepRecord>, Vec<SyscallWitness>)
  • e2e.rs: generate_witness propagates syscall witnesses into ShardContext

@hero78119 hero78119 left a comment

A first quick pass with some perf-related questions.

self.vm.tracer().step_record(idx)
}

fn syscall_witnesses(&self) -> &[SyscallWitness] {

This is called heavily within the tracer; better to add #[inline(always)] to align with step_record above.

};
tracing::debug!("position_next_shard finish in {:?}", time.elapsed());
let shard_steps = step_iter.shard_steps();
shard_ctx.syscall_witnesses = Arc::new(step_iter.syscall_witnesses().to_vec());

@hero78119 hero78119 Mar 5, 2026

This to_vec() step is slightly expensive because it clones the vector. Can we follow shard_steps on line 1293 above and take only a slice, then retrieve entries with slice[index] where they are needed?

@Velaciela Velaciela mentioned this pull request Mar 9, 2026
5 tasks
2 participants