feat(cuda): batch FLUX execution CUDA kernel (T-005)#4
SuperInstance wants to merge 1 commit into main
Conversation
Implements CUDA kernels for batch FLUX bytecode execution with full 37-opcode ISA support. Each CUDA thread runs one FLUX program in parallel.

Files:
- cuda/batch_kernel.cu: CUDA kernel + host executor with CPU fallback
- cuda/batch_executor.cuh: Host-side API (init, run, destroy, pack)
- cuda/batch_gpu.go: CGo GPU bindings (build tag: cuda)
- cuda/batch_cpu.go: Pure Go CPU fallback (build tag: !cuda)
- cuda/batch.go: Common types, error codes, BatchResult
- cuda/batch_test.go: 40+ tests covering all opcodes + batch execution

Key features:
- Packed bytecode layout with offset/length tables
- 16 GP registers, 256-entry stack per thread
- Constant-memory opcode format table for fast dispatch
- Error codes: div-by-zero, stack overflow/underflow, max cycles
- Multi-block grid for >256 programs
- Persistent GPU memory with stream-based async execution

Bug fix: JZ/JNZ/JLT/JGT now use int8 offsets to match the Go VM reference (pkg/flux/vm.go). The previous int16 offset caused incorrect branch targets for negative offsets (e.g., loop-back jumps).

Ref: T-005
```c
if ((size_t)num_programs > handle->programs_capacity ||
    total_bc_size > handle->programs_capacity) {
```
🔴 GPU buffer reallocation condition compares program count against bytecode size capacity, causing out-of-bounds GPU writes
The reallocation guard at cuda/batch_kernel.cu:794 compares num_programs against handle->programs_capacity, but programs_capacity stores the total bytecode size (set at line 818), not the program count. This means if a second batch has more programs but smaller total bytecode than the first batch, the condition evaluates to false and no reallocation occurs — even though the per-program buffers (d_results, d_errors, d_cycles, d_offsets, d_lengths) are too small.
For example: first call with 10 programs and 10,000 bytes of bytecode sets programs_capacity = 10000. A second call with 500 programs and 2,000 bytes: both 500 > 10000 and 2000 > 10000 are false, so buffers allocated for 10 programs are reused. The kernel then writes results[0..499] into a buffer allocated for only 10 entries — corrupting GPU memory.
The fix should use programs_capacity_count for the count check. The condition should be:

```c
if (total_bc_size > handle->programs_capacity ||
    num_programs > handle->programs_capacity_count) {
```
```go
// execute runs a FLUX bytecode program.
func (vm *cpuFluxVM) execute(bc []byte) int32 {
	for !vm.halted && vm.pc < len(bc) && vm.cycles < fluxMaxCycles {
```
🟡 CPU fallback ignores custom MaxCycles from BatchConfig
The execute() method at cuda/batch_cpu.go:93 uses the hardcoded constant fluxMaxCycles (1,000,000) for the cycle limit instead of reading from the BatchExecutor.maxCycles field. The maxCycles field is properly stored by NewBatchExecutorWithConfig (cuda/batch_cpu.go:384) but never passed to cpuFluxVM or used in execute. This means calling NewBatchExecutorWithConfig(BatchConfig{MaxCycles: 100}) has no effect on cycle limits in the CPU fallback path.
Prompt for agents
The cpuFluxVM.execute() method at batch_cpu.go:93 uses the constant fluxMaxCycles instead of the BatchExecutor.maxCycles field. To fix this, the maxCycles value needs to flow from BatchExecutor.Run() into the VM execution. Options:
1. Add a maxCycles parameter to cpuFluxVM.execute(), and pass e.maxCycles from BatchExecutor.Run() (around line 410).
2. Or set a maxCycles field on cpuFluxVM before calling execute.
The key change is in Run() at line 409-410: after creating the VM, pass the executor's maxCycles to it so the cycle limit in the execute() loop (line 93) respects the configured value.
Implements CUDA kernels for batch FLUX bytecode execution.
Summary
Builds a complete CUDA batch execution engine for FLUX bytecode programs. Each CUDA thread runs one FLUX program in parallel, enabling execution of 1000+ programs simultaneously on NVIDIA GPUs (target: Jetson Super Orin Nano).
Files Added
- cuda/batch_kernel.cu: CUDA kernel + host executor with CPU fallback
- cuda/batch_executor.cuh: Host-side API (init, run, destroy, pack)
- cuda/batch_gpu.go: CGo GPU bindings (build tag: cuda)
- cuda/batch_cpu.go: Pure Go CPU fallback (build tag: !cuda)
- cuda/batch.go: Common types, error codes, BatchResult
- cuda/batch_test.go: 40+ tests covering all opcodes + batch execution

Key Features
- Packed bytecode layout with offset/length tables
- 16 GP registers, 256-entry stack per thread
- Constant-memory opcode format table for fast dispatch
- Error codes: div-by-zero, stack overflow/underflow, max cycles
- Multi-block grid for >256 programs
- Persistent GPU memory with stream-based async execution
Bug Fix
JZ/JNZ/JLT/JGT now correctly use int8 offsets (matching the Go VM reference, pkg/flux/vm.go). The previous implementation used int16 offsets, which caused incorrect branch targets for negative offsets (e.g., loop-back jumps in Fibonacci and factorial programs would jump past program boundaries).

Test Count
40+ tests (including benchmarks), covering all opcodes and batch execution.
Ref: T-005