
feat(cuda): batch FLUX execution CUDA kernel (T-005)#4

Open
SuperInstance wants to merge 1 commit into main from greenhorn/T-005

Conversation


@SuperInstance SuperInstance commented Apr 13, 2026

Implements CUDA kernels for batch FLUX bytecode execution.

Summary

Builds a complete CUDA batch execution engine for FLUX bytecode programs. Each CUDA thread runs one FLUX program in parallel, enabling execution of 1000+ programs simultaneously on NVIDIA GPUs (target: Jetson Super Orin Nano).
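The one-thread-per-program model can be pictured as a plain fetch-decode-execute loop, which is what each CUDA thread (and the pure Go CPU fallback) runs over its slice of the packed bytecode. The sketch below is illustrative only: the opcode values and `runOne` helper are invented for the example, not the real encoding in `cuda/batch_kernel.cu` or `pkg/flux/vm.go`.

```go
package main

import "fmt"

// Hypothetical opcode values for illustration only.
const (
	opMOVI = 0x10 // MOVI reg, imm8
	opADD  = 0x01 // ADD dst, src
	opHALT = 0xFF
)

// runOne mirrors the per-thread execution model: one program, 16 GP
// registers, a cycle cap, and a fetch-decode-execute loop.
func runOne(bc []byte, maxCycles int) (int32, error) {
	var regs [16]int32
	pc, cycles := 0, 0
	for pc < len(bc) && cycles < maxCycles {
		cycles++
		switch op := bc[pc]; op {
		case opMOVI:
			regs[bc[pc+1]] = int32(bc[pc+2])
			pc += 3
		case opADD:
			regs[bc[pc+1]] += regs[bc[pc+2]]
			pc += 3
		case opHALT:
			return regs[0], nil
		default:
			return 0, fmt.Errorf("bad opcode %#x at pc=%d", op, pc)
		}
	}
	return regs[0], nil
}

func main() {
	// r0 = 2; r1 = 3; r0 += r1; halt
	bc := []byte{opMOVI, 0, 2, opMOVI, 1, 3, opADD, 0, 1, opHALT}
	r, err := runOne(bc, 1000)
	fmt.Println(r, err) // 5 <nil>
}
```

On the GPU the same loop body is indexed by thread ID into the batch's offset/length tables, so 1000+ such loops run concurrently.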

Files Added

| File | Description |
| --- | --- |
| `cuda/batch_kernel.cu` | CUDA kernel (37 opcodes) + host executor with CPU fallback |
| `cuda/batch_executor.cuh` | Host-side API header (init, run, destroy, pack) |
| `cuda/batch_gpu.go` | CGo GPU bindings (build tag: `cuda`) |
| `cuda/batch_cpu.go` | Pure Go CPU fallback (build tag: `!cuda`) |
| `cuda/batch.go` | Common types, error codes, `BatchResult` |
| `cuda/batch_test.go` | 40+ tests covering all opcodes + batch execution |

Key Features

  • 37 opcodes: ADD, SUB, MUL, DIV, MOD, AND, OR, XOR, SHL, SHR, MIN, MAX, CMP_EQ/LT/GT/NE, MOV, MOVI, MOVI16, ADDI, SUBI, INC, DEC, NOT, NEG, PUSH, POP, RET, CALL, JMP, JZ, JNZ, JLT, JGT, LOOP, NOP, HALT, STRIPCONF
  • Packed bytecode layout with offset/length tables for efficient GPU memory access
  • 16 GP registers, 256-entry stack per thread (optimized for GPU occupancy)
  • Constant memory opcode format table for fast dispatch
  • Multi-block grid for batches exceeding 256 programs
  • Persistent GPU memory with stream-based async execution
  • Error reporting: div-by-zero, stack overflow/underflow, max cycles, bad register, A2A stub
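The packed bytecode layout in the first bullet can be sketched as a simple flattening pass: concatenate all programs into one contiguous buffer and record per-program offset/length tables, so thread *i* on the GPU can index its own slice with a single coalesced lookup. The `packPrograms` name and signature below are illustrative, not the real host API in `cuda/batch_executor.cuh`.

```go
package main

import "fmt"

// packPrograms flattens a batch of FLUX programs into one contiguous
// buffer plus per-program offset/length tables — the layout the kernel
// indexes by thread ID. Names here are illustrative stand-ins.
func packPrograms(programs [][]byte) (packed []byte, offsets, lengths []int32) {
	for _, p := range programs {
		offsets = append(offsets, int32(len(packed)))
		lengths = append(lengths, int32(len(p)))
		packed = append(packed, p...)
	}
	return
}

func main() {
	progs := [][]byte{
		{0xFF},                   // 1-byte program
		{0x01, 0x02, 0xFF},       // 3-byte program
		{0x10, 0x00, 0x2A, 0xFF}, // 4-byte program
	}
	packed, offs, lens := packPrograms(progs)
	// Thread i reads packed[offs[i] : offs[i]+lens[i]].
	fmt.Println(len(packed), offs, lens) // 8 [0 1 4] [1 3 4]
}
```

Three host-to-device copies (buffer, offsets, lengths) then suffice for the whole batch, regardless of batch size.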

Bug Fix

JZ/JNZ/JLT/JGT now correctly use int8 offsets (matching Go VM reference pkg/flux/vm.go). Previous implementation used int16 offsets, which caused incorrect branch targets for negative offsets (e.g., loop-back jumps in Fibonacci and factorial programs would jump past program boundaries).
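The failure mode is a sign-extension issue: the same offset byte yields a correct backward jump when decoded as int8, but a large forward jump past the program end when treated as unsigned. A minimal Go illustration (the `branchTarget` helper is invented for the example):

```go
package main

import "fmt"

// branchTarget decodes a one-byte relative branch offset as a signed
// int8, matching the Go VM reference semantics.
func branchTarget(pc int, offsetByte byte) int {
	return pc + int(int8(offsetByte)) // sign-extend: 0xFB → -5
}

func main() {
	// A loop-back jump of -5 encoded as the byte 0xFB, taken at pc=10:
	fmt.Println(branchTarget(10, 0xFB)) // 5  (correct loop-back target)
	fmt.Println(10 + int(0xFB))         // 261 (naive unsigned decode: past program end)
}
```

This is exactly why the Fibonacci and factorial loop-back jumps escaped the program boundary under the old decoding.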

Test Count

40 tests (including benchmarks), covering:

  • All arithmetic operations (ADD, SUB, MUL, DIV, MOD)
  • All bitwise operations (AND, OR, XOR, SHL, SHR)
  • All comparison operations (CMP_EQ, CMP_LT, CMP_GT, CMP_NE)
  • All branch operations (JZ, JNZ, JLT, JGT, JMP, LOOP, CALL/RET)
  • Stack operations (PUSH, POP)
  • Immediate loads (MOVI, MOVI16, ADDI, SUBI)
  • Data movement (MOV)
  • Batch execution (5, 300, 1000 programs)
  • Mixed program batches
  • Edge cases (div-by-zero, empty batch, NOP cycles)
  • Error string formatting

Ref: T-005



Implements CUDA kernels for batch FLUX bytecode execution with full
37-opcode ISA support. Each CUDA thread runs one FLUX program in parallel.

Files:
- cuda/batch_kernel.cu: CUDA kernel + host executor with CPU fallback
- cuda/batch_executor.cuh: Host-side API (init, run, destroy, pack)
- cuda/batch_gpu.go: CGo GPU bindings (build tag: cuda)
- cuda/batch_cpu.go: Pure Go CPU fallback (build tag: !cuda)
- cuda/batch.go: Common types, error codes, BatchResult
- cuda/batch_test.go: 40+ tests covering all opcodes + batch execution

Key features:
- Packed bytecode layout with offset/length tables
- 16 GP registers, 256-entry stack per thread
- Constant memory opcode format table for fast dispatch
- Error codes: div-by-zero, stack overflow/underflow, max cycles
- Multi-block grid for >256 programs
- Persistent GPU memory with stream-based async execution

Bug fix: JZ/JNZ/JLT/JGT now use int8 offsets to match the Go VM
reference (pkg/flux/vm.go). The previous int16 offsets caused incorrect
branch targets for negative offsets (e.g., loop-back jumps).

Ref: T-005

@beta-devin-ai-integration (bot) left a comment


Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.


Comment thread: cuda/batch_kernel.cu (lines +794 to +795)

```c
if ((size_t)num_programs > handle->programs_capacity ||
    total_bc_size > handle->programs_capacity) {
```

🔴 GPU buffer reallocation condition compares program count against bytecode size capacity, causing out-of-bounds GPU writes

The reallocation guard at cuda/batch_kernel.cu:794 compares num_programs against handle->programs_capacity, but programs_capacity stores the total bytecode size (set at line 818), not the program count. This means if a second batch has more programs but smaller total bytecode than the first batch, the condition evaluates to false and no reallocation occurs — even though the per-program buffers (d_results, d_errors, d_cycles, d_offsets, d_lengths) are too small.

For example: first call with 10 programs and 10,000 bytes of bytecode sets programs_capacity = 10000. A second call with 500 programs and 2,000 bytes: both 500 > 10000 and 2000 > 10000 are false, so buffers allocated for 10 programs are reused. The kernel then writes results[0..499] into a buffer allocated for only 10 entries — corrupting GPU memory.

The fix should use `programs_capacity_count` for the count check. The condition should be:

```c
if (total_bc_size > handle->programs_capacity ||
    num_programs > handle->programs_capacity_count) {
```

Suggested change:

```diff
-if ((size_t)num_programs > handle->programs_capacity ||
-    total_bc_size > handle->programs_capacity) {
+if (total_bc_size > handle->programs_capacity ||
+    num_programs > handle->programs_capacity_count) {
```
Comment thread: cuda/batch_cpu.go

```go
// execute runs a FLUX bytecode program.
func (vm *cpuFluxVM) execute(bc []byte) int32 {
	for !vm.halted && vm.pc < len(bc) && vm.cycles < fluxMaxCycles {
```

🟡 CPU fallback ignores custom MaxCycles from BatchConfig

The execute() method at cuda/batch_cpu.go:93 uses the hardcoded constant fluxMaxCycles (1,000,000) for the cycle limit instead of reading from the BatchExecutor.maxCycles field. The maxCycles field is properly stored by NewBatchExecutorWithConfig (cuda/batch_cpu.go:384) but never passed to cpuFluxVM or used in execute. This means calling NewBatchExecutorWithConfig(BatchConfig{MaxCycles: 100}) has no effect on cycle limits in the CPU fallback path.

Prompt for agents
The cpuFluxVM.execute() method at batch_cpu.go:93 uses the constant fluxMaxCycles instead of the BatchExecutor.maxCycles field. To fix this, the maxCycles value needs to flow from BatchExecutor.Run() into the VM execution. Options:

1. Add a maxCycles parameter to cpuFluxVM.execute(), and pass e.maxCycles from BatchExecutor.Run() (around line 410).
2. Or set a maxCycles field on cpuFluxVM before calling execute.

The key change is in Run() at line 409-410: after creating the VM, pass the executor's maxCycles to it so the cycle limit in the execute() loop (line 93) respects the configured value.
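Option 2 above can be sketched as follows. This is a minimal illustration of the suggested data flow, with simplified stand-in types (the real `cpuFluxVM` and `BatchExecutor` in cuda/batch_cpu.go carry more state); only the `maxCycles` plumbing is the point.

```go
package main

import "fmt"

// cpuFluxVM carries its own cycle limit instead of reading a
// package-level constant (simplified stand-in for the real type).
type cpuFluxVM struct {
	pc, cycles int
	halted     bool
	maxCycles  int
}

// execute loops until halt, end of bytecode, or the configured limit.
// Real opcode dispatch is elided: each byte costs one cycle here.
func (vm *cpuFluxVM) execute(bc []byte) int {
	for !vm.halted && vm.pc < len(bc) && vm.cycles < vm.maxCycles {
		vm.cycles++
		vm.pc++
	}
	return vm.cycles
}

type BatchExecutor struct{ maxCycles int }

// Run propagates the executor's configured limit into the VM,
// so BatchConfig{MaxCycles: N} actually takes effect on the CPU path.
func (e *BatchExecutor) Run(bc []byte) int {
	vm := &cpuFluxVM{maxCycles: e.maxCycles}
	return vm.execute(bc)
}

func main() {
	e := &BatchExecutor{maxCycles: 100}
	fmt.Println(e.Run(make([]byte, 1000))) // 100: stops at the configured limit
}
```

With this shape, the hardcoded `fluxMaxCycles` would only serve as the default when no `BatchConfig` is supplied.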
