
feat(cuda): batch FLUX execution CUDA kernel (T-005)#4

Open
SuperInstance wants to merge 1 commit into main from greenhorn/T-005

Conversation


@SuperInstance SuperInstance commented Apr 13, 2026

Implements CUDA kernels for batch FLUX bytecode execution.

Summary

Builds a complete CUDA batch execution engine for FLUX bytecode programs. Each CUDA thread runs one FLUX program in parallel, enabling execution of 1000+ programs simultaneously on NVIDIA GPUs (target: Jetson Super Orin Nano).
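The one-thread-per-program model can be pictured as a plain fetch-decode-execute loop, which is what each CUDA thread (and the pure Go CPU fallback) runs over its slice of the packed bytecode. The sketch below is illustrative only: the opcode values and `runOne` helper are invented for the example, not the real encoding in `cuda/batch_kernel.cu` or `pkg/flux/vm.go`.

```go
package main

import "fmt"

// Hypothetical opcode values for illustration only.
const (
	opMOVI = 0x10 // MOVI reg, imm8
	opADD  = 0x01 // ADD dst, src
	opHALT = 0xFF
)

// runOne mirrors the per-thread execution model: one program, 16 GP
// registers, a cycle cap, and a fetch-decode-execute loop.
func runOne(bc []byte, maxCycles int) (int32, error) {
	var regs [16]int32
	pc, cycles := 0, 0
	for pc < len(bc) && cycles < maxCycles {
		cycles++
		switch op := bc[pc]; op {
		case opMOVI:
			regs[bc[pc+1]] = int32(bc[pc+2])
			pc += 3
		case opADD:
			regs[bc[pc+1]] += regs[bc[pc+2]]
			pc += 3
		case opHALT:
			return regs[0], nil
		default:
			return 0, fmt.Errorf("bad opcode %#x at pc=%d", op, pc)
		}
	}
	return regs[0], nil
}

func main() {
	// r0 = 2; r1 = 3; r0 += r1; halt
	bc := []byte{opMOVI, 0, 2, opMOVI, 1, 3, opADD, 0, 1, opHALT}
	r, err := runOne(bc, 1000)
	fmt.Println(r, err) // 5 <nil>
}
```

On the GPU the same loop body is indexed by thread ID into the batch's offset/length tables, so 1000+ such loops run concurrently.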

Files Added

| File | Description |
| --- | --- |
| `cuda/batch_kernel.cu` | CUDA kernel (37 opcodes) + host executor with CPU fallback |
| `cuda/batch_executor.cuh` | Host-side API header (init, run, destroy, pack) |
| `cuda/batch_gpu.go` | CGo GPU bindings (build tag: `cuda`) |
| `cuda/batch_cpu.go` | Pure Go CPU fallback (build tag: `!cuda`) |
| `cuda/batch.go` | Common types, error codes, `BatchResult` |
| `cuda/batch_test.go` | 40+ tests covering all opcodes + batch execution |

Key Features

  • 37 opcodes: ADD, SUB, MUL, DIV, MOD, AND, OR, XOR, SHL, SHR, MIN, MAX, CMP_EQ/LT/GT/NE, MOV, MOVI, MOVI16, ADDI, SUBI, INC, DEC, NOT, NEG, PUSH, POP, RET, CALL, JMP, JZ, JNZ, JLT, JGT, LOOP, NOP, HALT, STRIPCONF
  • Packed bytecode layout with offset/length tables for efficient GPU memory access
  • 16 GP registers, 256-entry stack per thread (optimized for GPU occupancy)
  • Constant memory opcode format table for fast dispatch
  • Multi-block grid for batches exceeding 256 programs
  • Persistent GPU memory with stream-based async execution
  • Error reporting: div-by-zero, stack overflow/underflow, max cycles, bad register, A2A stub
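The packed bytecode layout in the first bullet can be sketched as a simple flattening pass: concatenate all programs into one contiguous buffer and record per-program offset/length tables, so thread *i* on the GPU can index its own slice with a single coalesced lookup. The `packPrograms` name and signature below are illustrative, not the real host API in `cuda/batch_executor.cuh`.

```go
package main

import "fmt"

// packPrograms flattens a batch of FLUX programs into one contiguous
// buffer plus per-program offset/length tables — the layout the kernel
// indexes by thread ID. Names here are illustrative stand-ins.
func packPrograms(programs [][]byte) (packed []byte, offsets, lengths []int32) {
	for _, p := range programs {
		offsets = append(offsets, int32(len(packed)))
		lengths = append(lengths, int32(len(p)))
		packed = append(packed, p...)
	}
	return
}

func main() {
	progs := [][]byte{
		{0xFF},                   // 1-byte program
		{0x01, 0x02, 0xFF},       // 3-byte program
		{0x10, 0x00, 0x2A, 0xFF}, // 4-byte program
	}
	packed, offs, lens := packPrograms(progs)
	// Thread i reads packed[offs[i] : offs[i]+lens[i]].
	fmt.Println(len(packed), offs, lens) // 8 [0 1 4] [1 3 4]
}
```

Three host-to-device copies (buffer, offsets, lengths) then suffice for the whole batch, regardless of batch size.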

Bug Fix

JZ/JNZ/JLT/JGT now correctly use int8 offsets (matching Go VM reference pkg/flux/vm.go). Previous implementation used int16 offsets, which caused incorrect branch targets for negative offsets (e.g., loop-back jumps in Fibonacci and factorial programs would jump past program boundaries).
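The failure mode is a sign-extension issue: the same offset byte yields a correct backward jump when decoded as int8, but a large forward jump past the program end when treated as unsigned. A minimal Go illustration (the `branchTarget` helper is invented for the example):

```go
package main

import "fmt"

// branchTarget decodes a one-byte relative branch offset as a signed
// int8, matching the Go VM reference semantics.
func branchTarget(pc int, offsetByte byte) int {
	return pc + int(int8(offsetByte)) // sign-extend: 0xFB → -5
}

func main() {
	// A loop-back jump of -5 encoded as the byte 0xFB, taken at pc=10:
	fmt.Println(branchTarget(10, 0xFB)) // 5  (correct loop-back target)
	fmt.Println(10 + int(0xFB))         // 261 (naive unsigned decode: past program end)
}
```

This is exactly why the Fibonacci and factorial loop-back jumps escaped the program boundary under the old decoding.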

Test Count

40 tests (including benchmarks), covering:

  • All arithmetic operations (ADD, SUB, MUL, DIV, MOD)
  • All bitwise operations (AND, OR, XOR, SHL, SHR)
  • All comparison operations (CMP_EQ, CMP_LT, CMP_GT, CMP_NE)
  • All branch operations (JZ, JNZ, JLT, JGT, JMP, LOOP, CALL/RET)
  • Stack operations (PUSH, POP)
  • Immediate loads (MOVI, MOVI16, ADDI, SUBI)
  • Data movement (MOV)
  • Batch execution (5, 300, 1000 programs)
  • Mixed program batches
  • Edge cases (div-by-zero, empty batch, NOP cycles)
  • Error string formatting

Ref: T-005



Implements CUDA kernels for batch FLUX bytecode execution with full
37-opcode ISA support. Each CUDA thread runs one FLUX program in parallel.

Files:
- cuda/batch_kernel.cu: CUDA kernel + host executor with CPU fallback
- cuda/batch_executor.cuh: Host-side API (init, run, destroy, pack)
- cuda/batch_gpu.go: CGo GPU bindings (build tag: cuda)
- cuda/batch_cpu.go: Pure Go CPU fallback (build tag: !cuda)
- cuda/batch.go: Common types, error codes, BatchResult
- cuda/batch_test.go: 40+ tests covering all opcodes + batch execution

Key features:
- Packed bytecode layout with offset/length tables
- 16 GP registers, 256-entry stack per thread
- Constant memory opcode format table for fast dispatch
- Error codes: div-by-zero, stack overflow/underflow, max cycles
- Multi-block grid for >256 programs
- Persistent GPU memory with stream-based async execution

Bug fix: JZ/JNZ/JLT/JGT now use int8 offsets to match the Go VM
reference (pkg/flux/vm.go). The previous int16 offsets caused incorrect
branch targets for negative offsets (e.g., loop-back jumps).

Ref: T-005

@beta-devin-ai-integration (bot) left a comment


Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.


Comment thread: cuda/batch_kernel.cu (lines +794 to +795)

```c
if ((size_t)num_programs > handle->programs_capacity ||
    total_bc_size > handle->programs_capacity) {
```

🔴 GPU buffer reallocation condition compares program count against bytecode size capacity, causing out-of-bounds GPU writes

The reallocation guard at cuda/batch_kernel.cu:794 compares num_programs against handle->programs_capacity, but programs_capacity stores the total bytecode size (set at line 818), not the program count. This means if a second batch has more programs but smaller total bytecode than the first batch, the condition evaluates to false and no reallocation occurs — even though the per-program buffers (d_results, d_errors, d_cycles, d_offsets, d_lengths) are too small.

For example: first call with 10 programs and 10,000 bytes of bytecode sets programs_capacity = 10000. A second call with 500 programs and 2,000 bytes: both 500 > 10000 and 2000 > 10000 are false, so buffers allocated for 10 programs are reused. The kernel then writes results[0..499] into a buffer allocated for only 10 entries — corrupting GPU memory.

The fix should use `programs_capacity_count` for the count check. The condition should be:

```c
if (total_bc_size > handle->programs_capacity ||
    num_programs > handle->programs_capacity_count) {
```

Suggested change:

```diff
-if ((size_t)num_programs > handle->programs_capacity ||
-    total_bc_size > handle->programs_capacity) {
+if (total_bc_size > handle->programs_capacity ||
+    num_programs > handle->programs_capacity_count) {
```
Comment thread: cuda/batch_cpu.go

```go
// execute runs a FLUX bytecode program.
func (vm *cpuFluxVM) execute(bc []byte) int32 {
	for !vm.halted && vm.pc < len(bc) && vm.cycles < fluxMaxCycles {
```

🟡 CPU fallback ignores custom MaxCycles from BatchConfig

The execute() method at cuda/batch_cpu.go:93 uses the hardcoded constant fluxMaxCycles (1,000,000) for the cycle limit instead of reading from the BatchExecutor.maxCycles field. The maxCycles field is properly stored by NewBatchExecutorWithConfig (cuda/batch_cpu.go:384) but never passed to cpuFluxVM or used in execute. This means calling NewBatchExecutorWithConfig(BatchConfig{MaxCycles: 100}) has no effect on cycle limits in the CPU fallback path.

Prompt for agents
The cpuFluxVM.execute() method at batch_cpu.go:93 uses the constant fluxMaxCycles instead of the BatchExecutor.maxCycles field. To fix this, the maxCycles value needs to flow from BatchExecutor.Run() into the VM execution. Options:

1. Add a maxCycles parameter to cpuFluxVM.execute(), and pass e.maxCycles from BatchExecutor.Run() (around line 410).
2. Or set a maxCycles field on cpuFluxVM before calling execute.

The key change is in Run() at line 409-410: after creating the VM, pass the executor's maxCycles to it so the cycle limit in the execute() loop (line 93) respects the configured value.
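Option 2 above can be sketched as follows. This is a minimal illustration of the suggested data flow, with simplified stand-in types (the real `cpuFluxVM` and `BatchExecutor` in cuda/batch_cpu.go carry more state); only the `maxCycles` plumbing is the point.

```go
package main

import "fmt"

// cpuFluxVM carries its own cycle limit instead of reading a
// package-level constant (simplified stand-in for the real type).
type cpuFluxVM struct {
	pc, cycles int
	halted     bool
	maxCycles  int
}

// execute loops until halt, end of bytecode, or the configured limit.
// Real opcode dispatch is elided: each byte costs one cycle here.
func (vm *cpuFluxVM) execute(bc []byte) int {
	for !vm.halted && vm.pc < len(bc) && vm.cycles < vm.maxCycles {
		vm.cycles++
		vm.pc++
	}
	return vm.cycles
}

type BatchExecutor struct{ maxCycles int }

// Run propagates the executor's configured limit into the VM,
// so BatchConfig{MaxCycles: N} actually takes effect on the CPU path.
func (e *BatchExecutor) Run(bc []byte) int {
	vm := &cpuFluxVM{maxCycles: e.maxCycles}
	return vm.execute(bc)
}

func main() {
	e := &BatchExecutor{maxCycles: 100}
	fmt.Println(e.Run(make([]byte, 1000))) // 100: stops at the configured limit
}
```

With this shape, the hardcoded `fluxMaxCycles` would only serve as the default when no `BatchConfig` is supplied.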
