
Rewrite local_task executor and queue as process-based cooperative scheduler.#23827

Draft
benvanik wants to merge 44 commits into users/benvanik/cpu-1 from users/benvanik/cpu-2

Conversation

@benvanik
Collaborator

Replaces the coordinator-based task executor and DAG-based queue with a process-based cooperative scheduling model. The old system routed every operation through 4 thread hops (wait -> issue -> dispatch shards -> retire), each involving coordinator mutex acquisition, futex wake syscalls, and context switches. The new system eliminates the coordinator entirely and reduces dispatch latency from ~300us to ~5us.

Architecture

Process model: The universal work unit is iree_task_process_t — a cooperative drainable entity with a drain function, suspend count, worker budget, and dependent list. Processes replace the previous task DAG (NOP, CALL, BARRIER, FENCE, DISPATCH, DISPATCH_SHARD types) with a single abstraction.
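
To make the single-abstraction claim concrete, here is a minimal sketch of what such a process work unit could look like. The field and function names are hypothetical illustrations, not the actual iree_task_process_t definition:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Illustrative sketch of a cooperative "process" work unit; names are
// hypothetical and do not match the real iree_task_process_t layout.
typedef struct example_process_t example_process_t;

// Drain incrementally: return true when the process has fully completed.
typedef bool (*example_drain_fn_t)(example_process_t* process,
                                   uint32_t worker_index);

struct example_process_t {
  example_drain_fn_t drain;       // cooperative drain entry point
  _Atomic int32_t suspend_count;  // >0 means not yet runnable
  int32_t worker_budget;          // max workers draining concurrently
  _Atomic(example_process_t*) dependents;  // intrusive dependent list
  void* user_data;                // per-process state for the drain fn
};

// Resuming a suspended process: when the count reaches zero it becomes
// schedulable with no additional thread hop.
static bool example_process_resume(example_process_t* p) {
  return atomic_fetch_sub(&p->suspend_count, 1) == 1;  // true: now runnable
}
```

Everything — queue control, command buffer execution, host calls — fits this one shape, which is what lets the executor treat them uniformly.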

Executor: Workers scan compute slots and an immediate list instead of going through a coordinator. The coordinator mutex, incoming ready slist, and work-stealing infrastructure are removed. Workers use a Dekker-style sleeping protocol with adaptive spin timeouts based on aggregate worker budget.

Block ISA: A compact bytecode representation for command buffer operations that workers cooperatively execute through a block processor. Supports direct and indirect command buffers, VM dispatch fallback (VMVX), reusable recordings, and deferred fixups for buffers not mappable at recording time.

Queue: Each device queue is a persistent budget-1 process that pops operations from an MPSC ready list. Command buffer recordings are delegated to a persistent budget-N compute process for multi-worker tile distribution. Two-phase completion (eager semaphore signaling + deferred resource release) ensures low-latency signal propagation without use-after-free.

Performance

Dispatch latency on a 96-core EPYC (NUMA-pinned):

  • Single worker wake: 5.1us (target was <10us)
  • Warm worker reuse: 4.9us
  • 4-worker cold wake: 27us (<30us target for non-dominated path)

local-task w=1 overhead vs local-sync: eliminated. It was previously +3.4% (~287M extra instructions, mostly spinning/contention) on Qwen3.5-4B decode; it is now noise-equivalent at -1% median. The fast-path empty check on slist pop/flush and the pump loop reorder (draining compute slots before the immediate list) are the key contributors.

For a more direct comparison:

  ┌─────────┬──────────────┬──────────────┬─────────────────┐
  │ Threads │ IREE level3  │ ik-llama.cpp │ stock llama.cpp │
  ├─────────┼──────────────┼──────────────┼─────────────────┤
  │ 1       │ 185.0 ms/tok │ 132.7 ms/tok │ 243.0 ms/tok    │
  ├─────────┼──────────────┼──────────────┼─────────────────┤
  │ 4       │ 83.7 ms/tok  │ 73.0 ms/tok  │ 160.4 ms/tok    │
  ├─────────┼──────────────┼──────────────┼─────────────────┤
  │ 16      │ 76.8 ms/tok  │ 101.8 ms/tok │ 142.4 ms/tok    │
  └─────────┴──────────────┴──────────────┴─────────────────┘

IREE now wins at t=16 — ik-llama.cpp degrades past 4 threads (73→102ms) while we keep scaling (84→77ms). Stock llama.cpp doesn't appear to use VNNI at all despite the flags. This holds despite IREE's slower single-core time (less optimized quantized matvec).

Correctness

  • Full HAL CTS passes under both ASAN and TSAN (7 test suites each)
  • Two TSAN races found and fixed:
    • Transient buffer commit was overwriting metadata fields on the worker thread while HAL submission validation read them synchronously. Fixed by not overwriting — the creation-time params are always a conservative-correct subset.
    • Compute process two-phase completion had an ordering gap where the eager completer's atomic exchange on the context raced with the deferred releaser's free. Fixed by consuming the result before the CLOSED fetch_or while the worker is still registered.
  • Native queue operations (alloca, dealloca, fill, copy, update, read, write, dispatch) with proper semaphore wait satisfaction, frontier tracking, and error propagation
  • Async proactor file I/O with synchronous pread/pwrite fallback when async import fails
  • 2+GB file read/write support via per-operation size capping and retry loops across all three proactor backends (io_uring, IOCP, POSIX)

What was removed

The entire old task system: coordinator mutex, task DAG types (NOP/CALL/BARRIER/FENCE/DISPATCH/SHARD), shard pool, post batch routing, worker mailboxes, work stealing, submission batching, and the old task command buffer. The old queue emulation path for fill/copy/update/dispatch is no longer used by local_task (still used by GPU drivers).

benvanik and others added 30 commits March 18, 2026 01:26
…eduler.

Process replaces the 6-type task DAG with a single type that workers drain
incrementally. Atomic suspend_count for zero-hop activation, first-error-wins
CAS on error_status, cache-line-padded struct layout. Cancel of SUSPENDED
processes resolves inline (dependents, completion callback, scope_fail without
scope_end). 24 tests under ASAN+TSAN covering lifecycle, dependent resolution,
scope integration, and concurrent safety.
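
The first-error-wins CAS mentioned above can be sketched as follows; the status type and names here are simplified stand-ins for illustration, not the real IREE API:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// First-error-wins: only the first failing worker installs its status;
// later errors lose the CAS and would be released by the caller.
// iree_status_t is modeled as an opaque pointer-sized code here.
typedef intptr_t example_status_t;
#define EXAMPLE_STATUS_OK ((example_status_t)0)

static bool example_try_set_error(_Atomic example_status_t* error_status,
                                  example_status_t new_error) {
  example_status_t expected = EXAMPLE_STATUS_OK;
  // Strong CAS: succeeds only if no error has been recorded yet.
  return atomic_compare_exchange_strong(error_status, &expected, new_error);
}
```

Because the CAS is the only mutation path, concurrent failing workers never clobber each other's status and no lock is needed.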

Co-Authored-By: Claude <noreply@anthropic.com>
…ekker sleeping protocol.

Processes are now self-contained (scope dependency removed — processes ARE
scopes). Workers pop from a lock-free MPSC immediate list and drain
cooperatively with a three-state schedule protocol (IDLE/QUEUED/DRAINING).

The sleeping protocol closes the race between "drain returned no work" and
"new work arrived while draining" using a Dekker-style pattern: the worker
stores schedule_state=IDLE (seq_cst) then loads needs_drain; the scheduler
stores needs_drain=1 then CAS(schedule_state). seq_cst on the IDLE store
provides the StoreLoad barrier required on ARM.
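
The two sides of that protocol can be sketched in C11 atomics; state names and helpers below are illustrative, not the real executor symbols:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { EX_IDLE = 0, EX_QUEUED = 1, EX_DRAINING = 2 };

typedef struct {
  _Atomic int schedule_state;
  _Atomic int needs_drain;
} example_worker_t;

// Worker side: publish IDLE with seq_cst (providing the StoreLoad barrier
// on ARM), then re-check needs_drain. Returns true if sleeping is safe.
static bool example_worker_try_sleep(example_worker_t* w) {
  atomic_store(&w->schedule_state, EX_IDLE);  // seq_cst by default
  if (atomic_load(&w->needs_drain)) {
    // Work raced in while we were draining: take it back, don't sleep.
    int expected = EX_IDLE;
    if (atomic_compare_exchange_strong(&w->schedule_state, &expected,
                                       EX_DRAINING)) {
      return false;  // resume draining
    }
  }
  return true;  // no pending work observed; parking is safe
}

// Scheduler side: publish the work first, then try to claim the worker.
// Returns true when the worker had gone IDLE and must be woken.
static bool example_schedule(example_worker_t* w) {
  atomic_store(&w->needs_drain, 1);  // seq_cst, ordered before the CAS
  int expected = EX_IDLE;
  return atomic_compare_exchange_strong(&w->schedule_state, &expected,
                                        EX_QUEUED);
}
```

Either the worker observes needs_drain before parking, or the scheduler's CAS observes IDLE and wakes it — the lost-wakeup window is closed in both interleavings.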

Cross-validated review (Codex+Gemini) findings addressed:
- seq_cst on IDLE store for Dekker StoreLoad barrier
- Assert worker_state_size==0 until compute slots land
- Assert !is_terminal in schedule_process entry
- Deterministic entered_sleep signal replacing fixed sleep_for in tests

11 executor_process_test cases including 3 stress tests (repeated sleep/wake
cycles, concurrent multi-thread sleep/wake, multi-stage dependency chains).
All 13 task tests pass ASAN and TSAN with zero warnings.

Co-Authored-By: Claude <noreply@anthropic.com>
…tions.

Replaces the per-command task DAG model with a compile-and-execute approach.
Recording compiles HAL API calls into a compact binary stream (.text) with
per-block mutable execution state (.data). Issuing initializes .data and
submits to the task executor for cooperative multi-worker execution.

Block builder: write-forward compiler with dual-cursor block layout (commands
forward, fixups backward) and automatic block splitting via BRANCH. All memory
from the block pool — no system allocator during recording.

Block processor: cooperative drain engine where workers claim tiles via atomic
CAS. Region transitions handled by elected completer with epoch-tagged
tile indices (no arrival barrier needed). Single-worker path is synchronous
with zero atomics.

Block ISA: 6-opcode command format (dispatch, fill, copy, barrier, branch,
return) with indirect/predicated/sequential flags.
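
A hypothetical encoding of such a command header is sketched below — the opcode values, flag bits, and struct layout are invented for illustration and differ from the real block ISA:

```c
#include <stdint.h>

// Illustrative 6-opcode command header for a compact block ISA; the
// actual IREE encoding differs.
typedef enum {
  EX_CMD_DISPATCH = 0,
  EX_CMD_FILL,
  EX_CMD_COPY,
  EX_CMD_BARRIER,
  EX_CMD_BRANCH,
  EX_CMD_RETURN,
} example_cmd_opcode_t;

enum {
  EX_CMD_FLAG_INDIRECT = 1u << 0,    // bindings resolved via binding table
  EX_CMD_FLAG_PREDICATED = 1u << 1,  // conditionally executed
  EX_CMD_FLAG_SEQUENTIAL = 1u << 2,  // single-worker ordered execution
};

typedef struct {
  uint8_t opcode;   // example_cmd_opcode_t
  uint8_t flags;    // EX_CMD_FLAG_* bits
  uint16_t length;  // total command size in bytes, including this header
} example_cmd_header_t;
```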

Co-Authored-By: Claude <noreply@anthropic.com>
…nment fixes.

Adds the HAL vtable implementation (block_command_buffer.c) that translates
command buffer API calls into the block ISA format via the builder, and
exposes dispatch_ptrs on local_executable for recording-time function
resolution.

Eliminates the 6KB stack-allocated fixups[256] array from the dispatch
recording path: append_cmd now returns a pointer directly into block
storage for in-place fixup population, with pop_cmd rollback on failure.

Fixes two UBSAN alignment bugs found during sanitizer testing:
- Tile reservation at block end is now rounded up to fixup alignment (8
  bytes) so the fixup table always starts at a properly aligned address.
- Processor context allocation uses iree_allocator_malloc_aligned for
  64-byte cache line alignment on the false-sharing-separated atomics.

Fixes signed integer overflow in the multi-worker tile stealing CAS loop:
int32_t counter/tile_count comparisons would silently drop all work for
dispatches with >2^31 tiles. Switched to uint32_t throughout.
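
A sketch of the unsigned claim loop (hypothetical names; the real stealing code differs) shows why the types matter — with int32_t, tile counts past 2^31 read as negative and the bounds check silently drops all work:

```c
#include <stdatomic.h>
#include <stdint.h>

// Workers claim contiguous batches of tiles with a CAS loop. Returns the
// number of tiles claimed (0 when the dispatch is exhausted) and writes
// the first claimed tile index to *out_base.
static uint32_t example_claim_tiles(_Atomic uint32_t* counter,
                                    uint32_t tile_count, uint32_t batch,
                                    uint32_t* out_base) {
  uint32_t current = atomic_load(counter);
  for (;;) {
    if (current >= tile_count) return 0;  // nothing left to claim
    uint32_t remaining = tile_count - current;
    uint32_t take = remaining < batch ? remaining : batch;
    if (atomic_compare_exchange_weak(counter, &current, current + take)) {
      *out_base = current;
      return take;
    }
    // CAS failure reloaded `current`; retry with the fresh value.
  }
}
```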

Co-Authored-By: Claude <noreply@anthropic.com>
Remove void* worker_state from the drain function signature — drain
functions now compute per-worker state from process->user_data and
worker_index, eliminating generic void* indirection and concurrent
reclamation complexity from the executor.

Add compute slots: a fixed-size array of atomic process pointers in the
executor for budget>1 processes. Workers scan these round-robin after
draining the immediate list, cooperatively executing bounded work from
each active process. Budget-1 processes continue using the immediate
list with the Dekker sleeping protocol. Slot lifecycle is CAS-gated:
schedule_process places via CAS(NULL→process), the completing worker
removes via CAS(process→NULL) ensuring exactly-once completion.
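
The CAS-gated slot lifecycle can be sketched as below; the slot count and type names are hypothetical simplifications of the executor's actual structures:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_SLOT_COUNT 16
typedef struct { int dummy; } example_process_t;
typedef struct {
  _Atomic(example_process_t*) slots[EXAMPLE_SLOT_COUNT];
} example_executor_t;

// schedule_process: place via CAS(NULL -> process); first empty slot wins.
static bool example_place(example_executor_t* ex, example_process_t* p) {
  for (size_t i = 0; i < EXAMPLE_SLOT_COUNT; ++i) {
    example_process_t* expected = NULL;
    if (atomic_compare_exchange_strong(&ex->slots[i], &expected, p)) {
      return true;
    }
  }
  return false;  // all slots busy (real code falls back to overflow)
}

// completion: remove via CAS(process -> NULL); exactly one caller
// succeeds, so completion work runs exactly once even when several
// workers race on the same finished process.
static bool example_release(example_executor_t* ex, size_t slot,
                            example_process_t* p) {
  example_process_t* expected = p;
  return atomic_compare_exchange_strong(&ex->slots[slot], &expected, NULL);
}
```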

Generalize wake_one_worker to wake_workers(count) so compute processes
can wake workers proportional to their budget.

Co-Authored-By: Claude <noreply@anthropic.com>
…aining.

Implements the issue side of the block command buffer, bridging recorded
command buffers to the process-based executor. The issue function allocates
a cache-line-aligned processor context, initializes an embedded process
with a drain adapter, and sets up an internal completion callback that
handles processor error consumption and context cleanup before chaining
to the caller's completion callback.

The drain adapter maps block processor drain results to process drain
results, with errors deferred to the completion callback (which runs
exactly once) to avoid races between error-consuming workers and the
completion CAS.

Co-Authored-By: Claude <noreply@anthropic.com>
The queue now uses a single persistent process that drains an MPSC ready
list instead of creating a 3-task chain (wait → issue → retire) per
submission. Operations are arena-allocated at submit time and flow through
semaphore waits into the ready list. The queue process pops operations and
handles them: command buffers are issued as separate compute processes via
block_command_buffer_issue; barriers and host calls execute inline.
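
A minimal intrusive MPSC ready list of the kind the queue process drains could look like this (illustrative names, including the empty fast-path check mentioned in the PR summary; the real iree_atomic_slist API differs):

```c
#include <stdatomic.h>
#include <stddef.h>

// Producers push lock-free onto a LIFO head; the single queue process
// flushes everything at once and reverses to FIFO order.
typedef struct example_op_t {
  struct example_op_t* next;
  int payload;
} example_op_t;

typedef struct {
  _Atomic(example_op_t*) head;
} example_ready_list_t;

static void example_push(example_ready_list_t* list, example_op_t* op) {
  example_op_t* head = atomic_load(&list->head);
  do {
    op->next = head;  // CAS failure reloads head; relink and retry
  } while (!atomic_compare_exchange_weak(&list->head, &head, op));
}

// Single consumer: detach the whole list in one exchange, then reverse.
static example_op_t* example_flush(example_ready_list_t* list) {
  // Fast-path empty check: a plain load avoids the atomic RMW when idle.
  if (atomic_load_explicit(&list->head, memory_order_relaxed) == NULL) {
    return NULL;
  }
  example_op_t* lifo = atomic_exchange(&list->head, NULL);
  example_op_t* fifo = NULL;
  while (lifo) {
    example_op_t* next = lifo->next;
    lifo->next = fifo;
    fifo = lifo;
    lifo = next;
  }
  return fifo;
}
```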

Cooperative shutdown: deinitialize sets a shutting_down flag and schedules
one final drain. The drain returns completed=true, triggering the process
completion callback which calls scope_end. scope_wait_idle then blocks
until the process has fully completed and no worker is touching device
resources.

Removes dead code: task_command_buffer.c/h (replaced by block command
buffer), task_queue_state.c/h (placeholder with only int reserved).

Also fixes:
- TSAN race in CTS queue_host_call_test.cc: std::thread* assignment
  after pthread_create is sequenced-after the sync point, so the
  happens-before chain from worker to main thread was broken. Fixed
  with std::atomic<std::thread*>.
- Upstream VectorizableOpInterface BUILD.bazel missing mlir:Support dep.

ASAN and TSAN clean on full local_task test suite.

Co-Authored-By: Claude <noreply@anthropic.com>
The queue-as-process rewrite (959d291d) replaced the old DAG task chain
with persistent budget-1 processes draining an MPSC ready list. This
removes all dead DAG infrastructure: task types, coordinator, submissions,
task pools, per-worker mailboxes, work stealing, task queues, benchmarks,
and associated tests.

Surviving surface: process executor with budget-1 immediate list,
budget>1 compute slots, Dekker sleeping/wake protocol, cooperative
multi-worker draining, and NUMA-aware topology.

Also improves cache line layout in worker and executor structs
(cross-thread-written fields separated from owner-only fields),
cleans up tuning.h to its two surviving constants, and adds missing
direct includes exposed by removing transitive deps.

Co-Authored-By: Claude <noreply@anthropic.com>
Introduces executor_benchmark.cc with BM_WakeAllWorkers exercising the
full schedule/drain/complete/release cycle at 1-32 worker counts.

Optimizes the worker pump loop to track is_active state locally,
eliminating 2N atomic RMWs per pump cycle on the shared idle mask
when workers are spinning with work.

Fixes Dekker protocol ordering: the scheduler's needs_drain store
used release (insufficient for StoreLoad on ARM), now uses seq_cst
to match the worker side. Fixes process initialization to use explicit
atomic stores for all _Atomic fields instead of relying on memset
(undefined behavior in C11). Moves dump_wake_state from public header
to executor_impl.h, removing stdio.h from the public API surface.

Co-Authored-By: Claude <noreply@anthropic.com>
Replaces the synchronous alloca/dealloca stubs in local_task with
queue-ordered operations. Alloca returns a transient buffer handle
immediately; the drain handler allocates real backing memory and commits
it via atomic store. Dealloca decommits the backing in the drain handler.
The transient buffer uses acquire/release atomics on the committed pointer
for TSAN visibility (semaphore ordering provides real happens-before but
TSAN cannot see through it).

Also fixes a bug in iree_hal_buffer_map_range where the error path after
commit_map_range failure called iree_hal_buffer_retain instead of
iree_hal_buffer_release, leaking one buffer ref on every failed scoped
mapping.

Co-Authored-By: Claude <noreply@anthropic.com>
Replace the uint64_t typedef for iree_task_affinity_set_t with a multi-word
struct controlled by IREE_TASK_TOPOLOGY_MAX_GROUP_COUNT in tuning.h (default
256). Each affinity set is now an array of 64-bit words, with per-worker
atomic operations remaining single-word lock-free via iree_task_affinity_bit_t.

Key changes:
- affinity_set.h: Complete rewrite with struct-based types, precomputed bit
  positions, and full API (construction, queries, mutation, bulk ops, atomics).
- tuning.h: IREE_TASK_TOPOLOGY_MAX_GROUP_COUNT as primary constant with
  derived IREE_TASK_EXECUTOR_MAX_WORKER_COUNT and IREE_TASK_AFFINITY_SET_WORD_COUNT.
- executor_impl.h: Add worker_idle_count for tracing (replaces popcount on
  multi-word mask), sharing desired_wake's cache line.
- worker.h/c: worker_bit becomes iree_task_affinity_bit_t, mark_active/idle
  use atomic set helpers, relay_wake uses find_first/clear_index iteration.
- topology*.c: Group masks widened to iree_task_affinity_set_t. Fixed latent
  >64-processor bugs in cpuinfo backend (cache_bits and sharing mask now use
  proper multi-word operations with bounds checking).
- executor_test.cc: Replace uint64_t worker_mask with per-worker bool array.
- executor_benchmark.cc: Extend arg ranges to 64, 128, 256 workers.

Per-worker atomic operations (mark_active/mark_idle) remain single-word
lock-free: each worker's iree_task_affinity_bit_t addresses exactly one
64-bit word. Scanning ops (find_first across words) are a small constant
factor (4 loads + 4 comparisons for 256 workers worst case).

HAL queue affinity (iree_hal_queue_affinity_t) stays at 64 bits -- queues
map to NUMA nodes, not individual threads.

Co-Authored-By: Claude <noreply@anthropic.com>
When all 16 compute slots were occupied and a budget>1 process was
scheduled, iree_task_executor_place_in_compute_slot hit IREE_ASSERT(false)
which is stripped in release builds — silently dropping the process and
hanging its dependents indefinitely.

Replace the assert with an overflow slist. Processes that cannot be
placed into a compute slot are pushed to executor->compute_overflow.
When a worker releases a compute slot (release_compute_process step 3),
it pops from the overflow list and CAS-es the process into the newly
freed slot. The existing re-wake logic (step 4) then wakes workers for
the promoted process.

The overflow path handles the race where a concurrent schedule_process
fills the just-freed slot before the overflow promotion: in that case,
the overflow process is placed into any other empty slot, or pushed
back to the overflow list if all slots are still full.

In practice the overflow list is almost always empty — 16 concurrent
budget>1 processes is far beyond typical usage — but this eliminates
the hard slot limit and guarantees no silent process drops regardless
of workload.

Co-Authored-By: Claude <noreply@anthropic.com>
Introduces a persistent budget-N compute process per queue that occupies
a single compute slot for the queue's lifetime. The budget-1 control
process fills recording items and pushes them to the compute pending
list; the compute process drains recordings cooperatively across all
workers via the block processor.

Per-recording two-phase completion ensures semaphores are signaled
eagerly (first worker to observe completion) while resources stay alive
until all workers have exited drain (last active_drainers decrement).
Pool-based recording items (4 pre-allocated) cycle between free_pool,
pending, current, and back to free_pool with tagged ABA prevention.

Simplifies block_command_buffer by removing the issue context API
entirely — CTS is now the test surface for command buffers. Deletes
block_command_buffer_test.cc and removes scope/executor parameters
from the create signature.

Fixes two shutdown bugs in the compute process lifecycle:

  The completion callback must NOT call scope_end: other workers may
  still be inside drain, and scope_wait_idle returning would let the
  main thread free the queue while workers access it (TSAN data races
  on slist mutex, shutting_down field, and device allocation). The
  process-level scope_end is deferred to the release callback, which
  fires only after the last slot drainer exits (active_drainers
  sentinel CAS).

  Recordings that completed eagerly but whose deferred release has not
  yet fired are invisible to list-based cleanup — they are not in
  compute_current, compute_pending, or the free pool. The release
  callback now scans all pool items directly, catching items in this
  limbo state and firing their per-operation scope_end.

Adds iree_atomic_slist_discard for O(1) list clearing when the entries
are managed externally (e.g., pool-scanned cleanup).

Co-Authored-By: Claude <noreply@anthropic.com>
Replaces the synchronous file_transfer.c-based queue_read/queue_write
(which allocated 64MB staging buffers, blocked the caller, and did two
data copies) with proper async I/O through the proactor system.

Two independent code paths based on file type:
- Memory files (storage_buffer available): route directly to queue_copy
  for single-copy (memcpy) transfer through the block command buffer
  pipeline.
- FD files (async_handle available): zero-copy proactor I/O — pread/
  pwrite directly into/from mapped HAL buffers via io_uring or POSIX
  async backends.

HAL infrastructure changes:
- Add iree_hal_file_async_handle() vtable method for retrieving the
  proactor-managed async file handle (NULL for memory files).
- Add iree_hal_file_validate_access() shared utility (extracted from
  file_transfer.c).
- Add proactor parameter to iree_hal_fd_file_from_handle() and
  iree_hal_file_from_handle() — fd files dup+import at construction
  time, creating immutable fully-bound file objects.
- All other drivers pass NULL proactor (no behavioral change).

Task queue changes:
- New READ/WRITE operation types with arena-allocated I/O context that
  bridges the drain→proactor callback async gap.
- Drain handlers map the buffer, submit proactor operations, and return
  immediately. Completion callbacks fire on the proactor poll thread,
  handle coherency (flush/invalidate), unmap, and signal semaphores.
- Extract iree_hal_task_queue_op_fail() helper to deduplicate the
  frontier-fail + op_destroy pattern across four call sites.

io_uring fix:
- Create rings with R_DISABLED when SINGLE_ISSUER is requested, then
  enable via REGISTER_ENABLE_RINGS on the poll thread's first poll().
  With DEFER_TASKRUN, the kernel pins the single issuer to the
  io_uring_setup caller — R_DISABLED defers this binding so the
  proactor pool can create the ring on one thread and poll from another.

Co-Authored-By: Claude <noreply@anthropic.com>
bd-310 (P0): Fix use-after-free in enqueue_waits when acquire_timepoint
fails mid-list. Previously, registered timepoint callbacks would access
freed arena memory after the caller destroyed the operation. Now records
the error in error_status and atomically subtracts the unregistered count
from wait_count, letting registered callbacks drain naturally and destroy
the operation on the last decrement.

bd-2t9 (P1): Fix failed host calls incorrectly advancing the frontier
tracker. The call_status error was consumed by semaphore_list_fail but
never transferred to the local status variable, causing the frontier to
advance instead of fail. Now propagates call_status to status.

bd-2io (P3): Relax file length validation. Read validation now skips
the check when file_length is 0 (non-seekable fds like pipes). Write
validation is removed entirely since the OS handles file extension.

Co-Authored-By: Claude <noreply@anthropic.com>
…ops.

Extract command-building logic from block_command_buffer.c into shared
block_command_ops.h/c that both command buffers and the queue call. The
queue builds single-command recordings at drain time via the block builder
and executes them through the same block processor used by command buffers.

Native queue operations replace the emulation shims (queue_emulation.c)
for local_task: fill/copy/update execute inline (single-worker), dispatch
supports both inline (ALLOW_INLINE_EXECUTION flag) and multi-worker paths
through the compute process.

Two-phase submit (submit_op_begin/finish) eliminates boilerplate across
submit functions. Drain handlers use scoped buffer mappings with explicit
unmap for inline execution; non-inline dispatch uses persistent mappings
(pointer must survive across threads until compute process completion).

Co-Authored-By: Claude <noreply@anthropic.com>
… execution.

Move iree_hal_cmd_block_processor_context_t from block_processor.c to
block_processor.h (non-opaque). Add context_initialize for single-worker
callers that provide their own context + state storage.

The inline execution path (drain_fill/copy/update) now stack-allocates the
context and arena-allocates the .data state from the operation's arena.
This eliminates the per-operation malloc_aligned + free_aligned pair that
was the last heap allocation on the native queue path. The .data typically
fits in the same 4KB block pool block that already holds the operation.

Co-Authored-By: Claude <noreply@anthropic.com>
… fix.

Three related changes to the block ISA command buffer system:

Indirect command buffer support (bd-8l7): Accept binding_capacity > 0 in
block_command_buffer_create. Record indirect fixups (host_ptr=NULL, slot-
based) when buffer_ref.buffer is NULL. Wire the HAL binding table through
drain_commands → drain_recording → context_allocate with SCOPED mappings
and proper unmap-before-signal ordering in op_destroy.

Fixup performance optimizations: Restructure iree_hal_cmd_fixup_t with a
three-way discrimination (indirect → direct inline → span) optimized for
the indirect fast path. Replace per-binding arena-allocated spans with
inline host pointers in the fixup struct, eliminating the CB's arena
entirely. Add resolve_refs batch API. Skip resource_set inserts for
indirect dispatch bindings. Store executable + export_ordinal on dispatch
commands for VM fallback and future profiling.

Compute process release race fix: Move schedule_state IDLE transition from
eager_complete to release_compute_process. The premature IDLE in eager
completion allowed schedule_process to reschedule the compute process into
a new slot before the release callback fired, causing overlapping releases
that double-freed the processor context.

Co-Authored-By: Claude <noreply@anthropic.com>
Add length field to iree_hal_cmd_fixup_t (24→32 bytes) so that
binding_lengths[] can be populated alongside binding_ptrs[] during
fixup resolution. Previously binding_lengths was always NULL, which
caused VMVX dispatch to SEGV when wrapping bindings as iree_vm_buffer_t.

Store executable pointer and export ordinal on the dispatch command
(replacing the environment pointer, which is derivable from the
executable). When function is NULL (VMVX, JIT, external executables),
the processor dispatches through iree_hal_local_executable_issue_call
instead of the direct function pointer.

All 20 CTS dispatch tests now pass (llvm_cpu + VMVX, direct + indirect).

Co-Authored-By: Claude <noreply@anthropic.com>
Remove the one-shot-only restriction from block_command_buffer_create.
The block ISA architecture already supports reusable CBs: the recording
is immutable .text shared across concurrent submissions, each submission
gets its own .data via a separate processor context, and the CB is
retained by each operation's resource_set for the submission's lifetime.

The only new check: reject ALLOW_INLINE_EXECUTION without ONE_SHOT, per
the HAL spec (inline execution is inherently single-use).

Co-Authored-By: Claude <noreply@anthropic.com>
…ming.

iree_hal_semaphore_list_fail transfers ownership of the status to the last
semaphore (no clone). The drain_host_call error path passed call_status to
semaphore_list_fail and then assigned the same (now-consumed) pointer to
the local status variable for frontier failure propagation. Clone the
status first so both paths have valid references.

Co-Authored-By: Claude <noreply@anthropic.com>
… flags.

Two bugs found by running the full CTS file_tests suite:

1. The drain_fill/copy/update/dispatch paths set host_ptr, length, and
   flags on fixup entries but left offset and slot uninitialized. Block
   pool memory from prior recordings contained stale values. The
   resolve_bindings function adds fixup->offset to host_ptr, producing
   a corrupt pointer that reads past the buffer allocation. Set offset=0
   and slot=0 explicitly in all four drain paths.

2. Memory file buffer import used the file's access flags (READ|WRITE)
   as the buffer's allowed access. The drain_copy target mapping
   requests DISCARD_WRITE (WRITE|DISCARD), which failed the access
   check because DISCARD was missing. Add DISCARD to the import's
   access flags since the host-backed memory supports all access modes.

Co-Authored-By: Claude <noreply@anthropic.com>
…spatch.

The block processor hardcoded workgroup_state.processor_id = 0 for all
workers. This caused all workers to index the same per-worker state slot
(worker_states[0]) in the VMVX module loader, racing on the shared
module state's workgroup fields.

Thread the actual worker_index through process_region and
execute_dispatch_tiles to workgroup_state.processor_id. For single-worker
execution, this is always 0 (correct — only one worker). For multi-worker,
each worker gets its own slot.

Co-Authored-By: Claude <noreply@anthropic.com>
Three fixes applied together:

1. CLOSED_BIT protocol: Replace three separate atomic fields (generation,
   active_drainers, release_pending) with a single 64-bit drainers field:
   {gen(32) | count+CLOSED(32)}. fetch_or(CLOSED_BIT) atomically closes
   and returns count — no TOCTOU between checking count and setting the
   flag. The 64-bit generation prevents ABA on recycled pool items.

2. Back-pressure: When the compute pool is empty, COMMANDS and DISPATCH
   operations are pushed back to the ready list and the budget-1 process
   yields. compute_item_release wakes the process when a slot is returned.
   Pool size increased from 4 to 16 for deeper pipelining before
   back-pressure kicks in. Eliminates RESOURCE_EXHAUSTED under TSAN.

3. Terminal process re-scheduling fix (worker.c): After a process
   completes and its slot is released, schedule_state was set to IDLE
   unconditionally. This allowed schedule_process to CAS(IDLE→DRAINING)
   and re-place the process in a slot, causing completion and release
   callbacks to fire twice — driving scope.pending_submissions negative
   and deadlocking scope_wait_idle. Fix: only transition to IDLE if the
   process is not terminal.

Verified: 1000/1000 ASAN runs pass, 7/7 TSAN CTS pass.
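
A sketch of the packed-field protocol from fix 1, with an illustrative bit layout (the real field packs a 32-bit generation in the high half; only the low-half count + CLOSED logic is shown being exercised here):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Illustrative layout: generation in the high 32 bits, active-drainer
// count plus a CLOSED flag in the low 32 bits.
#define EXAMPLE_CLOSED_BIT (1ull << 31)
#define EXAMPLE_COUNT_MASK (EXAMPLE_CLOSED_BIT - 1)

// Worker entering drain: bump the count (generation unchanged).
static void example_enter(_Atomic uint64_t* drainers) {
  atomic_fetch_add(drainers, 1);
}

// Eager completer: fetch_or atomically sets CLOSED and observes the
// count in one step, leaving no TOCTOU window between "check count" and
// "set flag". Returns true if no drainers remain, in which case the
// closer itself must run the deferred release.
static bool example_close(_Atomic uint64_t* drainers) {
  uint64_t prior = atomic_fetch_or(drainers, EXAMPLE_CLOSED_BIT);
  return (prior & EXAMPLE_COUNT_MASK) == 0;
}

// Worker exiting drain: the last one out of a CLOSED item runs the
// release; everyone else just decrements and leaves.
static bool example_exit(_Atomic uint64_t* drainers) {
  uint64_t prior = atomic_fetch_sub(drainers, 1);
  return (prior & EXAMPLE_CLOSED_BIT) != 0 &&
         (prior & EXAMPLE_COUNT_MASK) == 1;
}
```

The single-RMW close is what removes the race: either the closer sees a zero count and releases, or some drainer's exit observes CLOSED with count 1 and releases — never both, never neither.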

Co-Authored-By: Claude <noreply@anthropic.com>
Block processor: cache the current region's barrier pointer in block
state, eliminating the O(dispatches-before-region) linear walk that
every worker performed on each drain() call. The completer stores the
next barrier during region transitions (walking from process_region's
next_cmd output through any empty barriers), published via the existing
region_epoch release. Workers load it with a single relaxed atomic read.

Add IREE_HAL_CMD_BLOCK_PROCESSOR_COMPLETER_REENTER tuning flag (default
off) that lets the completer loop back within drain() to process the
next region immediately after a barrier, skipping the ~30-50ns pump
round-trip. Tradeoff: reduces barrier-crossing latency to near-zero but
delays immediate process draining and wake relay. Similar to the
IREE_TASK_WAKE_FANOUT tradeoff — needs real workload analysis to
determine the right default.

Executor benchmark: skip benchmarks requesting more workers than
hardware_concurrency() to prevent timeouts on small CI runners.

Fix iree-bazel-lib warning count: grep -c returns exit code 1 on zero
matches, triggering || echo 0 which appended a second "0" to stdout,
producing "0\n0" that [[ -gt ]] could not parse.

Co-Authored-By: Claude <noreply@anthropic.com>
…le access.

Use iree_allocator_malloc_aligned for structs with cache-line-aligned members
(iree_task_executor_t, iree_hal_task_device_t) and add iree_arena_allocate_aligned
for arena-allocated block state. Fix left-shift-of-negative UB in semaphore
failure value decoding by performing the shift in unsigned. Allow unaligned
memory file imports when vmfb content isn't naturally aligned.

Co-Authored-By: Claude <noreply@anthropic.com>
…ess flags.

The HAL module constructs alloca params with .access=0, relying on
canonicalize to promote to ALL. The sync driver gets this for free via
iree_hal_allocator_allocate_buffer, but our transient buffer path
skipped canonicalization, creating buffers with zero access that failed
on any subsequent map/fill operation.

Co-Authored-By: Claude <noreply@anthropic.com>
The transient buffer wrapper stored the caller's requested params at creation
time, but the backing buffer allocated later by the heap allocator has
adjusted metadata (e.g. HOST_VISIBLE added, access flags canonicalized).
Sync memory_type, allowed_access, and allowed_usage from the backing buffer
at commit time so validation sees the actual capabilities.

Co-Authored-By: Claude <noreply@anthropic.com>
…ers.

Install new recordings directly into compute_current via CAS instead of
always routing through compute_pending. This eliminates a pending-to-current
hop that required an extra worker pump cycle. Moved tag helper functions and
COMPUTE_NULL_TAG constant earlier in the file so they're available from
drain_recording.

Co-Authored-By: Claude <noreply@anthropic.com>
…inning on CLOSED.

Workers that got zero tiles on a region returned did_work=false, causing
them to park and miss all subsequent regions. Changed to did_work=true
whenever an active recording was entered, keeping workers in the pump loop
for the duration of the recording.

The CLOSED flag bail-out was returning did_work=true, causing workers to
spin on a completed recording (fetch_add/fetch_sub loop preventing the
drainer count from reaching zero). Changed to did_work=false since there
is no more work on a closed item — the next recording will re-wake via
schedule_process.

Megakernel b8s128 w=32: 168ms → 7.7ms (parity with upstream 7.65ms).

Co-Authored-By: Claude <noreply@anthropic.com>
benvanik and others added 14 commits March 18, 2026 01:26
… helpers.

Adds comprehensive CTS coverage for queue operations and command buffer
reuse patterns that were previously untested:

queue_transfer_test.cc (16 tests):
  queue_fill with 1/2/4-byte patterns, subranges, large buffers (256KB).
  queue_update from host data with offsets.
  queue_copy with source/target offsets, large buffers.
  Chained operations ordered by semaphores (fill→copy, update→copy,
  fill→copy→fill pipeline).
  queue_barrier signaling and ordering preservation.

dispatch_reuse_test.cc (12 tests):
  Reusable command buffers (MODE_DEFAULT) — the hot path for model
  execution that had zero CTS coverage. Tests record once, submit
  multiple times with different binding tables.
  Large workgroup counts (1024 — existing tests max at 32).
  alloca→execute→dealloca cycles with transient buffers.
  Pipelined alloca+execute for independent transient buffers.
  Multi-dispatch command buffers with barriers between dispatches.
  Multi-dispatch multi-resubmit (5 iterations).

test_base.h improvements:
  Ref<T> RAII template for HAL objects (buffer, command_buffer,
  semaphore, executable, file, fence) with HalTraits specializations.
  SemaphoreList factored from duplicated code in queue_alloca_test.cc
  and queue_host_call_test.cc into shared header.
  ReadBufferData<T> and ReadBufferBytes helpers for verification.

Co-Authored-By: Claude <noreply@anthropic.com>
Tests fill, copy, and update through command buffers targeting transient
buffers from queue_alloca. All existing command buffer transfer tests
used only regular (synchronously allocated) buffers — transient buffers
have a different backing mechanism and were the source of access-flag
propagation bugs.

transient_buffer_test.cc (8 tests):
  Fill transient with 1-byte and 4-byte patterns.
  Fill transient subrange with boundary verification.
  Copy regular→transient and transient→regular.
  Fill + barrier + copy in a single command buffer.
  Update transient from host data.
  Fill transient allocated with zero access flags (HAL module convention).

The zero-access-flags test specifically targets the bug documented in
fuck.md where command_buffer.fill_buffer on transient buffers hit the
same access-flags issue as queue_alloca. The fix in e2262aebe12e
(propagating backing buffer metadata at commit time) appears to have
fixed both code paths — these tests serve as regression coverage.

Co-Authored-By: Claude <noreply@anthropic.com>
Tests data flow through multi-stage dispatch pipelines using the
scale_and_offset kernel (output[i] = input[i] * scale + offset).
Each stage produces verifiable results, catching barrier ordering
bugs and data visibility issues between operations.

dispatch_pipeline_test.cc (10 tests across 2 backends):
  Host data → dispatch → verify (single-stage pipeline).
  Chained dispatches: dispatch A output feeds dispatch B input via
  barrier, verifying both intermediate and final results.
  Three-stage pipeline: update_buffer → dispatch → dispatch in a
  single command buffer (transfer→dispatch→dispatch transitions).
  Transient input pipeline: alloca → fill → dispatch → persistent
  output (the real model execution pattern).
  Reusable pipeline: record two-stage dispatch chain once, re-submit
  3 times with different input data via binding tables.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds 8 new file tests covering ordering, subrange operations, and
read-modify-write pipelines for both memory files and FD files:

Memory file tests:
  Chained read→write via semaphore timeline (no host waits between
  operations — tests queue ordering correctness).
  Read and write subranges with boundary verification.
  Read-modify-write: read from source file, modify buffer via
  queue_fill, write to target file. Verifies both halves.

FD file tests (guarded by IREE_FILE_IO_ENABLE):
  Write subrange with boundary verification via re-read.
  Chained read→write between two FD files via semaphore timeline.
  Read-modify-write on a single FD file.
  Large file read (256KB) with full content verification.

Co-Authored-By: Claude <noreply@anthropic.com>
The transient buffer wrapper created by queue_alloca reported raw caller
params (e.g. DEVICE_LOCAL without HOST_VISIBLE) until the backing buffer
was committed. On CPU backends where the heap allocator adds HOST_VISIBLE
to all buffers, this caused command buffer validation to reject transient
buffers with PERMISSION_DENIED before the alloca drain fired.

Fix: run params through iree_hal_allocator_query_buffer_compatibility
before creating the transient wrapper, matching what allocate_buffer does
internally. The wrapper now reports the allocator-adjusted memory type
from creation.

Co-Authored-By: Claude <noreply@anthropic.com>
…-region corruption.

The block processor's remaining_tiles counter was a plain int32 shared
across sequential regions. When a completer advanced to the next region
(calling init_region which stores a new remaining_tiles value), stale
workers from the completed region could still have pending fetch_sub
operations. These stale decrements applied to the NEW region's count,
pushing it negative. With remaining_tiles at -1, the completer election
check (old - my_tiles == 0) could never fire, and the recording hung
forever — no worker became the completer, no cleanup, deadlock.

The fix gives remaining_tiles the same structural protection as
tile_index: epoch-tagged 64-bit atomic with CAS-based decrement. Workers
validate the epoch before decrementing. If the completer already advanced
(new epoch), the stale worker's CAS fails harmlessly — exactly like
stale tile_index CAS failures. The completer's init_region stores the
new epoch|count atomically, and new-region workers see the correct value.

Also adds CTS stress tests for rapid repeated command buffer submission.

Co-Authored-By: Claude <noreply@anthropic.com>
…d compute_current.

The recording item pool was a fixed array of 4 items embedded in the
queue struct. When all items were in-flight (being drained or awaiting
deferred release), submissions failed with RESOURCE_EXHAUSTED. This
is unacceptable — correct API usage must never hit a hard pool limit.

Replace the fixed array with arena-allocated items:
- Items are bump-allocated from an arena backed by the large block pool
  with cache-line alignment and a trailing worker_states[] FAM sized to
  the actual worker_count (was fixed at MAX_WORKER_COUNT=256).
- The free pool slist is unchanged — items cycle through it as before.
- When the free pool is empty, a new item is allocated from the arena
  inline. No RESOURCE_EXHAUSTED, no blocking.
- An all-items linked list tracks allocated items for shutdown cleanup.
- The arena is deinitialized at queue shutdown, returning all blocks.

Switch compute_current from tagged int64 (generation|pool_index) to a
direct item pointer (iree_atomic_intptr_t). This eliminates the index
lookup, the tag construction/extraction helpers, and the null sentinel
check against POOL_SIZE. ABA protection is provided by the drainers
field's generation (checked after fetch_add) and the compute_current
pointer re-check (catches recycled items).

Initial pool size increased from 4 to 16 to match typical pipeline
depths without requiring growth in the common case.

Co-Authored-By: Claude <noreply@anthropic.com>
The block command buffer maps direct buffer bindings at recording time
via iree_hal_buffer_map_range to get host pointers for inline fixups.
Transient buffers from queue_alloca have no backing memory until the
alloca operation drains (the backing is allocated asynchronously and
committed via semaphore ordering). This caused FAILED_PRECONDITION when
recording dispatches that reference queue_alloca'd buffers — the common
case for __init with io_parameters.

Add a DEFERRED fixup flag: when map_range fails at recording time, store
the buffer pointer in the fixup instead of the host pointer. At drain
time (when the buffer is guaranteed to be committed), resolve_bindings
maps it then. The buffer is retained by the CB's resource_set, so the
pointer is stable through drain.

Co-Authored-By: Claude <noreply@anthropic.com>
All three proactor backends (io_uring, IOCP, POSIX) silently truncated
file reads larger than ~2-4GB due to kernel API size limits (uint32_t
SQE len, DWORD ReadFile parameter, undefined pread behavior for
counts > INT_MAX). Loading a 5GB IRPA parameter slab failed with a
short read because sqe->len = (uint32_t)buffer.length wrapped.

Fix in two parts:
- Cap the per-operation read/write size at INT32_MAX in each backend.
  This prevents silent truncation — the backend reads up to 2GB per
  submission and reports the actual bytes transferred.
- Retry partial reads/writes in the task_queue's completion handlers.
  When bytes_transferred < requested_length and no error, the handler
  advances the buffer/offset and resubmits for the remaining bytes.
  Accumulates total_bytes_transferred across resubmissions. Only
  reports short read/write when bytes_transferred == 0 (EOF) before
  the full length is reached.

Co-Authored-By: Claude <noreply@anthropic.com>
…ling.

Two changes that together improve multi-worker throughput by ~19% on the
Qwen3.5-4B level2 decode benchmark at 16 workers:

1. Dynamic worker_budget at region transitions: The completer updates the
   compute process's worker_budget based on the next region's tile count
   (min(tiles, worker_count)). When ramping up, adds wake credits to the
   executor's desired_wake so the relay mechanism wakes additional workers.
   This prevents waking 192 workers for a 1-tile region.

2. CLOSED bail-out returns did_work=true: When a recording completes,
   workers that hit the CLOSED flag at entry were returning did_work=false,
   causing them to sleep via futex. But the completer has already installed
   the next pending recording — workers just need to loop back. Returning
   did_work=true keeps workers active across recording boundaries,
   eliminating futex round-trips between recordings.

Co-Authored-By: Claude <noreply@anthropic.com>
Workers previously checked the immediate list (budget-1 processes) before
compute slots on every pump iteration. The immediate list uses a
mutex-guarded slist — with 16 workers all popping an empty list every
iteration, the futex contention accounted for 34% of kernel overhead.

Reorder: check compute slots first. If compute work was found, skip the
immediate list entirely. Workers doing tile execution never touch the
immediate list mutex. Only idle workers (no compute work) check the
immediate list, so exactly one picks up the budget-1 control process.

Reduces context switches by 23x (27,002 → 1,164) and eliminates the
16-worker performance regression on the Qwen3.5-4B decode benchmark.

Co-Authored-By: Claude <noreply@anthropic.com>
When an fd_file has no async file handle (proactor import failed or
unavailable), the queue read/write operations now fall back to
synchronous pread/pwrite via the HAL file vtable. Previously this
was a hard error ("no storage buffer and no async handle").

The sync fallback executes inline on the budget-1 control process
worker thread during drain. This blocks that worker for the I/O
duration, which is acceptable: it's the same thread that would have
been waiting for the proactor callback anyway, and the alternative
was "don't work at all."

The async proactor path remains preferred when available. The fallback
only activates when async_file is NULL (import failed or platform
doesn't support async fd import, e.g. Windows IOCP).

Co-Authored-By: Claude <noreply@anthropic.com>
…data.

transient_buffer_commit was writing memory_type, allowed_access, and
allowed_usage on the worker thread during alloca drain. Meanwhile,
HAL submission validation reads those same fields synchronously on
the submitting thread (queue_execute → validate_binding_requirements).
The semaphore dependency between alloca and execute gates the
operations but not the pre-submission validation, so TSAN correctly
reported a data race.

Fix: don't overwrite the metadata in commit. The buffer was already
initialized with the caller's requested params at creation time. The
backing buffer's params can only be a superset (allocator adds
capabilities like HOST_COHERENT, never removes requested ones), so
validation against requested params is conservative-correct. The
actual data path (map/unmap/flush/invalidate) forwards to the
committed buffer's vtable which uses its own metadata, so coherency
handling is unaffected.

Co-Authored-By: Claude <noreply@anthropic.com>
The eager completer called consume_result (atomic exchange on
context->error_status) inside compute_item_complete, then did
fetch_sub on drainers. The deferred releaser observed the
decremented drainers count, fired context_free — which frees the
memory containing error_status. TSAN reported the race between the
exchange and the free because the ordering went through an indirect
atomic chain (drainers) that TSAN could not trace.

Fix: consume the processor result in the drain function BEFORE the
CLOSED fetch_or, while the worker is still a registered drainer.
This ensures the exchange on error_status completes before any
fetch_sub that could enable the deferred release path. The consumed
status is passed directly to compute_item_complete. Workers that
lose the CLOSED race discard their snapshot.

Co-Authored-By: Claude <noreply@anthropic.com>
@benvanik benvanik added runtime Relating to the IREE runtime library hal/cpu Runtime Host/CPU-based HAL backend labels Mar 18, 2026