
Rewrite local_task executor and queue as process-based cooperative scheduler.#23827

Draft
benvanik wants to merge 44 commits into users/benvanik/cpu-1 from users/benvanik/cpu-2

Conversation

@benvanik
Collaborator

Replaces the coordinator-based task executor and DAG-based queue with a process-based cooperative scheduling model. The old system routed every operation through 4 thread hops (wait -> issue -> dispatch shards -> retire), each involving coordinator mutex acquisition, futex wake syscalls, and context switches. The new system eliminates the coordinator entirely and reduces dispatch latency from ~300us to ~5us.

Architecture

Process model: The universal work unit is iree_task_process_t — a cooperative drainable entity with a drain function, suspend count, worker budget, and dependent list. Processes replace the previous task DAG (NOP, CALL, BARRIER, FENCE, DISPATCH, DISPATCH_SHARD types) with a single abstraction.
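
To make the single-abstraction claim concrete, here is a minimal sketch of what such a process work unit could look like. The field and function names are hypothetical illustrations, not the actual iree_task_process_t definition:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Illustrative sketch of a cooperative "process" work unit; names are
// hypothetical and do not match the real iree_task_process_t layout.
typedef struct example_process_t example_process_t;

// Drain incrementally: return true when the process has fully completed.
typedef bool (*example_drain_fn_t)(example_process_t* process,
                                   uint32_t worker_index);

struct example_process_t {
  example_drain_fn_t drain;       // cooperative drain entry point
  _Atomic int32_t suspend_count;  // >0 means not yet runnable
  int32_t worker_budget;          // max workers draining concurrently
  _Atomic(example_process_t*) dependents;  // intrusive dependent list
  void* user_data;                // per-process state for the drain fn
};

// Resuming a suspended process: when the count reaches zero it becomes
// schedulable with no additional thread hop.
static bool example_process_resume(example_process_t* p) {
  return atomic_fetch_sub(&p->suspend_count, 1) == 1;  // true: now runnable
}
```

Everything — queue control, command buffer execution, host calls — fits this one shape, which is what lets the executor treat them uniformly.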

Executor: Workers scan compute slots and an immediate list instead of going through a coordinator. The coordinator mutex, incoming ready slist, and work-stealing infrastructure are removed. Workers use a Dekker-style sleeping protocol with adaptive spin timeouts based on aggregate worker budget.

Block ISA: A compact bytecode representation for command buffer operations that workers cooperatively execute through a block processor. Supports direct and indirect command buffers, VM dispatch fallback (VMVX), reusable recordings, and deferred fixups for buffers not mappable at recording time.

Queue: Each device queue is a persistent budget-1 process that pops operations from an MPSC ready list. Command buffer recordings are delegated to a persistent budget-N compute process for multi-worker tile distribution. Two-phase completion (eager semaphore signaling + deferred resource release) ensures low-latency signal propagation without use-after-free.

Performance

Dispatch latency on a 96-core EPYC (NUMA-pinned):

  • Single worker wake: 5.1us (target was <10us)
  • Warm worker reuse: 4.9us
  • 4-worker cold wake: 27us (<30us target for non-dominated path)

local-task w=1 overhead vs local-sync: eliminated. It was previously +3.4% (~287M extra instructions, mostly spinning/contention) on Qwen3.5-4B decode; it is now noise-equivalent at -1% median. The fast-path empty check on slist pop/flush and the pump loop reorder (draining compute slots before the immediate list) are the key contributors.

For a more direct comparison:

  ┌─────────┬──────────────┬──────────────┬─────────────────┐
  │ Threads │ IREE level3  │ ik-llama.cpp │ stock llama.cpp │
  ├─────────┼──────────────┼──────────────┼─────────────────┤
  │ 1       │ 185.0 ms/tok │ 132.7 ms/tok │ 243.0 ms/tok    │
  ├─────────┼──────────────┼──────────────┼─────────────────┤
  │ 4       │ 83.7 ms/tok  │ 73.0 ms/tok  │ 160.4 ms/tok    │
  ├─────────┼──────────────┼──────────────┼─────────────────┤
  │ 16      │ 76.8 ms/tok  │ 101.8 ms/tok │ 142.4 ms/tok    │
  └─────────┴──────────────┴──────────────┴─────────────────┘

IREE now wins at t=16 — ik-llama.cpp degrades past 4 threads (73→102ms) while we keep scaling (84→77ms). Stock llama.cpp doesn't appear to use VNNI at all despite the flags. This holds despite IREE's slower single-core time (less optimized quantized matvec).

Correctness

  • Full HAL CTS passes under both ASAN and TSAN (7 test suites each)
  • Two TSAN races found and fixed:
    • Transient buffer commit was overwriting metadata fields on the worker thread while HAL submission validation read them synchronously. Fixed by not overwriting — the creation-time params are always a conservative-correct subset.
    • Compute process two-phase completion had an ordering gap where the eager completer's atomic exchange on the context raced with the deferred releaser's free. Fixed by consuming the result before the CLOSED fetch_or while the worker is still registered.
  • Native queue operations (alloca, dealloca, fill, copy, update, read, write, dispatch) with proper semaphore wait satisfaction, frontier tracking, and error propagation
  • Async proactor file I/O with synchronous pread/pwrite fallback when async import fails
  • 2+GB file read/write support via per-operation size capping and retry loops across all three proactor backends (io_uring, IOCP, POSIX)

What was removed

The entire old task system: coordinator mutex, task DAG types (NOP/CALL/BARRIER/FENCE/DISPATCH/SHARD), shard pool, post batch routing, worker mailboxes, work stealing, submission batching, and the old task command buffer. The old queue emulation path for fill/copy/update/dispatch is no longer used by local_task (still used by GPU drivers).

benvanik and others added 30 commits March 18, 2026 01:26
…eduler.

Process replaces the 6-type task DAG with a single type that workers drain
incrementally. Atomic suspend_count for zero-hop activation, first-error-wins
CAS on error_status, cache-line-padded struct layout. Cancel of SUSPENDED
processes resolves inline (dependents, completion callback, scope_fail without
scope_end). 24 tests under ASAN+TSAN covering lifecycle, dependent resolution,
scope integration, and concurrent safety.
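
The first-error-wins CAS mentioned above can be sketched as follows; the status type and names here are simplified stand-ins for illustration, not the real IREE API:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// First-error-wins: only the first failing worker installs its status;
// later errors lose the CAS and would be released by the caller.
// iree_status_t is modeled as an opaque pointer-sized code here.
typedef intptr_t example_status_t;
#define EXAMPLE_STATUS_OK ((example_status_t)0)

static bool example_try_set_error(_Atomic example_status_t* error_status,
                                  example_status_t new_error) {
  example_status_t expected = EXAMPLE_STATUS_OK;
  // Strong CAS: succeeds only if no error has been recorded yet.
  return atomic_compare_exchange_strong(error_status, &expected, new_error);
}
```

Because the CAS is the only mutation path, concurrent failing workers never clobber each other's status and no lock is needed.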

Co-Authored-By: Claude <noreply@anthropic.com>
…ekker sleeping protocol.

Processes are now self-contained (scope dependency removed — processes ARE
scopes). Workers pop from a lock-free MPSC immediate list and drain
cooperatively with a three-state schedule protocol (IDLE/QUEUED/DRAINING).

The sleeping protocol closes the race between "drain returned no work" and
"new work arrived while draining" using a Dekker-style pattern: the worker
stores schedule_state=IDLE (seq_cst) then loads needs_drain; the scheduler
stores needs_drain=1 then CAS(schedule_state). seq_cst on the IDLE store
provides the StoreLoad barrier required on ARM.
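
The two sides of that protocol can be sketched in C11 atomics; state names and helpers below are illustrative, not the real executor symbols:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { EX_IDLE = 0, EX_QUEUED = 1, EX_DRAINING = 2 };

typedef struct {
  _Atomic int schedule_state;
  _Atomic int needs_drain;
} example_worker_t;

// Worker side: publish IDLE with seq_cst (providing the StoreLoad barrier
// on ARM), then re-check needs_drain. Returns true if sleeping is safe.
static bool example_worker_try_sleep(example_worker_t* w) {
  atomic_store(&w->schedule_state, EX_IDLE);  // seq_cst by default
  if (atomic_load(&w->needs_drain)) {
    // Work raced in while we were draining: take it back, don't sleep.
    int expected = EX_IDLE;
    if (atomic_compare_exchange_strong(&w->schedule_state, &expected,
                                       EX_DRAINING)) {
      return false;  // resume draining
    }
  }
  return true;  // no pending work observed; parking is safe
}

// Scheduler side: publish the work first, then try to claim the worker.
// Returns true when the worker had gone IDLE and must be woken.
static bool example_schedule(example_worker_t* w) {
  atomic_store(&w->needs_drain, 1);  // seq_cst, ordered before the CAS
  int expected = EX_IDLE;
  return atomic_compare_exchange_strong(&w->schedule_state, &expected,
                                        EX_QUEUED);
}
```

Either the worker observes needs_drain before parking, or the scheduler's CAS observes IDLE and wakes it — the lost-wakeup window is closed in both interleavings.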

Cross-validated review (Codex+Gemini) findings addressed:
- seq_cst on IDLE store for Dekker StoreLoad barrier
- Assert worker_state_size==0 until compute slots land
- Assert !is_terminal in schedule_process entry
- Deterministic entered_sleep signal replacing fixed sleep_for in tests

11 executor_process_test cases including 3 stress tests (repeated sleep/wake
cycles, concurrent multi-thread sleep/wake, multi-stage dependency chains).
All 13 task tests pass ASAN and TSAN with zero warnings.

Co-Authored-By: Claude <noreply@anthropic.com>
…tions.

Replaces the per-command task DAG model with a compile-and-execute approach.
Recording compiles HAL API calls into a compact binary stream (.text) with
per-block mutable execution state (.data). Issuing initializes .data and
submits to the task executor for cooperative multi-worker execution.

Block builder: write-forward compiler with dual-cursor block layout (commands
forward, fixups backward) and automatic block splitting via BRANCH. All memory
from the block pool — no system allocator during recording.

Block processor: cooperative drain engine where workers claim tiles via atomic
CAS. Region transitions handled by elected completer with epoch-tagged
tile indices (no arrival barrier needed). Single-worker path is synchronous
with zero atomics.

Block ISA: 6-opcode command format (dispatch, fill, copy, barrier, branch,
return) with indirect/predicated/sequential flags.
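
A hypothetical encoding of such a command header is sketched below — the opcode values, flag bits, and struct layout are invented for illustration and differ from the real block ISA:

```c
#include <stdint.h>

// Illustrative 6-opcode command header for a compact block ISA; the
// actual IREE encoding differs.
typedef enum {
  EX_CMD_DISPATCH = 0,
  EX_CMD_FILL,
  EX_CMD_COPY,
  EX_CMD_BARRIER,
  EX_CMD_BRANCH,
  EX_CMD_RETURN,
} example_cmd_opcode_t;

enum {
  EX_CMD_FLAG_INDIRECT = 1u << 0,    // bindings resolved via binding table
  EX_CMD_FLAG_PREDICATED = 1u << 1,  // conditionally executed
  EX_CMD_FLAG_SEQUENTIAL = 1u << 2,  // single-worker ordered execution
};

typedef struct {
  uint8_t opcode;   // example_cmd_opcode_t
  uint8_t flags;    // EX_CMD_FLAG_* bits
  uint16_t length;  // total command size in bytes, including this header
} example_cmd_header_t;
```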

Co-Authored-By: Claude <noreply@anthropic.com>
…nment fixes.

Adds the HAL vtable implementation (block_command_buffer.c) that translates
command buffer API calls into the block ISA format via the builder, and
exposes dispatch_ptrs on local_executable for recording-time function
resolution.

Eliminates the 6KB stack-allocated fixups[256] array from the dispatch
recording path: append_cmd now returns a pointer directly into block
storage for in-place fixup population, with pop_cmd rollback on failure.

Fixes two UBSAN alignment bugs found during sanitizer testing:
- Tile reservation at block end is now rounded up to fixup alignment (8
  bytes) so the fixup table always starts at a properly aligned address.
- Processor context allocation uses iree_allocator_malloc_aligned for
  64-byte cache line alignment on the false-sharing-separated atomics.

Fixes signed integer overflow in the multi-worker tile stealing CAS loop:
int32_t counter/tile_count comparisons would silently drop all work for
dispatches with >2^31 tiles. Switched to uint32_t throughout.
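
A sketch of the unsigned claim loop (hypothetical names; the real stealing code differs) shows why the types matter — with int32_t, tile counts past 2^31 read as negative and the bounds check silently drops all work:

```c
#include <stdatomic.h>
#include <stdint.h>

// Workers claim contiguous batches of tiles with a CAS loop. Returns the
// number of tiles claimed (0 when the dispatch is exhausted) and writes
// the first claimed tile index to *out_base.
static uint32_t example_claim_tiles(_Atomic uint32_t* counter,
                                    uint32_t tile_count, uint32_t batch,
                                    uint32_t* out_base) {
  uint32_t current = atomic_load(counter);
  for (;;) {
    if (current >= tile_count) return 0;  // nothing left to claim
    uint32_t remaining = tile_count - current;
    uint32_t take = remaining < batch ? remaining : batch;
    if (atomic_compare_exchange_weak(counter, &current, current + take)) {
      *out_base = current;
      return take;
    }
    // CAS failure reloaded `current`; retry with the fresh value.
  }
}
```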

Co-Authored-By: Claude <noreply@anthropic.com>
Remove void* worker_state from the drain function signature — drain
functions now compute per-worker state from process->user_data and
worker_index, eliminating generic void* indirection and concurrent
reclamation complexity from the executor.

Add compute slots: a fixed-size array of atomic process pointers in the
executor for budget>1 processes. Workers scan these round-robin after
draining the immediate list, cooperatively executing bounded work from
each active process. Budget-1 processes continue using the immediate
list with the Dekker sleeping protocol. Slot lifecycle is CAS-gated:
schedule_process places via CAS(NULL→process), the completing worker
removes via CAS(process→NULL) ensuring exactly-once completion.
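
The CAS-gated slot lifecycle can be sketched as below; the slot count and type names are hypothetical simplifications of the executor's actual structures:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_SLOT_COUNT 16
typedef struct { int dummy; } example_process_t;
typedef struct {
  _Atomic(example_process_t*) slots[EXAMPLE_SLOT_COUNT];
} example_executor_t;

// schedule_process: place via CAS(NULL -> process); first empty slot wins.
static bool example_place(example_executor_t* ex, example_process_t* p) {
  for (size_t i = 0; i < EXAMPLE_SLOT_COUNT; ++i) {
    example_process_t* expected = NULL;
    if (atomic_compare_exchange_strong(&ex->slots[i], &expected, p)) {
      return true;
    }
  }
  return false;  // all slots busy (real code falls back to overflow)
}

// completion: remove via CAS(process -> NULL); exactly one caller
// succeeds, so completion work runs exactly once even when several
// workers race on the same finished process.
static bool example_release(example_executor_t* ex, size_t slot,
                            example_process_t* p) {
  example_process_t* expected = p;
  return atomic_compare_exchange_strong(&ex->slots[slot], &expected, NULL);
}
```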

Generalize wake_one_worker to wake_workers(count) so compute processes
can wake workers proportional to their budget.

Co-Authored-By: Claude <noreply@anthropic.com>
…aining.

Implements the issue side of the block command buffer, bridging recorded
command buffers to the process-based executor. The issue function allocates
a cache-line-aligned processor context, initializes an embedded process
with a drain adapter, and sets up an internal completion callback that
handles processor error consumption and context cleanup before chaining
to the caller's completion callback.

The drain adapter maps block processor drain results to process drain
results, with errors deferred to the completion callback (which runs
exactly once) to avoid races between error-consuming workers and the
completion CAS.

Co-Authored-By: Claude <noreply@anthropic.com>
The queue now uses a single persistent process that drains an MPSC ready
list instead of creating a 3-task chain (wait → issue → retire) per
submission. Operations are arena-allocated at submit time and flow through
semaphore waits into the ready list. The queue process pops operations and
handles them: command buffers are issued as separate compute processes via
block_command_buffer_issue; barriers and host calls execute inline.
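
A minimal intrusive MPSC ready list of the kind the queue process drains could look like this (illustrative names, including the empty fast-path check mentioned in the PR summary; the real iree_atomic_slist API differs):

```c
#include <stdatomic.h>
#include <stddef.h>

// Producers push lock-free onto a LIFO head; the single queue process
// flushes everything at once and reverses to FIFO order.
typedef struct example_op_t {
  struct example_op_t* next;
  int payload;
} example_op_t;

typedef struct {
  _Atomic(example_op_t*) head;
} example_ready_list_t;

static void example_push(example_ready_list_t* list, example_op_t* op) {
  example_op_t* head = atomic_load(&list->head);
  do {
    op->next = head;  // CAS failure reloads head; relink and retry
  } while (!atomic_compare_exchange_weak(&list->head, &head, op));
}

// Single consumer: detach the whole list in one exchange, then reverse.
static example_op_t* example_flush(example_ready_list_t* list) {
  // Fast-path empty check: a plain load avoids the atomic RMW when idle.
  if (atomic_load_explicit(&list->head, memory_order_relaxed) == NULL) {
    return NULL;
  }
  example_op_t* lifo = atomic_exchange(&list->head, NULL);
  example_op_t* fifo = NULL;
  while (lifo) {
    example_op_t* next = lifo->next;
    lifo->next = fifo;
    fifo = lifo;
    lifo = next;
  }
  return fifo;
}
```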

Cooperative shutdown: deinitialize sets a shutting_down flag and schedules
one final drain. The drain returns completed=true, triggering the process
completion callback which calls scope_end. scope_wait_idle then blocks
until the process has fully completed and no worker is touching device
resources.

Removes dead code: task_command_buffer.c/h (replaced by block command
buffer), task_queue_state.c/h (placeholder with only int reserved).

Also fixes:
- TSAN race in CTS queue_host_call_test.cc: std::thread* assignment
  after pthread_create is sequenced-after the sync point, so the
  happens-before chain from worker to main thread was broken. Fixed
  with std::atomic<std::thread*>.
- Upstream VectorizableOpInterface BUILD.bazel missing mlir:Support dep.

ASAN and TSAN clean on full local_task test suite.

Co-Authored-By: Claude <noreply@anthropic.com>
The queue-as-process rewrite (959d291d) replaced the old DAG task chain
with persistent budget-1 processes draining an MPSC ready list. This
removes all dead DAG infrastructure: task types, coordinator, submissions,
task pools, per-worker mailboxes, work stealing, task queues, benchmarks,
and associated tests.

Surviving surface: process executor with budget-1 immediate list,
budget>1 compute slots, Dekker sleeping/wake protocol, cooperative
multi-worker draining, and NUMA-aware topology.

Also improves cache line layout in worker and executor structs
(cross-thread-written fields separated from owner-only fields),
cleans up tuning.h to its two surviving constants, and adds missing
direct includes exposed by removing transitive deps.

Co-Authored-By: Claude <noreply@anthropic.com>
Introduces executor_benchmark.cc with BM_WakeAllWorkers exercising the
full schedule/drain/complete/release cycle at 1-32 worker counts.

Optimizes the worker pump loop to track is_active state locally,
eliminating 2N atomic RMWs per pump cycle on the shared idle mask
when workers are spinning with work.

Fixes Dekker protocol ordering: the scheduler's needs_drain store
used release (insufficient for StoreLoad on ARM), now uses seq_cst
to match the worker side. Fixes process initialization to use explicit
atomic stores for all _Atomic fields instead of relying on memset
(undefined behavior in C11). Moves dump_wake_state from public header
to executor_impl.h, removing stdio.h from the public API surface.

Co-Authored-By: Claude <noreply@anthropic.com>
Replaces the synchronous alloca/dealloca stubs in local_task with
queue-ordered operations. Alloca returns a transient buffer handle
immediately; the drain handler allocates real backing memory and commits
it via atomic store. Dealloca decommits the backing in the drain handler.
The transient buffer uses acquire/release atomics on the committed pointer
for TSAN visibility (semaphore ordering provides real happens-before but
TSAN cannot see through it).

Also fixes a bug in iree_hal_buffer_map_range where the error path after
commit_map_range failure called iree_hal_buffer_retain instead of
iree_hal_buffer_release, leaking one buffer ref on every failed scoped
mapping.

Co-Authored-By: Claude <noreply@anthropic.com>
Replace the uint64_t typedef for iree_task_affinity_set_t with a multi-word
struct controlled by IREE_TASK_TOPOLOGY_MAX_GROUP_COUNT in tuning.h (default
256). Each affinity set is now an array of 64-bit words, with per-worker
atomic operations remaining single-word lock-free via iree_task_affinity_bit_t.

Key changes:
- affinity_set.h: Complete rewrite with struct-based types, precomputed bit
  positions, and full API (construction, queries, mutation, bulk ops, atomics).
- tuning.h: IREE_TASK_TOPOLOGY_MAX_GROUP_COUNT as primary constant with
  derived IREE_TASK_EXECUTOR_MAX_WORKER_COUNT and IREE_TASK_AFFINITY_SET_WORD_COUNT.
- executor_impl.h: Add worker_idle_count for tracing (replaces popcount on
  multi-word mask), sharing desired_wake's cache line.
- worker.h/c: worker_bit becomes iree_task_affinity_bit_t, mark_active/idle
  use atomic set helpers, relay_wake uses find_first/clear_index iteration.
- topology*.c: Group masks widened to iree_task_affinity_set_t. Fixed latent
  >64-processor bugs in cpuinfo backend (cache_bits and sharing mask now use
  proper multi-word operations with bounds checking).
- executor_test.cc: Replace uint64_t worker_mask with per-worker bool array.
- executor_benchmark.cc: Extend arg ranges to 64, 128, 256 workers.

Per-worker atomic operations (mark_active/mark_idle) remain single-word
lock-free: each worker's iree_task_affinity_bit_t addresses exactly one
64-bit word. Scanning ops (find_first across words) are a small constant
factor (4 loads + 4 comparisons for 256 workers worst case).

HAL queue affinity (iree_hal_queue_affinity_t) stays at 64 bits -- queues
map to NUMA nodes, not individual threads.

Co-Authored-By: Claude <noreply@anthropic.com>
When all 16 compute slots were occupied and a budget>1 process was
scheduled, iree_task_executor_place_in_compute_slot hit IREE_ASSERT(false)
which is stripped in release builds — silently dropping the process and
hanging its dependents indefinitely.

Replace the assert with an overflow slist. Processes that cannot be
placed into a compute slot are pushed to executor->compute_overflow.
When a worker releases a compute slot (release_compute_process step 3),
it pops from the overflow list and CAS-es the process into the newly
freed slot. The existing re-wake logic (step 4) then wakes workers for
the promoted process.

The overflow path handles the race where a concurrent schedule_process
fills the just-freed slot before the overflow promotion: in that case,
the overflow process is placed into any other empty slot, or pushed
back to the overflow list if all slots are still full.

In practice the overflow list is almost always empty — 16 concurrent
budget>1 processes is far beyond typical usage — but this eliminates
the hard slot limit and guarantees no silent process drops regardless
of workload.

Co-Authored-By: Claude <noreply@anthropic.com>
Introduces a persistent budget-N compute process per queue that occupies
a single compute slot for the queue's lifetime. The budget-1 control
process fills recording items and pushes them to the compute pending
list; the compute process drains recordings cooperatively across all
workers via the block processor.

Per-recording two-phase completion ensures semaphores are signaled
eagerly (first worker to observe completion) while resources stay alive
until all workers have exited drain (last active_drainers decrement).
Pool-based recording items (4 pre-allocated) cycle between free_pool,
pending, current, and back to free_pool with tagged ABA prevention.

Simplifies block_command_buffer by removing the issue context API
entirely — CTS is now the test surface for command buffers. Deletes
block_command_buffer_test.cc and removes scope/executor parameters
from the create signature.

Fixes two shutdown bugs in the compute process lifecycle:

  The completion callback must NOT call scope_end: other workers may
  still be inside drain, and scope_wait_idle returning would let the
  main thread free the queue while workers access it (TSAN data races
  on slist mutex, shutting_down field, and device allocation). The
  process-level scope_end is deferred to the release callback, which
  fires only after the last slot drainer exits (active_drainers
  sentinel CAS).

  Recordings that completed eagerly but whose deferred release has not
  yet fired are invisible to list-based cleanup — they are not in
  compute_current, compute_pending, or the free pool. The release
  callback now scans all pool items directly, catching items in this
  limbo state and firing their per-operation scope_end.

Adds iree_atomic_slist_discard for O(1) list clearing when the entries
are managed externally (e.g., pool-scanned cleanup).

Co-Authored-By: Claude <noreply@anthropic.com>
Replaces the synchronous file_transfer.c-based queue_read/queue_write
(which allocated 64MB staging buffers, blocked the caller, and did two
data copies) with proper async I/O through the proactor system.

Two independent code paths based on file type:
- Memory files (storage_buffer available): route directly to queue_copy
  for single-copy (memcpy) transfer through the block command buffer
  pipeline.
- FD files (async_handle available): zero-copy proactor I/O — pread/
  pwrite directly into/from mapped HAL buffers via io_uring or POSIX
  async backends.

HAL infrastructure changes:
- Add iree_hal_file_async_handle() vtable method for retrieving the
  proactor-managed async file handle (NULL for memory files).
- Add iree_hal_file_validate_access() shared utility (extracted from
  file_transfer.c).
- Add proactor parameter to iree_hal_fd_file_from_handle() and
  iree_hal_file_from_handle() — fd files dup+import at construction
  time, creating immutable fully-bound file objects.
- All other drivers pass NULL proactor (no behavioral change).

Task queue changes:
- New READ/WRITE operation types with arena-allocated I/O context that
  bridges the drain→proactor callback async gap.
- Drain handlers map the buffer, submit proactor operations, and return
  immediately. Completion callbacks fire on the proactor poll thread,
  handle coherency (flush/invalidate), unmap, and signal semaphores.
- Extract iree_hal_task_queue_op_fail() helper to deduplicate the
  frontier-fail + op_destroy pattern across four call sites.

io_uring fix:
- Create rings with R_DISABLED when SINGLE_ISSUER is requested, then
  enable via REGISTER_ENABLE_RINGS on the poll thread's first poll().
  With DEFER_TASKRUN, the kernel pins the single issuer to the
  io_uring_setup caller — R_DISABLED defers this binding so the
  proactor pool can create the ring on one thread and poll from another.

Co-Authored-By: Claude <noreply@anthropic.com>
bd-310 (P0): Fix use-after-free in enqueue_waits when acquire_timepoint
fails mid-list. Previously, registered timepoint callbacks would access
freed arena memory after the caller destroyed the operation. Now records
the error in error_status and atomically subtracts the unregistered count
from wait_count, letting registered callbacks drain naturally and destroy
the operation on the last decrement.

bd-2t9 (P1): Fix failed host calls incorrectly advancing the frontier
tracker. The call_status error was consumed by semaphore_list_fail but
never transferred to the local status variable, causing the frontier to
advance instead of fail. Now propagates call_status to status.

bd-2io (P3): Relax file length validation. Read validation now skips
the check when file_length is 0 (non-seekable fds like pipes). Write
validation is removed entirely since the OS handles file extension.

Co-Authored-By: Claude <noreply@anthropic.com>
…ops.

Extract command-building logic from block_command_buffer.c into shared
block_command_ops.h/c that both command buffers and the queue call. The
queue builds single-command recordings at drain time via the block builder
and executes them through the same block processor used by command buffers.

Native queue operations replace the emulation shims (queue_emulation.c)
for local_task: fill/copy/update execute inline (single-worker), dispatch
supports both inline (ALLOW_INLINE_EXECUTION flag) and multi-worker paths
through the compute process.

Two-phase submit (submit_op_begin/finish) eliminates boilerplate across
submit functions. Drain handlers use scoped buffer mappings with explicit
unmap for inline execution; non-inline dispatch uses persistent mappings
(pointer must survive across threads until compute process completion).

Co-Authored-By: Claude <noreply@anthropic.com>
… execution.

Move iree_hal_cmd_block_processor_context_t from block_processor.c to
block_processor.h (non-opaque). Add context_initialize for single-worker
callers that provide their own context + state storage.

The inline execution path (drain_fill/copy/update) now stack-allocates the
context and arena-allocates the .data state from the operation's arena.
This eliminates the per-operation malloc_aligned + free_aligned pair that
was the last heap allocation on the native queue path. The .data typically
fits in the same 4KB block pool block that already holds the operation.

Co-Authored-By: Claude <noreply@anthropic.com>
… fix.

Three related changes to the block ISA command buffer system:

Indirect command buffer support (bd-8l7): Accept binding_capacity > 0 in
block_command_buffer_create. Record indirect fixups (host_ptr=NULL, slot-
based) when buffer_ref.buffer is NULL. Wire the HAL binding table through
drain_commands → drain_recording → context_allocate with SCOPED mappings
and proper unmap-before-signal ordering in op_destroy.

Fixup performance optimizations: Restructure iree_hal_cmd_fixup_t with a
three-way discrimination (indirect → direct inline → span) optimized for
the indirect fast path. Replace per-binding arena-allocated spans with
inline host pointers in the fixup struct, eliminating the CB's arena
entirely. Add resolve_refs batch API. Skip resource_set inserts for
indirect dispatch bindings. Store executable + export_ordinal on dispatch
commands for VM fallback and future profiling.

Compute process release race fix: Move schedule_state IDLE transition from
eager_complete to release_compute_process. The premature IDLE in eager
completion allowed schedule_process to reschedule the compute process into
a new slot before the release callback fired, causing overlapping releases
that double-freed the processor context.

Co-Authored-By: Claude <noreply@anthropic.com>
Add length field to iree_hal_cmd_fixup_t (24→32 bytes) so that
binding_lengths[] can be populated alongside binding_ptrs[] during
fixup resolution. Previously binding_lengths was always NULL, which
caused VMVX dispatch to SEGV when wrapping bindings as iree_vm_buffer_t.

Store executable pointer and export ordinal on the dispatch command
(replacing the environment pointer, which is derivable from the
executable). When function is NULL (VMVX, JIT, external executables),
the processor dispatches through iree_hal_local_executable_issue_call
instead of the direct function pointer.

All 20 CTS dispatch tests now pass (llvm_cpu + VMVX, direct + indirect).

Co-Authored-By: Claude <noreply@anthropic.com>
Remove the one-shot-only restriction from block_command_buffer_create.
The block ISA architecture already supports reusable CBs: the recording
is immutable .text shared across concurrent submissions, each submission
gets its own .data via a separate processor context, and the CB is
retained by each operation's resource_set for the submission's lifetime.

The only new check: reject ALLOW_INLINE_EXECUTION without ONE_SHOT, per
the HAL spec (inline execution is inherently single-use).

Co-Authored-By: Claude <noreply@anthropic.com>
…ming.

iree_hal_semaphore_list_fail transfers ownership of the status to the last
semaphore (no clone). The drain_host_call error path passed call_status to
semaphore_list_fail and then assigned the same (now-consumed) pointer to
the local status variable for frontier failure propagation. Clone the
status first so both paths have valid references.

Co-Authored-By: Claude <noreply@anthropic.com>
… flags.

Two bugs found by running the full CTS file_tests suite:

1. The drain_fill/copy/update/dispatch paths set host_ptr, length, and
   flags on fixup entries but left offset and slot uninitialized. Block
   pool memory from prior recordings contained stale values. The
   resolve_bindings function adds fixup->offset to host_ptr, producing
   a corrupt pointer that reads past the buffer allocation. Set offset=0
   and slot=0 explicitly in all four drain paths.

2. Memory file buffer import used the file's access flags (READ|WRITE)
   as the buffer's allowed access. The drain_copy target mapping
   requests DISCARD_WRITE (WRITE|DISCARD), which failed the access
   check because DISCARD was missing. Add DISCARD to the import's
   access flags since the host-backed memory supports all access modes.

Co-Authored-By: Claude <noreply@anthropic.com>
…spatch.

The block processor hardcoded workgroup_state.processor_id = 0 for all
workers. This caused all workers to index the same per-worker state slot
(worker_states[0]) in the VMVX module loader, racing on the shared
module state's workgroup fields.

Thread the actual worker_index through process_region and
execute_dispatch_tiles to workgroup_state.processor_id. For single-worker
execution, this is always 0 (correct — only one worker). For multi-worker,
each worker gets its own slot.

Co-Authored-By: Claude <noreply@anthropic.com>
Three fixes applied together:

1. CLOSED_BIT protocol: Replace three separate atomic fields (generation,
   active_drainers, release_pending) with a single 64-bit drainers field:
   {gen(32) | count+CLOSED(32)}. fetch_or(CLOSED_BIT) atomically closes
   and returns count — no TOCTOU between checking count and setting the
   flag. The 64-bit generation prevents ABA on recycled pool items.

2. Back-pressure: When the compute pool is empty, COMMANDS and DISPATCH
   operations are pushed back to the ready list and the budget-1 process
   yields. compute_item_release wakes the process when a slot is returned.
   Pool size increased from 4 to 16 for deeper pipelining before
   back-pressure kicks in. Eliminates RESOURCE_EXHAUSTED under TSAN.

3. Terminal process re-scheduling fix (worker.c): After a process
   completes and its slot is released, schedule_state was set to IDLE
   unconditionally. This allowed schedule_process to CAS(IDLE→DRAINING)
   and re-place the process in a slot, causing completion and release
   callbacks to fire twice — driving scope.pending_submissions negative
   and deadlocking scope_wait_idle. Fix: only transition to IDLE if the
   process is not terminal.

Verified: 1000/1000 ASAN runs pass, 7/7 TSAN CTS pass.
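
A sketch of the packed-field protocol from fix 1, with an illustrative bit layout (the real field packs a 32-bit generation in the high half; only the low-half count + CLOSED logic is shown being exercised here):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Illustrative layout: generation in the high 32 bits, active-drainer
// count plus a CLOSED flag in the low 32 bits.
#define EXAMPLE_CLOSED_BIT (1ull << 31)
#define EXAMPLE_COUNT_MASK (EXAMPLE_CLOSED_BIT - 1)

// Worker entering drain: bump the count (generation unchanged).
static void example_enter(_Atomic uint64_t* drainers) {
  atomic_fetch_add(drainers, 1);
}

// Eager completer: fetch_or atomically sets CLOSED and observes the
// count in one step, leaving no TOCTOU window between "check count" and
// "set flag". Returns true if no drainers remain, in which case the
// closer itself must run the deferred release.
static bool example_close(_Atomic uint64_t* drainers) {
  uint64_t prior = atomic_fetch_or(drainers, EXAMPLE_CLOSED_BIT);
  return (prior & EXAMPLE_COUNT_MASK) == 0;
}

// Worker exiting drain: the last one out of a CLOSED item runs the
// release; everyone else just decrements and leaves.
static bool example_exit(_Atomic uint64_t* drainers) {
  uint64_t prior = atomic_fetch_sub(drainers, 1);
  return (prior & EXAMPLE_CLOSED_BIT) != 0 &&
         (prior & EXAMPLE_COUNT_MASK) == 1;
}
```

The single-RMW close is what removes the race: either the closer sees a zero count and releases, or some drainer's exit observes CLOSED with count 1 and releases — never both, never neither.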

Co-Authored-By: Claude <noreply@anthropic.com>
Block processor: cache the current region's barrier pointer in block
state, eliminating the O(dispatches-before-region) linear walk that
every worker performed on each drain() call. The completer stores the
next barrier during region transitions (walking from process_region's
next_cmd output through any empty barriers), published via the existing
region_epoch release. Workers load it with a single relaxed atomic read.

Add IREE_HAL_CMD_BLOCK_PROCESSOR_COMPLETER_REENTER tuning flag (default
off) that lets the completer loop back within drain() to process the
next region immediately after a barrier, skipping the ~30-50ns pump
round-trip. Tradeoff: reduces barrier-crossing latency to near-zero but
delays immediate process draining and wake relay. Similar to the
IREE_TASK_WAKE_FANOUT tradeoff — needs real workload analysis to
determine the right default.

Executor benchmark: skip benchmarks requesting more workers than
hardware_concurrency() to prevent timeouts on small CI runners.

Fix iree-bazel-lib warning count: grep -c returns exit code 1 on zero
matches, triggering || echo 0 which appended a second "0" to stdout,
producing "0\n0" that [[ -gt ]] could not parse.

Co-Authored-By: Claude <noreply@anthropic.com>
…le access.

Use iree_allocator_malloc_aligned for structs with cache-line-aligned members
(iree_task_executor_t, iree_hal_task_device_t) and add iree_arena_allocate_aligned
for arena-allocated block state. Fix left-shift-of-negative UB in semaphore
failure value decoding by performing the shift in unsigned. Allow unaligned
memory file imports when vmfb content isn't naturally aligned.

Co-Authored-By: Claude <noreply@anthropic.com>
…ess flags.

The HAL module constructs alloca params with .access=0, relying on
canonicalize to promote to ALL. The sync driver gets this for free via
iree_hal_allocator_allocate_buffer, but our transient buffer path
skipped canonicalization, creating buffers with zero access that failed
on any subsequent map/fill operation.

Co-Authored-By: Claude <noreply@anthropic.com>
The transient buffer wrapper stored the caller's requested params at creation
time, but the backing buffer allocated later by the heap allocator has
adjusted metadata (e.g. HOST_VISIBLE added, access flags canonicalized).
Sync memory_type, allowed_access, and allowed_usage from the backing buffer
at commit time so validation sees the actual capabilities.

Co-Authored-By: Claude <noreply@anthropic.com>
…ers.

Install new recordings directly into compute_current via CAS instead of
always routing through compute_pending. This eliminates a pending-to-current
hop that required an extra worker pump cycle. Moved tag helper functions and
COMPUTE_NULL_TAG constant earlier in the file so they're available from
drain_recording.

Co-Authored-By: Claude <noreply@anthropic.com>
…inning on CLOSED.

Workers that got zero tiles on a region returned did_work=false, causing
them to park and miss all subsequent regions. Changed to did_work=true
whenever an active recording was entered, keeping workers in the pump loop
for the duration of the recording.

The CLOSED flag bail-out was returning did_work=true, causing workers to
spin on a completed recording (fetch_add/fetch_sub loop preventing the
drainer count from reaching zero). Changed to did_work=false since there
is no more work on a closed item — the next recording will re-wake via
schedule_process.

Megakernel b8s128 w=32: 168ms → 7.7ms (parity with upstream 7.65ms).

Co-Authored-By: Claude <noreply@anthropic.com>
benvanik and others added 14 commits March 18, 2026 01:26
… helpers.

Adds comprehensive CTS coverage for queue operations and command buffer
reuse patterns that were previously untested:

queue_transfer_test.cc (16 tests):
  queue_fill with 1/2/4-byte patterns, subranges, large buffers (256KB).
  queue_update from host data with offsets.
  queue_copy with source/target offsets, large buffers.
  Chained operations ordered by semaphores (fill→copy, update→copy,
  fill→copy→fill pipeline).
  queue_barrier signaling and ordering preservation.

dispatch_reuse_test.cc (12 tests):
  Reusable command buffers (MODE_DEFAULT) — the hot path for model
  execution that had zero CTS coverage. Tests record once, submit
  multiple times with different binding tables.
  Large workgroup counts (1024 — existing tests max at 32).
  alloca→execute→dealloca cycles with transient buffers.
  Pipelined alloca+execute for independent transient buffers.
  Multi-dispatch command buffers with barriers between dispatches.
  Multi-dispatch multi-resubmit (5 iterations).

test_base.h improvements:
  Ref<T> RAII template for HAL objects (buffer, command_buffer,
  semaphore, executable, file, fence) with HalTraits specializations.
  SemaphoreList factored from duplicated code in queue_alloca_test.cc
  and queue_host_call_test.cc into shared header.
  ReadBufferData<T> and ReadBufferBytes helpers for verification.

Co-Authored-By: Claude <noreply@anthropic.com>
Tests fill, copy, and update through command buffers targeting transient
buffers from queue_alloca. All existing command buffer transfer tests
used only regular (synchronously allocated) buffers — transient buffers
have a different backing mechanism and were the source of access-flag
propagation bugs.

transient_buffer_test.cc (8 tests):
  Fill transient with 1-byte and 4-byte patterns.
  Fill transient subrange with boundary verification.
  Copy regular→transient and transient→regular.
  Fill + barrier + copy in a single command buffer.
  Update transient from host data.
  Fill transient allocated with zero access flags (HAL module convention).

The zero-access-flags test specifically targets the bug documented in
fuck.md where command_buffer.fill_buffer on transient buffers hit the
same access-flags issue as queue_alloca. The fix in e2262aebe12e
(propagating backing buffer metadata at commit time) appears to have
fixed both code paths — these tests serve as regression coverage.

Co-Authored-By: Claude <noreply@anthropic.com>
Tests data flow through multi-stage dispatch pipelines using the
scale_and_offset kernel (output[i] = input[i] * scale + offset).
Each stage produces verifiable results, catching barrier ordering
bugs and data visibility issues between operations.

dispatch_pipeline_test.cc (10 tests across 2 backends):
  Host data → dispatch → verify (single-stage pipeline).
  Chained dispatches: dispatch A output feeds dispatch B input via
  barrier, verifying both intermediate and final results.
  Three-stage pipeline: update_buffer → dispatch → dispatch in a
  single command buffer (transfer→dispatch→dispatch transitions).
  Transient input pipeline: alloca → fill → dispatch → persistent
  output (the real model execution pattern).
  Reusable pipeline: record two-stage dispatch chain once, re-submit
  3 times with different input data via binding tables.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds 8 new file tests covering ordering, subrange operations, and
read-modify-write pipelines for both memory files and FD files:

Memory file tests:
  Chained read→write via semaphore timeline (no host waits between
  operations — tests queue ordering correctness).
  Read and write subranges with boundary verification.
  Read-modify-write: read from source file, modify buffer via
  queue_fill, write to target file. Verifies both halves.

FD file tests (guarded by IREE_FILE_IO_ENABLE):
  Write subrange with boundary verification via re-read.
  Chained read→write between two FD files via semaphore timeline.
  Read-modify-write on a single FD file.
  Large file read (256KB) with full content verification.

Co-Authored-By: Claude <noreply@anthropic.com>
The transient buffer wrapper created by queue_alloca reported raw caller
params (e.g. DEVICE_LOCAL without HOST_VISIBLE) until the backing buffer
was committed. On CPU backends where the heap allocator adds HOST_VISIBLE
to all buffers, this caused command buffer validation to reject transient
buffers with PERMISSION_DENIED before the alloca drain fired.

Fix: run params through iree_hal_allocator_query_buffer_compatibility
before creating the transient wrapper, matching what allocate_buffer does
internally. The wrapper now reports the allocator-adjusted memory type
from creation.

Co-Authored-By: Claude <noreply@anthropic.com>
…-region corruption.

The block processor's remaining_tiles counter was a plain int32 shared
across sequential regions. When a completer advanced to the next region
(calling init_region which stores a new remaining_tiles value), stale
workers from the completed region could still have pending fetch_sub
operations. These stale decrements applied to the NEW region's count,
pushing it negative. With remaining_tiles at -1, the completer election
check (old - my_tiles == 0) could never fire, and the recording hung
forever — no worker became the completer, no cleanup, deadlock.

The fix gives remaining_tiles the same structural protection as
tile_index: epoch-tagged 64-bit atomic with CAS-based decrement. Workers
validate the epoch before decrementing. If the completer already advanced
(new epoch), the stale worker's CAS fails harmlessly — exactly like
stale tile_index CAS failures. The completer's init_region stores the
new epoch|count atomically, and new-region workers see the correct value.

Also adds CTS stress tests for rapid repeated command buffer submission.

Co-Authored-By: Claude <noreply@anthropic.com>
…d compute_current.

The recording item pool was a fixed array of 4 items embedded in the
queue struct. When all items were in-flight (being drained or awaiting
deferred release), submissions failed with RESOURCE_EXHAUSTED. This
is unacceptable — correct API usage must never hit a hard pool limit.

Replace the fixed array with arena-allocated items:
- Items are bump-allocated from an arena backed by the large block pool
  with cache-line alignment and a trailing worker_states[] FAM sized to
  the actual worker_count (was fixed at MAX_WORKER_COUNT=256).
- The free pool slist is unchanged — items cycle through it as before.
- When the free pool is empty, a new item is allocated from the arena
  inline. No RESOURCE_EXHAUSTED, no blocking.
- An all-items linked list tracks allocated items for shutdown cleanup.
- The arena is deinitialized at queue shutdown, returning all blocks.

Switch compute_current from tagged int64 (generation|pool_index) to a
direct item pointer (iree_atomic_intptr_t). This eliminates the index
lookup, the tag construction/extraction helpers, and the null sentinel
check against POOL_SIZE. ABA protection is provided by the drainers
field's generation (checked after fetch_add) and the compute_current
pointer re-check (catches recycled items).

Initial pool size increased from 4 to 16 to match typical pipeline
depths without requiring growth in the common case.

Co-Authored-By: Claude <noreply@anthropic.com>
The block command buffer maps direct buffer bindings at recording time
via iree_hal_buffer_map_range to get host pointers for inline fixups.
Transient buffers from queue_alloca have no backing memory until the
alloca operation drains (the backing is allocated asynchronously and
committed via semaphore ordering). This caused FAILED_PRECONDITION when
recording dispatches that reference queue_alloca'd buffers — the common
case for __init with io_parameters.

Add a DEFERRED fixup flag: when map_range fails at recording time, store
the buffer pointer in the fixup instead of the host pointer. At drain
time (when the buffer is guaranteed to be committed), resolve_bindings
maps it then. The buffer is retained by the CB's resource_set, so the
pointer is stable through drain.

Co-Authored-By: Claude <noreply@anthropic.com>
All three proactor backends (io_uring, IOCP, POSIX) silently truncated
file reads larger than ~2-4GB due to kernel API size limits (uint32_t
SQE len, DWORD ReadFile parameter, undefined pread behavior for
counts > INT_MAX). Loading a 5GB IRPA parameter slab failed with a
short read because sqe->len = (uint32_t)buffer.length wrapped.

Fix in two parts:
- Cap the per-operation read/write size at INT32_MAX in each backend.
  This prevents silent truncation — the backend reads up to 2GB per
  submission and reports the actual bytes transferred.
- Retry partial reads/writes in the task_queue's completion handlers.
  When bytes_transferred < requested_length and no error, the handler
  advances the buffer/offset and resubmits for the remaining bytes.
  Accumulates total_bytes_transferred across resubmissions. Only
  reports short read/write when bytes_transferred == 0 (EOF) before
  the full length is reached.

Co-Authored-By: Claude <noreply@anthropic.com>
…ling.

Two changes that together improve multi-worker throughput by ~19% on the
Qwen3.5-4B level2 decode benchmark at 16 workers:

1. Dynamic worker_budget at region transitions: The completer updates the
   compute process's worker_budget based on the next region's tile count
   (min(tiles, worker_count)). When ramping up, adds wake credits to the
   executor's desired_wake so the relay mechanism wakes additional workers.
   This prevents waking 192 workers for a 1-tile region.

2. CLOSED bail-out returns did_work=true: When a recording completes,
   workers that hit the CLOSED flag at entry were returning did_work=false,
   causing them to sleep via futex. But the completer has already installed
   the next pending recording — workers just need to loop back. Returning
   did_work=true keeps workers active across recording boundaries,
   eliminating futex round-trips between recordings.

Co-Authored-By: Claude <noreply@anthropic.com>
Workers previously checked the immediate list (budget-1 processes) before
compute slots on every pump iteration. The immediate list uses a
mutex-guarded slist — with 16 workers all popping an empty list every
iteration, the futex contention accounted for 34% of kernel overhead.

Reorder: check compute slots first. If compute work was found, skip the
immediate list entirely. Workers doing tile execution never touch the
immediate list mutex. Only idle workers (no compute work) check the
immediate list, so exactly one picks up the budget-1 control process.

Reduces context switches by 23x (27,002 → 1,164) and eliminates the
16-worker performance regression on the Qwen3.5-4B decode benchmark.

Co-Authored-By: Claude <noreply@anthropic.com>
When an fd_file has no async file handle (proactor import failed or
unavailable), the queue read/write operations now fall back to
synchronous pread/pwrite via the HAL file vtable. Previously this
was a hard error ("no storage buffer and no async handle").

The sync fallback executes inline on the budget-1 control process
worker thread during drain. This blocks that worker for the I/O
duration, which is acceptable: it's the same thread that would have
been waiting for the proactor callback anyway, and the alternative
was "don't work at all."

The async proactor path remains preferred when available. The fallback
only activates when async_file is NULL (import failed or platform
doesn't support async fd import, e.g. Windows IOCP).

Co-Authored-By: Claude <noreply@anthropic.com>
…data.

transient_buffer_commit was writing memory_type, allowed_access, and
allowed_usage on the worker thread during alloca drain. Meanwhile,
HAL submission validation reads those same fields synchronously on
the submitting thread (queue_execute → validate_binding_requirements).
The semaphore dependency between alloca and execute gates the
operations but not the pre-submission validation, so TSAN correctly
reported a data race.

Fix: don't overwrite the metadata in commit. The buffer was already
initialized with the caller's requested params at creation time. The
backing buffer's params can only be a superset (allocator adds
capabilities like HOST_COHERENT, never removes requested ones), so
validation against requested params is conservative-correct. The
actual data path (map/unmap/flush/invalidate) forwards to the
committed buffer's vtable which uses its own metadata, so coherency
handling is unaffected.

Co-Authored-By: Claude <noreply@anthropic.com>
The eager completer called consume_result (atomic exchange on
context->error_status) inside compute_item_complete, then did
fetch_sub on drainers. The deferred releaser observed the
decremented drainers count, fired context_free — which frees the
memory containing error_status. TSAN reported the race between the
exchange and the free because the ordering went through an indirect
atomic chain (drainers) that TSAN could not trace.

Fix: consume the processor result in the drain function BEFORE the
CLOSED fetch_or, while the worker is still a registered drainer.
This ensures the exchange on error_status completes before any
fetch_sub that could enable the deferred release path. The consumed
status is passed directly to compute_item_complete. Workers that
lose the CLOSED race discard their snapshot.

Co-Authored-By: Claude <noreply@anthropic.com>
@benvanik benvanik added runtime Relating to the IREE runtime library hal/cpu Runtime Host/CPU-based HAL backend labels Mar 18, 2026