
Rewrite local_sync queue operations as direct inline implementations.#23826

Draft
benvanik wants to merge 11 commits into users/benvanik/cpu-0 from users/benvanik/cpu-1

Conversation

@benvanik
Collaborator

The local_sync driver is the synchronous, single-threaded HAL backend — every queue operation blocks the calling thread until completion. Despite this, it was routing operations through heavyweight abstractions designed for async backends:

  • File I/O went through the streaming file transfer machinery (file_transfer.h): staging buffer allocation, transfer worker management, chunked semaphore chains. This caused FdFileReadRangeValidation to deadlock because the error path got tangled in semaphore signaling.

  • Fill, update, and copy went through queue emulation (queue_emulation.h): each operation created a one-shot command buffer, recorded a single command, called queue_execute to replay it, then destroyed the command buffer.

  • Dispatch similarly created a throwaway command buffer for each dispatch call.

All of this is unnecessary overhead for a driver that executes everything inline. The fix: replace every queue operation with the same three-line pattern:

IREE_RETURN_IF_ERROR(queue_op_begin(device, wait_semaphores));
iree_status_t status = do_the_thing();
return queue_op_end(device, signal_semaphores, status);

queue_op_begin waits on semaphores. queue_op_end signals on success, fails on error, and advances the frontier tracker. Both are static inline helpers — the compiler eliminates them entirely.

What changed

  • queue_read/write: Direct iree_hal_file_read/write calls with range validation (fixes the deadlock). Memory files still fast-path through queue_copy via the storage buffer.

  • queue_fill/update/copy: Direct iree_hal_buffer_map_fill, map_write, and map-source + map_write respectively. No command buffers.

  • queue_dispatch: New shared utility iree_hal_local_executable_dispatch_inline in hal/local/ that maps bindings, populates the dispatch state, and iterates all workgroups. ~75 lines that replace the command buffer round-trip. Available for local_task's inline dispatch path as a future simplification.

  • begin/end factoring: The 15-line wait/signal/fail/frontier pattern that was copy-pasted across 9 queue operations is now two inline helpers. The simplest operations (dealloca) are three lines of body.

Result

local_sync has zero dependency on file_transfer.h and queue_emulation.h. Net deletion of ~105 lines while adding file I/O support, range validation, inline dispatch, and fixing a deadlock. The driver is now a clear, readable expression of "wait, do the thing, signal" — which is all a synchronous driver should be.

benvanik and others added 11 commits March 18, 2026 01:18
The slist head pointer is now atomic, enabling a relaxed load before
taking the mutex. Empty-list pops return immediately without any
futex syscall. This eliminates the common case where workers poll an
empty list every pump iteration.

The mutex is retained for push/concat/discard (ABA-safe for all users).
A task-system-specific lock-free list can replace this for the hot
paths where entries are pool-managed and ABA is not a concern.

Co-Authored-By: Claude <noreply@anthropic.com>
Memory files wrapping embedded VMFB data (host allocations from
FlatBuffer) were failing to import as HAL buffers because the data
was 16-byte aligned but the heap allocator requires 64-byte alignment.
The import failure left the memory file with no storage_buffer,
causing queue.read to fail with "no storage buffer and no async
handle" — breaking all VMFBs with embedded parameters on local-task.

Fix: set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params.
This bypasses the alignment check in heap_buffer_wrap, which is a
performance preference, not a correctness requirement for host memory.
The proper fix is to ensure VMFBs align their embedded data to 64
bytes at compile time.

Co-Authored-By: Claude <noreply@anthropic.com>
Defense-in-depth: the synchronous platform functions
(iree_hal_platform_fd_pread/pwrite) accepted iree_host_size_t counts
but cast directly to DWORD (Windows) or passed unchecked to pread
(POSIX). While current callers already cap at INT_MAX, the platform
functions should not rely on caller discipline — a future caller
passing >4GB would silently truncate on Windows or hit kernel EINVAL
on some Linux versions.

Co-Authored-By: Claude <noreply@anthropic.com>
Previously, if iree_async_file_import failed (unsupported fd type,
platform limitations, etc.), the entire iree_hal_fd_file_from_handle
call failed — preventing the file from being used at all. This was
overly conservative: async I/O is a performance optimization, not a
correctness requirement. Synchronous pread/pwrite is always available
as a fallback.

Now the async import failure is silently absorbed: the file continues
with async_file=NULL and all reads/writes go through the synchronous
platform functions. Callers already check supports_synchronous_io()
and async_handle() to select the I/O path.

Co-Authored-By: Claude <noreply@anthropic.com>
If iree_task_worker_initialize fails partway through (e.g., thread
creation failure on worker 5), iree_task_executor_destroy loops over
all worker_count workers. Workers beyond the failure point were never
initialized — their memory is zero-filled from iree_allocator_malloc
but iree_notification_initialize was never called on them.

Calling iree_notification_deinitialize on a never-initialized
notification invokes pthread_mutex_lock on a never-initialized mutex.
On glibc this happens to work (PTHREAD_MUTEX_INITIALIZER is all-zeros)
but it is undefined behavior per POSIX and breaks on other
implementations.

Add an early-out guard checking worker->executor, which is set at the
start of iree_task_worker_initialize before notification init. For
workers that never entered initialize, it is NULL from the zero-filled
allocation. Also update the stale comment at the thread creation
failure site to describe the actual safety mechanism.

Co-Authored-By: Claude <noreply@anthropic.com>
…rflow).

The sysfs topology code used fixed-size cpu_set_t (glibc CPU_SETSIZE=1024)
to accumulate processor IDs from shared_cpu_list, then converted to group
mask. On machines with >1024 logical CPUs (EPYC Bergamo, multi-socket with
sparse APIC IDs), CPU_SET(cpu, ...) with cpu >= 1024 is an out-of-bounds
write.

Replace the intermediate cpu_set_t with a direct group mask build:
the parse callback now scans topology groups for each CPU range in the
list, setting bits for groups whose processor_index falls in the range.
O(ranges_in_list x group_count) per group — both are small.

The domain enumeration path (SCATTER distribution) retains its own
cpu_set_t-based functions since it needs processor-level masks for
domain grouping. Added a CPU_SETSIZE bounds check there as a safety
measure — the broader domain enumeration cpu_set_t issue is tracked
separately.

Co-Authored-By: Claude <noreply@anthropic.com>
Both acquire_device_signal and acquire_device_wait acquired device
events from the pool before acquiring timepoints. If the timepoint
acquisition failed (OOM), IREE_RETURN_AND_END_ZONE_IF_ERROR returned
without releasing the already-acquired events back to the pool.

Fix: catch the timepoint acquisition error, release all acquired
events, then return the error.

Co-Authored-By: Claude <noreply@anthropic.com>
The sync driver was routing file reads/writes through the streaming
file transfer machinery (file_transfer.h), which allocates staging
buffers, manages transfer workers, and chains semaphore operations.
This was wildly over-engineered for a synchronous driver that
executes everything inline on the calling thread.

Replaced with the same wait-do-signal pattern used by queue_alloca:
wait on input semaphores, call iree_hal_file_read/write directly,
signal output semaphores. Memory files still take the fast path
through queue_copy via the storage buffer.

This fixes the FdFileReadRangeValidation deadlock (bd-12x): range
validation now happens before entering any semaphore machinery,
returning OUT_OF_RANGE immediately for reads past EOF. Removed
the xfail entry and the file_transfer.h dependency.

bd-3kg (AllocaWithWaitSemaphores) remains a valid xfail: the test
spawns the signaling thread after queue_alloca, which blocks
synchronously on local_sync.

Co-Authored-By: Claude <noreply@anthropic.com>
…al_sync.

The sync driver was wrapping fill, update, and copy operations in
one-shot command buffers via queue_emulation, then executing them
through queue_execute. This creates and destroys a command buffer
for each operation — unnecessary overhead for a synchronous driver.

Replaced with direct inline implementations:
- fill: iree_hal_buffer_map_fill (map, memset pattern, unmap)
- update: iree_hal_buffer_map_write (map target, memcpy, unmap)
- copy: map source, iree_hal_buffer_map_write to target, unmap

All follow the same wait-do-signal pattern as queue_alloca and the
queue_read/write rewrite from the previous commit. Dispatch remains
emulated (it genuinely needs command buffer infrastructure).

Co-Authored-By: Claude <noreply@anthropic.com>
Extracts the core single-threaded dispatch logic into a shared utility
in hal/local/inline_dispatch.h: map bindings, populate dispatch state,
iterate all workgroups via issue_dispatch_inline, unmap. ~75 lines of
straightforward code that replaces the command buffer round-trip.

local_sync now uses this directly for queue_dispatch instead of the
emulated path (which created a command buffer, recorded one dispatch,
then called queue_execute to replay it). This eliminates all local_sync
dependencies on both queue_emulation and file_transfer — every queue
operation is now a direct inline implementation.

The inline_dispatch utility is also available for local_task's inline
dispatch path (ALLOW_INLINE_EXECUTION) as a future simplification.

Co-Authored-By: Claude <noreply@anthropic.com>
Every queue operation (alloca, dealloca, fill, update, copy, read,
write, dispatch, execute) followed the same 15-line pattern: wait on
semaphores, do one thing, signal-or-fail semaphores, advance frontier.
The only variation was the 1-5 lines in the middle.

Extracted queue_op_begin (semaphore wait) and queue_op_end (signal/fail
+ frontier advance) as static inline helpers. Each queue operation is
now: IREE_RETURN_IF_ERROR(begin), do the thing, tail-call end. The
simplest ops (dealloca) are three lines of body.

Co-Authored-By: Claude <noreply@anthropic.com>
@benvanik benvanik added the hal/cpu Runtime Host/CPU-based HAL backend label Mar 18, 2026
@benvanik benvanik changed the base branch from main to users/benvanik/cpu-0 March 18, 2026 08:34
@benvanik benvanik force-pushed the users/benvanik/cpu-0 branch from a43c638 to 911f7b4 Compare March 19, 2026 03:37