Rewrite local_sync queue operations as direct inline implementations. #23826
Draft
benvanik wants to merge 11 commits into users/benvanik/cpu-0 from
Conversation
The slist head pointer is now atomic, enabling a relaxed load before taking the mutex. Empty-list pops return immediately without any futex syscall. This eliminates the common case where workers poll an empty list every pump iteration. The mutex is retained for push/concat/discard (ABA-safe for all users). A task-system-specific lock-free list can replace this for the hot paths where entries are pool-managed and ABA is not a concern. Co-Authored-By: Claude <noreply@anthropic.com>
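The pattern described above can be sketched as follows. This is a minimal illustration, not IREE's actual slist API: the node and list types, and the use of a plain pthread mutex, are assumptions. The key idea is the relaxed atomic probe of the head before the lock, so empty-list pops never touch the mutex (and thus never reach a futex syscall).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical node/list types; names are illustrative, not IREE's.
typedef struct node_t {
  struct node_t* next;
} node_t;

typedef struct {
  _Atomic(node_t*) head;  // atomic so it can be probed without the lock
  pthread_mutex_t mutex;  // still guards push/pop/concat (ABA-safe)
} atomic_slist_t;

// Pop one entry, or return NULL if the list is empty. The relaxed load
// lets empty-list pops bail out before taking the mutex.
static node_t* atomic_slist_pop(atomic_slist_t* list) {
  if (atomic_load_explicit(&list->head, memory_order_relaxed) == NULL) {
    return NULL;  // common case: workers polling an empty list
  }
  pthread_mutex_lock(&list->mutex);
  node_t* node = atomic_load_explicit(&list->head, memory_order_relaxed);
  if (node) {
    atomic_store_explicit(&list->head, node->next, memory_order_relaxed);
  }
  pthread_mutex_unlock(&list->mutex);
  return node;
}

static void atomic_slist_push(atomic_slist_t* list, node_t* node) {
  pthread_mutex_lock(&list->mutex);
  node->next = atomic_load_explicit(&list->head, memory_order_relaxed);
  atomic_store_explicit(&list->head, node, memory_order_relaxed);
  pthread_mutex_unlock(&list->mutex);
}
```

A relaxed load is sufficient for the probe because a stale non-NULL read just falls through to the locked path, and a stale NULL read is indistinguishable from the list having been empty a moment earlier.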
Memory files wrapping embedded VMFB data (host allocations from FlatBuffer) were failing to import as HAL buffers because the data was 16-byte aligned but the heap allocator requires 64-byte alignment. The import failure left the memory file with no storage_buffer, causing queue.read to fail with "no storage buffer and no async handle" — breaking all VMFBs with embedded parameters on local-task. Fix: set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params. This bypasses the alignment check in heap_buffer_wrap, which is a performance preference, not a correctness requirement for host memory. The proper fix is to ensure VMFBs align their embedded data to 64 bytes at compile time. Co-Authored-By: Claude <noreply@anthropic.com>
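The shape of the alignment check being bypassed can be sketched like this. The constant names and function are hypothetical stand-ins, not IREE's real `heap_buffer_wrap` signature; the point is that the 64-byte requirement is gated on an UNALIGNED access flag rather than being unconditional.

```c
#include <stdbool.h>
#include <stdint.h>

// Illustrative names; MEMORY_ACCESS_UNALIGNED stands in for
// IREE_HAL_MEMORY_ACCESS_UNALIGNED.
enum { MEMORY_ACCESS_UNALIGNED = 1u << 0 };
#define HEAP_BUFFER_ALIGNMENT 64u

// Sketch of the wrap-time check: the 64-byte alignment is a performance
// preference for host memory, so the UNALIGNED flag may bypass it.
static bool heap_buffer_wrap_check(uintptr_t data_ptr, uint32_t access_flags) {
  if (access_flags & MEMORY_ACCESS_UNALIGNED) return true;  // caller opted out
  return (data_ptr % HEAP_BUFFER_ALIGNMENT) == 0;
}
```

Under this sketch, a 16-byte-aligned FlatBuffer allocation fails the check without the flag and passes with it, which is exactly the failure mode the commit describes.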
Defense-in-depth: the synchronous platform functions (iree_hal_platform_fd_pread/pwrite) accepted iree_host_size_t counts but cast directly to DWORD (Windows) or passed unchecked to pread (POSIX). While current callers already cap at INT_MAX, the platform functions should not rely on caller discipline — a future caller passing >4GB would silently truncate on Windows or hit kernel EINVAL on some Linux versions. Co-Authored-By: Claude <noreply@anthropic.com>
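The validation being added can be sketched as below. The function name and error convention are illustrative, not the actual `iree_hal_platform_fd_pread`/`pwrite` signatures; the essential behavior is rejecting counts that would silently truncate in a 32-bit cast.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

// Sketch: validate a host-size byte count before handing it to an OS call
// that takes a 32-bit length (DWORD on Windows) or has kernel-side limits
// (some Linux pread implementations reject counts above INT_MAX).
static int validate_io_count(size_t count, uint32_t* out_count) {
  *out_count = 0;
  if (count > (size_t)INT32_MAX) {
    return EINVAL;  // would silently truncate in a DWORD cast
  }
  *out_count = (uint32_t)count;
  return 0;
}
```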
Previously, if iree_async_file_import failed (unsupported fd type, platform limitations, etc.), the entire iree_hal_fd_file_from_handle call failed — preventing the file from being used at all. This was overly conservative: async I/O is a performance optimization, not a correctness requirement. Synchronous pread/pwrite is always available as a fallback. Now the async import failure is silently absorbed: the file continues with async_file=NULL and all reads/writes go through the synchronous platform functions. Callers already check supports_synchronous_io() and async_handle() to select the I/O path. Co-Authored-By: Claude <noreply@anthropic.com>
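The absorb-and-fall-back behavior can be sketched as follows. All types and names here are illustrative stand-ins (a real `iree_status_t` is not an `int`, and the import callback is hypothetical); the point is that the async import's failure status is deliberately discarded rather than propagated.

```c
#include <stddef.h>

// Illustrative stand-ins for the real types.
typedef int status_t;               // 0 == OK
typedef struct { int dummy; } async_file_t;

typedef struct {
  async_file_t* async_file;         // NULL => synchronous pread/pwrite fallback
} fd_file_t;

// Sketch of the fallback: async import is an optimization, so its failure
// is absorbed and the file continues with async_file == NULL.
static status_t fd_file_initialize(
    fd_file_t* file, status_t (*try_async_import)(async_file_t**)) {
  file->async_file = NULL;
  async_file_t* async_file = NULL;
  if (try_async_import(&async_file) == 0) {
    file->async_file = async_file;  // fast path available
  }
  // Failure intentionally ignored: callers probe async_handle() /
  // supports_synchronous_io() to pick the I/O path.
  return 0;
}
```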
If iree_task_worker_initialize fails partway through (e.g., thread creation failure on worker 5), iree_task_executor_destroy loops over all worker_count workers. Workers beyond the failure point were never initialized — their memory is zero-filled from iree_allocator_malloc but iree_notification_initialize was never called on them. Calling iree_notification_deinitialize on a never-initialized notification invokes pthread_mutex_lock on a never-initialized mutex. On glibc this happens to work (PTHREAD_MUTEX_INITIALIZER is all-zeros) but it is undefined behavior per POSIX and breaks on other implementations. Add an early-out guard checking worker->executor, which is set at the start of iree_task_worker_initialize before notification init. For workers that never entered initialize, it is NULL from the zero-filled allocation. Also update the stale comment at the thread creation failure site to describe the actual safety mechanism. Co-Authored-By: Claude <noreply@anthropic.com>
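The guard can be sketched like this. The worker struct is a hypothetical reduction (a pthread mutex stands in for `iree_notification_t`): because `executor` is assigned before the notification is initialized, a zero-filled worker that never entered initialize has `executor == NULL` and can be skipped safely.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

// Illustrative worker struct: `executor` is set at the start of worker
// initialize, before the notification is initialized, so a zero-filled
// (never-initialized) worker has executor == NULL.
typedef struct {
  void* executor;
  pthread_mutex_t notification_mutex;  // stands in for iree_notification_t
  bool torn_down;
} worker_t;

static void worker_deinitialize(worker_t* worker) {
  if (!worker->executor) return;  // never initialized: nothing to tear down
  pthread_mutex_destroy(&worker->notification_mutex);
  worker->torn_down = true;
}
```

Without the guard, this path would call into a never-initialized mutex, which happens to work on glibc (all-zeros is a valid initializer there) but is undefined behavior per POSIX.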
…rflow). The sysfs topology code used fixed-size cpu_set_t (glibc CPU_SETSIZE=1024) to accumulate processor IDs from shared_cpu_list, then converted to group mask. On machines with >1024 logical CPUs (EPYC Bergamo, multi-socket with sparse APIC IDs), CPU_SET(cpu, ...) with cpu >= 1024 is an out-of-bounds write. Replace the intermediate cpu_set_t with a direct group mask build: the parse callback now scans topology groups for each CPU range in the list, setting bits for groups whose processor_index falls in the range. O(ranges_in_list x group_count) per group — both are small. The domain enumeration path (SCATTER distribution) retains its own cpu_set_t-based functions since it needs processor-level masks for domain grouping. Added a CPU_SETSIZE bounds check there as a safety measure — the broader domain enumeration cpu_set_t issue is tracked separately. Co-Authored-By: Claude <noreply@anthropic.com>
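The bounds check added on the domain enumeration path can be sketched as below; the wrapper name is hypothetical. On glibc, `CPU_SET` with `cpu >= CPU_SETSIZE` writes past the end of the fixed-size `cpu_set_t`, so out-of-range IDs must be rejected before the macro runs.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

// Sketch of the safety check: CPU_SET with cpu >= CPU_SETSIZE is an
// out-of-bounds write on glibc's fixed-size cpu_set_t, so reject IDs
// beyond the set (e.g. on >1024-logical-CPU machines) instead of
// accumulating them.
static bool checked_cpu_set(int cpu, cpu_set_t* set) {
  if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
  CPU_SET(cpu, set);
  return true;
}
```

The main fix avoids the intermediate `cpu_set_t` entirely by building the group mask directly, which has no fixed-size limit; this wrapper only covers the path that still needs processor-level masks.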
Both acquire_device_signal and acquire_device_wait acquired device events from the pool before acquiring timepoints. If the timepoint acquisition failed (OOM), IREE_RETURN_AND_END_ZONE_IF_ERROR returned without releasing the already-acquired events back to the pool. Fix: catch the timepoint acquisition error, release all acquired events, then return the error. Co-Authored-By: Claude <noreply@anthropic.com>
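The cleanup-on-error shape can be sketched as follows. The pool type and function names are illustrative reductions of the real event pool; the essential change is that the timepoint-acquisition error is caught and the already-acquired events are released before the error propagates, instead of returning through an early-exit macro that skips cleanup.

```c
#include <stddef.h>

typedef int status_t;  // 0 == OK; names illustrative throughout

typedef struct {
  int outstanding;  // events currently checked out of the pool
} event_pool_t;

static void event_pool_acquire(event_pool_t* pool, int count) {
  pool->outstanding += count;
}
static void event_pool_release(event_pool_t* pool, int count) {
  pool->outstanding -= count;
}

// Sketch of the leak fix: events are acquired first; if the subsequent
// timepoint acquisition fails (e.g. OOM), release the events back to the
// pool before propagating the error.
static status_t acquire_device_signal(event_pool_t* pool, int event_count,
                                      status_t (*acquire_timepoints)(void)) {
  event_pool_acquire(pool, event_count);
  status_t status = acquire_timepoints();
  if (status != 0) {
    event_pool_release(pool, event_count);  // undo on the error path
    return status;
  }
  return 0;
}
```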
The sync driver was routing file reads/writes through the streaming file transfer machinery (file_transfer.h), which allocates staging buffers, manages transfer workers, and chains semaphore operations. This was wildly over-engineered for a synchronous driver that executes everything inline on the calling thread. Replaced with the same wait-do-signal pattern used by queue_alloca: wait on input semaphores, call iree_hal_file_read/write directly, signal output semaphores. Memory files still take the fast path through queue_copy via the storage buffer. This fixes the FdFileReadRangeValidation deadlock (bd-12x): range validation now happens before entering any semaphore machinery, returning OUT_OF_RANGE immediately for reads past EOF. Removed the xfail entry and the file_transfer.h dependency. bd-3kg (AllocaWithWaitSemaphores) remains a valid xfail: the test spawns the signaling thread after queue_alloca, which blocks synchronously on local_sync. Co-Authored-By: Claude <noreply@anthropic.com>
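The wait-do-signal rewrite can be sketched as below. The callbacks and status codes are hypothetical stand-ins for the semaphore list and `iree_hal_file_read` call; the structurally important part is that range validation happens before any semaphore machinery, which is what fixes the deadlock.

```c
#include <stdint.h>

typedef int status_t;  // 0 == OK; 2 stands in for OUT_OF_RANGE. Illustrative.

// Sketch of the wait-do-signal rewrite for queue_read: validate the range
// up front so a read past EOF returns OUT_OF_RANGE immediately instead of
// getting tangled in semaphore signaling.
static status_t queue_read_inline(uint64_t file_length, uint64_t offset,
                                  uint64_t length,
                                  status_t (*wait_semaphores)(void),
                                  status_t (*read_range)(uint64_t, uint64_t),
                                  void (*signal_semaphores)(void)) {
  if (offset > file_length || length > file_length - offset) {
    return 2;  // OUT_OF_RANGE, before any wait/signal
  }
  status_t status = wait_semaphores();   // wait
  if (status != 0) return status;
  status = read_range(offset, length);   // do (iree_hal_file_read in real code)
  if (status == 0) signal_semaphores();  // signal (real code fails semaphores
  return status;                         // on error rather than just returning)
}
```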
…al_sync. The sync driver was wrapping fill, update, and copy operations in one-shot command buffers via queue_emulation, then executing them through queue_execute. This creates and destroys a command buffer for each operation — unnecessary overhead for a synchronous driver. Replaced with direct inline implementations:
- fill: iree_hal_buffer_map_fill (map, memset pattern, unmap)
- update: iree_hal_buffer_map_write (map target, memcpy, unmap)
- copy: map source, iree_hal_buffer_map_write to target, unmap
All follow the same wait-do-signal pattern as queue_alloca and the queue_read/write rewrite from the previous commit. Dispatch remains emulated (it genuinely needs command buffer infrastructure).
Co-Authored-By: Claude <noreply@anthropic.com>
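The map-then-operate bodies can be sketched on plain memory. Plain pointers stand in for mapped HAL buffer ranges here (the real code maps, operates, unmaps); the fill follows the HAL convention of repeating a small pattern across the range.

```c
#include <stddef.h>
#include <string.h>

// Sketch of the direct fill: repeat the pattern across the mapped range
// (the HAL allows 1/2/4-byte patterns).
static void buffer_map_fill(void* target, size_t length,
                            const void* pattern, size_t pattern_length) {
  unsigned char* p = (unsigned char*)target;
  for (size_t i = 0; i < length; ++i) {
    p[i] = ((const unsigned char*)pattern)[i % pattern_length];
  }
}

// Sketch of the direct update/copy body: map target, memcpy, unmap.
static void buffer_map_write(void* target, const void* source, size_t length) {
  memcpy(target, source, length);
}
```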
Extracts the core single-threaded dispatch logic into a shared utility in hal/local/inline_dispatch.h: map bindings, populate dispatch state, iterate all workgroups via issue_dispatch_inline, unmap. ~75 lines of straightforward code that replaces the command buffer round-trip. local_sync now uses this directly for queue_dispatch instead of the emulated path (which created a command buffer, recorded one dispatch, then called queue_execute to replay it). This eliminates all local_sync dependencies on both queue_emulation and file_transfer — every queue operation is now a direct inline implementation. The inline_dispatch utility is also available for local_task's inline dispatch path (ALLOW_INLINE_EXECUTION) as a future simplification. Co-Authored-By: Claude <noreply@anthropic.com>
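The workgroup-iteration core can be sketched as follows. The types and entry-point shape are illustrative, not the actual `issue_dispatch_inline` signature; the structure matches the description: after bindings are mapped and the dispatch state is populated, every workgroup in the 3D grid is invoked inline on the calling thread.

```c
#include <stdint.h>

typedef struct {
  uint32_t x, y, z;  // workgroup ID passed to the executable entry point
} workgroup_id_t;

// Sketch of the inline dispatch core: iterate every workgroup in the grid
// and invoke the entry point inline. Single-threaded: no command buffer,
// no worker handoff.
static void dispatch_inline(uint32_t count_x, uint32_t count_y,
                            uint32_t count_z,
                            void (*entry_point)(workgroup_id_t id, void* state),
                            void* state) {
  for (uint32_t z = 0; z < count_z; ++z) {
    for (uint32_t y = 0; y < count_y; ++y) {
      for (uint32_t x = 0; x < count_x; ++x) {
        workgroup_id_t id = {x, y, z};
        entry_point(id, state);
      }
    }
  }
}
```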
Every queue operation (alloca, dealloca, fill, update, copy, read, write, dispatch, execute) followed the same 15-line pattern: wait on semaphores, do one thing, signal-or-fail semaphores, advance frontier. The only variation was the 1-5 lines in the middle. Extracted queue_op_begin (semaphore wait) and queue_op_end (signal/fail + frontier advance) as static inline helpers. Each queue operation is now: IREE_RETURN_IF_ERROR(begin), do the thing, tail-call end. The simplest ops (dealloca) are three lines of body. Co-Authored-By: Claude <noreply@anthropic.com>
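The begin/end factoring can be sketched as below. Everything here is a hypothetical reduction (status codes, callbacks, and an int frontier counter stand in for the real semaphore and frontier types); it shows the shape each op takes after the refactor: begin, a short body, tail-call end.

```c
typedef int status_t;  // 0 == OK; all names illustrative

// queue_op_begin: wait on the input semaphores.
static inline status_t queue_op_begin(status_t (*wait)(void)) {
  return wait();
}

// queue_op_end: signal on success / fail on error, advance the frontier.
static inline status_t queue_op_end(status_t status, void (*signal)(void),
                                    void (*fail)(void), int* frontier) {
  if (status == 0) signal();
  else fail();
  ++*frontier;
  return status;
}

// Example op body in the new shape (dealloca-like: do one thing).
static status_t queue_op_example(status_t (*wait)(void),
                                 status_t (*do_thing)(void),
                                 void (*signal)(void), void (*fail)(void),
                                 int* frontier) {
  status_t status = queue_op_begin(wait);
  if (status != 0) return status;
  return queue_op_end(do_thing(), signal, fail, frontier);
}
```

Because both helpers are `static inline` and trivially small, a compiler folds them into each op, so the factoring costs nothing at runtime.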
AWoloszyn approved these changes on Mar 18, 2026.
The local_sync driver is the synchronous, single-threaded HAL backend — every queue operation blocks the calling thread until completion. Despite this, it was routing operations through heavyweight abstractions designed for async backends:
- File I/O went through the streaming file transfer machinery (file_transfer.h): staging buffer allocation, transfer worker management, chunked semaphore chains. This caused FdFileReadRangeValidation to deadlock because the error path got tangled in semaphore signaling.
- Fill, update, copy went through queue emulation (queue_emulation.h): each operation created a one-shot command buffer, recorded a single command, called queue_execute to replay it, then destroyed the command buffer.
- Dispatch similarly created a throwaway command buffer for each dispatch call.
All of this is unnecessary overhead for a driver that executes everything inline. The fix: replace every queue operation with the same three-line pattern. queue_op_begin waits on semaphores; queue_op_end signals on success, fails on error, and advances the frontier tracker. Both are static inline helpers — the compiler eliminates them entirely.

What changed
- queue_read/write: Direct iree_hal_file_read/write calls with range validation (fixes the deadlock). Memory files still fast-path through queue_copy via the storage buffer.
- queue_fill/update/copy: Direct iree_hal_buffer_map_fill, iree_hal_buffer_map_write, and map-source + map_write respectively. No command buffers.
- queue_dispatch: New shared utility iree_hal_local_executable_dispatch_inline in hal/local/ that maps bindings, populates the dispatch state, and iterates all workgroups. ~75 lines that replace the command buffer round-trip. Available for local_task's inline dispatch path as a future simplification.
- begin/end factoring: The 15-line wait/signal/fail/frontier pattern that was copy-pasted across 9 queue operations is now two inline helpers. The simplest operations (dealloca) are three lines of body.
Result
local_sync has zero dependency on file_transfer.h and queue_emulation.h. Net deletion of ~105 lines while adding file I/O support, range validation, inline dispatch, and fixing a deadlock. The driver is now a clear, readable expression of "wait, do the thing, signal" — which is all a synchronous driver should be.