
Rewrite local_sync queue operations as direct inline implementations.#23826

Draft
benvanik wants to merge 11 commits into users/benvanik/cpu-0 from users/benvanik/cpu-1

Conversation

@benvanik
Collaborator

The local_sync driver is the synchronous, single-threaded HAL backend — every queue operation blocks the calling thread until completion. Despite this, it was routing operations through heavyweight abstractions designed for async backends:

  • File I/O went through the streaming file transfer machinery (file_transfer.h): staging buffer allocation, transfer worker management, chunked semaphore chains. This caused FdFileReadRangeValidation to deadlock because the error path got tangled in semaphore signaling.

  • Fill, update, and copy went through queue emulation (queue_emulation.h): each operation created a one-shot command buffer, recorded a single command, called queue_execute to replay it, then destroyed the command buffer.

  • Dispatch similarly created a throwaway command buffer for each dispatch call.

All of this is unnecessary overhead for a driver that executes everything inline. The fix: replace every queue operation with the same three-line pattern:

IREE_RETURN_IF_ERROR(queue_op_begin(device, wait_semaphores));
iree_status_t status = do_the_thing();
return queue_op_end(device, signal_semaphores, status);

queue_op_begin waits on semaphores. queue_op_end signals on success, fails on error, and advances the frontier tracker. Both are static inline helpers — the compiler eliminates them entirely.

What changed

  • queue_read/write: Direct iree_hal_file_read/write calls with range validation (fixes the deadlock). Memory files still fast-path through queue_copy via the storage buffer.

  • queue_fill/update/copy: Direct iree_hal_buffer_map_fill, map_write, and map-source + map_write respectively. No command buffers.

  • queue_dispatch: New shared utility iree_hal_local_executable_dispatch_inline in hal/local/ that maps bindings, populates the dispatch state, and iterates all workgroups. ~75 lines that replace the command buffer round-trip. Available for local_task's inline dispatch path as a future simplification.

  • begin/end factoring: The 15-line wait/signal/fail/frontier pattern that was copy-pasted across 9 queue operations is now two inline helpers. The simplest operations (dealloca) are three lines of body.

Result

local_sync has zero dependency on file_transfer.h and queue_emulation.h. Net deletion of ~105 lines while adding file I/O support, range validation, inline dispatch, and fixing a deadlock. The driver is now a clear, readable expression of "wait, do the thing, signal" — which is all a synchronous driver should be.

benvanik and others added 11 commits March 18, 2026 01:18
The slist head pointer is now atomic, enabling a relaxed load before
taking the mutex. Empty-list pops return immediately without any
futex syscall. This eliminates the common case where workers poll an
empty list every pump iteration.

The mutex is retained for push/concat/discard (ABA-safe for all users).
A task-system-specific lock-free list can replace this for the hot
paths where entries are pool-managed and ABA is not a concern.

Co-Authored-By: Claude <noreply@anthropic.com>
Memory files wrapping embedded VMFB data (host allocations from
FlatBuffer) were failing to import as HAL buffers because the data
was 16-byte aligned but the heap allocator requires 64-byte alignment.
The import failure left the memory file with no storage_buffer,
causing queue.read to fail with "no storage buffer and no async
handle" — breaking all VMFBs with embedded parameters on local-task.

Fix: set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params.
This bypasses the alignment check in heap_buffer_wrap, which is a
performance preference, not a correctness requirement for host memory.
The proper fix is to ensure VMFBs align their embedded data to 64
bytes at compile time.

Co-Authored-By: Claude <noreply@anthropic.com>
Defense-in-depth: the synchronous platform functions
(iree_hal_platform_fd_pread/pwrite) accepted iree_host_size_t counts
but cast directly to DWORD (Windows) or passed unchecked to pread
(POSIX). While current callers already cap at INT_MAX, the platform
functions should not rely on caller discipline — a future caller
passing >4GB would silently truncate on Windows or hit kernel EINVAL
on some Linux versions.

Co-Authored-By: Claude <noreply@anthropic.com>
Previously, if iree_async_file_import failed (unsupported fd type,
platform limitations, etc.), the entire iree_hal_fd_file_from_handle
call failed — preventing the file from being used at all. This was
overly conservative: async I/O is a performance optimization, not a
correctness requirement. Synchronous pread/pwrite is always available
as a fallback.

Now the async import failure is silently absorbed: the file continues
with async_file=NULL and all reads/writes go through the synchronous
platform functions. Callers already check supports_synchronous_io()
and async_handle() to select the I/O path.

Co-Authored-By: Claude <noreply@anthropic.com>
If iree_task_worker_initialize fails partway through (e.g., thread
creation failure on worker 5), iree_task_executor_destroy loops over
all worker_count workers. Workers beyond the failure point were never
initialized — their memory is zero-filled from iree_allocator_malloc
but iree_notification_initialize was never called on them.

Calling iree_notification_deinitialize on a never-initialized
notification invokes pthread_mutex_lock on a never-initialized mutex.
On glibc this happens to work (PTHREAD_MUTEX_INITIALIZER is all-zeros)
but it is undefined behavior per POSIX and breaks on other
implementations.

Add an early-out guard checking worker->executor, which is set at the
start of iree_task_worker_initialize before notification init. For
workers that never entered initialize, it is NULL from the zero-filled
allocation. Also update the stale comment at the thread creation
failure site to describe the actual safety mechanism.

Co-Authored-By: Claude <noreply@anthropic.com>
…rflow).

The sysfs topology code used fixed-size cpu_set_t (glibc CPU_SETSIZE=1024)
to accumulate processor IDs from shared_cpu_list, then converted to group
mask. On machines with >1024 logical CPUs (EPYC Bergamo, multi-socket with
sparse APIC IDs), CPU_SET(cpu, ...) with cpu >= 1024 is an out-of-bounds
write.

Replace the intermediate cpu_set_t with a direct group mask build:
the parse callback now scans topology groups for each CPU range in the
list, setting bits for groups whose processor_index falls in the range.
O(ranges_in_list x group_count) per group — both are small.

The domain enumeration path (SCATTER distribution) retains its own
cpu_set_t-based functions since it needs processor-level masks for
domain grouping. Added a CPU_SETSIZE bounds check there as a safety
measure — the broader domain enumeration cpu_set_t issue is tracked
separately.

Co-Authored-By: Claude <noreply@anthropic.com>
Both acquire_device_signal and acquire_device_wait acquired device
events from the pool before acquiring timepoints. If the timepoint
acquisition failed (OOM), IREE_RETURN_AND_END_ZONE_IF_ERROR returned
without releasing the already-acquired events back to the pool.

Fix: catch the timepoint acquisition error, release all acquired
events, then return the error.

Co-Authored-By: Claude <noreply@anthropic.com>
The sync driver was routing file reads/writes through the streaming
file transfer machinery (file_transfer.h), which allocates staging
buffers, manages transfer workers, and chains semaphore operations.
This was wildly over-engineered for a synchronous driver that
executes everything inline on the calling thread.

Replaced with the same wait-do-signal pattern used by queue_alloca:
wait on input semaphores, call iree_hal_file_read/write directly,
signal output semaphores. Memory files still take the fast path
through queue_copy via the storage buffer.

This fixes the FdFileReadRangeValidation deadlock (bd-12x): range
validation now happens before entering any semaphore machinery,
returning OUT_OF_RANGE immediately for reads past EOF. Removed
the xfail entry and the file_transfer.h dependency.

bd-3kg (AllocaWithWaitSemaphores) remains a valid xfail: the test
spawns the signaling thread after queue_alloca, which blocks
synchronously on local_sync.

Co-Authored-By: Claude <noreply@anthropic.com>
…al_sync.

The sync driver was wrapping fill, update, and copy operations in
one-shot command buffers via queue_emulation, then executing them
through queue_execute. This creates and destroys a command buffer
for each operation — unnecessary overhead for a synchronous driver.

Replaced with direct inline implementations:
- fill: iree_hal_buffer_map_fill (map, memset pattern, unmap)
- update: iree_hal_buffer_map_write (map target, memcpy, unmap)
- copy: map source, iree_hal_buffer_map_write to target, unmap

All follow the same wait-do-signal pattern as queue_alloca and the
queue_read/write rewrite from the previous commit. Dispatch remains
emulated (it genuinely needs command buffer infrastructure).

Co-Authored-By: Claude <noreply@anthropic.com>
Extracts the core single-threaded dispatch logic into a shared utility
in hal/local/inline_dispatch.h: map bindings, populate dispatch state,
iterate all workgroups via issue_dispatch_inline, unmap. ~75 lines of
straightforward code that replaces the command buffer round-trip.

local_sync now uses this directly for queue_dispatch instead of the
emulated path (which created a command buffer, recorded one dispatch,
then called queue_execute to replay it). This eliminates all local_sync
dependencies on both queue_emulation and file_transfer — every queue
operation is now a direct inline implementation.

The inline_dispatch utility is also available for local_task's inline
dispatch path (ALLOW_INLINE_EXECUTION) as a future simplification.

Co-Authored-By: Claude <noreply@anthropic.com>
Every queue operation (alloca, dealloca, fill, update, copy, read,
write, dispatch, execute) followed the same 15-line pattern: wait on
semaphores, do one thing, signal-or-fail semaphores, advance frontier.
The only variation was the 1-5 lines in the middle.

Extracted queue_op_begin (semaphore wait) and queue_op_end (signal/fail
+ frontier advance) as static inline helpers. Each queue operation is
now: IREE_RETURN_IF_ERROR(begin), do the thing, tail-call end. The
simplest ops (dealloca) are three lines of body.

Co-Authored-By: Claude <noreply@anthropic.com>
@benvanik benvanik added the hal/cpu Runtime Host/CPU-based HAL backend label Mar 18, 2026
@benvanik benvanik changed the base branch from main to users/benvanik/cpu-0 March 18, 2026 08:34
@benvanik benvanik force-pushed the users/benvanik/cpu-0 branch from a43c638 to 911f7b4 Compare March 19, 2026 03:37