Skip to content

Misc runtime fixes and optimizations.#23825

Draft
benvanik wants to merge 8 commits intomainfrom
users/benvanik/cpu-0
Draft

Misc runtime fixes and optimizations.#23825
benvanik wants to merge 8 commits intomainfrom
users/benvanik/cpu-0

Conversation

@benvanik
Copy link
Copy Markdown
Collaborator

Miscellaneous fixes found while working on the task system rewrite. Each commit is independent and touches a different subsystem.

Atomic slist fast-path empty check

The worker pump loop polls empty slists on every iteration. Previously every poll took the mutex — even when the list was obviously empty. Added a relaxed atomic load of the head pointer as a fast-path: if NULL, return immediately without touching the mutex. Under contention benchmarks this scales perfectly from 1 to 128 threads (~1ns per pop regardless of thread count), whereas the mutex path degrades to ~20us at 128 threads.

Rewrote the slist benchmark as a comprehensive Google Benchmark suite covering single-threaded baselines, contention scaling, producer/consumer patterns, and the empty fast-path under load.

fd_file robustness

Three fixes to file I/O on the synchronous path:

  • Unaligned memory file import: Memory files wrapping embedded VMFB data (16-byte aligned from FlatBuffer) were rejected by the heap allocator which requires 64-byte alignment. Set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params — alignment is a performance preference, not a correctness requirement for host memory.

  • Platform function size capping: The Windows ReadFile/WriteFile functions take DWORD (32-bit) count parameters, and POSIX pread/pwrite can return -EINVAL for counts exceeding INT_MAX on some kernels. Added defensive INT32_MAX capping in the platform functions themselves rather than relying on caller discipline.

  • Sync-only degradation: When async file import fails (e.g., platform doesn't support async fd import), the fd_file now silently degrades to synchronous-only mode instead of failing the entire file creation. Async I/O is a performance optimization, not a correctness requirement.

Task system hardening

  • Worker deinitialize guard: Workers that were never fully initialized (executor creation failed partway through) could crash during cleanup. Added a guard to skip deinitialization of never-initialized workers.

  • cpu_set_t overflow elimination: The sysfs topology code used cpu_set_t (limited to CPU_SETSIZE=1024 on glibc) to parse processor sharing masks. On machines with >1024 logical CPUs this silently overflowed. Replaced with direct parsing from the CPU list sysfs files into group masks with no processor ID limit.

CUDA event leak

In the CUDA timepoint pool, acquire_device_signal and acquire_device_wait acquired device events before timepoints. If the timepoint acquisition failed (OOM), the already-acquired events were leaked. Fixed to release events on failure.

@benvanik benvanik added the runtime Relating to the IREE runtime library label Mar 18, 2026
benvanik and others added 7 commits March 18, 2026 20:31
The slist head pointer is now atomic, enabling a relaxed load before
taking the mutex. Empty-list pops return immediately without any
futex syscall. This eliminates the common case where workers poll an
empty list every pump iteration.

The mutex is retained for push/concat/discard (ABA-safe for all users).
A task-system-specific lock-free list can replace this for the hot
paths where entries are pool-managed and ABA is not a concern.

Co-Authored-By: Claude <noreply@anthropic.com>
Memory files wrapping embedded VMFB data (host allocations from
FlatBuffer) were failing to import as HAL buffers because the data
was 16-byte aligned but the heap allocator requires 64-byte alignment.
The import failure left the memory file with no storage_buffer,
causing queue.read to fail with "no storage buffer and no async
handle" — breaking all VMFBs with embedded parameters on local-task.

Fix: set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params.
This bypasses the alignment check in heap_buffer_wrap, which is a
performance preference, not a correctness requirement for host memory.
The proper fix is to ensure VMFBs align their embedded data to 64
bytes at compile time.

Co-Authored-By: Claude <noreply@anthropic.com>
Defense-in-depth: the synchronous platform functions
(iree_hal_platform_fd_pread/pwrite) accepted iree_host_size_t counts
but cast directly to DWORD (Windows) or passed unchecked to pread
(POSIX). While current callers already cap at INT_MAX, the platform
functions should not rely on caller discipline — a future caller
passing >4GB would silently truncate on Windows or hit kernel EINVAL
on some Linux versions.

Co-Authored-By: Claude <noreply@anthropic.com>
Previously, if iree_async_file_import failed (unsupported fd type,
platform limitations, etc.), the entire iree_hal_fd_file_from_handle
call failed — preventing the file from being used at all. This was
overly conservative: async I/O is a performance optimization, not a
correctness requirement. Synchronous pread/pwrite is always available
as a fallback.

Now the async import failure is silently absorbed: the file continues
with async_file=NULL and all reads/writes go through the synchronous
platform functions. Callers already check supports_synchronous_io()
and async_handle() to select the I/O path.

Co-Authored-By: Claude <noreply@anthropic.com>
If iree_task_worker_initialize fails partway through (e.g., thread
creation failure on worker 5), iree_task_executor_destroy loops over
all worker_count workers. Workers beyond the failure point were never
initialized — their memory is zero-filled from iree_allocator_malloc
but iree_notification_initialize was never called on them.

Calling iree_notification_deinitialize on a never-initialized
notification invokes pthread_mutex_lock on a never-initialized mutex.
On glibc this happens to work (PTHREAD_MUTEX_INITIALIZER is all-zeros)
but it is undefined behavior per POSIX and breaks on other
implementations.

Add an early-out guard checking worker->executor, which is set at the
start of iree_task_worker_initialize before notification init. For
workers that never entered initialize, it is NULL from the zero-filled
allocation. Also update the stale comment at the thread creation
failure site to describe the actual safety mechanism.

Co-Authored-By: Claude <noreply@anthropic.com>
…rflow).

The sysfs topology code used fixed-size cpu_set_t (glibc CPU_SETSIZE=1024)
to accumulate processor IDs from shared_cpu_list, then converted to group
mask. On machines with >1024 logical CPUs (EPYC Bergamo, multi-socket with
sparse APIC IDs), CPU_SET(cpu, ...) with cpu >= 1024 is an out-of-bounds
write.

Replace the intermediate cpu_set_t with a direct group mask build:
the parse callback now scans topology groups for each CPU range in the
list, setting bits for groups whose processor_index falls in the range.
O(ranges_in_list x group_count) per group — both are small.

The domain enumeration path (SCATTER distribution) retains its own
cpu_set_t-based functions since it needs processor-level masks for
domain grouping. Added a CPU_SETSIZE bounds check there as a safety
measure — the broader domain enumeration cpu_set_t issue is tracked
separately.

Co-Authored-By: Claude <noreply@anthropic.com>
Both acquire_device_signal and acquire_device_wait acquired device
events from the pool before acquiring timepoints. If the timepoint
acquisition failed (OOM), IREE_RETURN_AND_END_ZONE_IF_ERROR returned
without releasing the already-acquired events back to the pool.

Fix: catch the timepoint acquisition error, release all acquired
events, then return the error.

Co-Authored-By: Claude <noreply@anthropic.com>
@benvanik benvanik force-pushed the users/benvanik/cpu-0 branch from a43c638 to 911f7b4 Compare March 19, 2026 03:37
The cpu_set_t elimination commit referenced iree_task_affinity_set_empty()
and iree_task_affinity_set_set_index() which are functions from the
affinity widening commit (not yet landed). On main, the affinity set is
still a plain uint64_t, so use literal 0 and bitwise-or respectively.

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

runtime Relating to the IREE runtime library

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants