AWoloszyn approved these changes Mar 18, 2026
The slist head pointer is now atomic, enabling a relaxed load before taking the mutex. Empty-list pops return immediately without any futex syscall. This eliminates the mutex acquisition in the common case where workers poll an empty list every pump iteration. The mutex is retained for push/concat/discard (ABA-safe for all users). A task-system-specific lock-free list can replace this for the hot paths where entries are pool-managed and ABA is not a concern. Co-Authored-By: Claude <noreply@anthropic.com>
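A minimal sketch of the fast path, using hypothetical node/list types in place of the real IREE slist:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical stand-ins for the IREE slist types.
typedef struct node_t {
  struct node_t* next;
} node_t;

typedef struct {
  _Atomic(node_t*) head;  // atomic so it can be probed outside the lock
  pthread_mutex_t mutex;  // still guards push/concat/discard (ABA-safe)
} slist_t;

void slist_push(slist_t* list, node_t* node) {
  pthread_mutex_lock(&list->mutex);
  node->next = atomic_load_explicit(&list->head, memory_order_relaxed);
  atomic_store_explicit(&list->head, node, memory_order_relaxed);
  pthread_mutex_unlock(&list->mutex);
}

// Pop with the fast-path empty check: a relaxed load of the head lets an
// empty-list pop return immediately, with no mutex and no futex syscall.
node_t* slist_pop(slist_t* list) {
  if (atomic_load_explicit(&list->head, memory_order_relaxed) == NULL) {
    return NULL;  // common case: a worker polling an empty list
  }
  pthread_mutex_lock(&list->mutex);
  node_t* head = atomic_load_explicit(&list->head, memory_order_relaxed);
  if (head) {
    atomic_store_explicit(&list->head, head->next, memory_order_relaxed);
  }
  pthread_mutex_unlock(&list->mutex);
  return head;
}
```

Note that a relaxed load racing a concurrent push may report empty; the worker simply observes the entry on its next pump iteration, which is why the relaxed ordering is acceptable here.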
Memory files wrapping embedded VMFB data (host allocations from FlatBuffer) were failing to import as HAL buffers because the data was 16-byte aligned but the heap allocator requires 64-byte alignment. The import failure left the memory file with no storage_buffer, causing queue.read to fail with "no storage buffer and no async handle" — breaking all VMFBs with embedded parameters on local-task. Fix: set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params. This bypasses the alignment check in heap_buffer_wrap, which is a performance preference, not a correctness requirement for host memory. The proper fix is to ensure VMFBs align their embedded data to 64 bytes at compile time. Co-Authored-By: Claude <noreply@anthropic.com>
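A hypothetical mock of the alignment check being bypassed — the real IREE_HAL_MEMORY_ACCESS_UNALIGNED bit has a different value and the real check lives in heap_buffer_wrap; the point is only that the 64-byte requirement is skipped when the bit is set:

```c
#include <stdint.h>

// Hypothetical mock flag and alignment constant (not the real IREE values).
#define MOCK_ACCESS_UNALIGNED (1u << 7)
#define MOCK_HEAP_ALIGNMENT 64u

// Returns 0 if |data| may be wrapped as a heap buffer with |access| bits.
// The alignment requirement is a performance preference for host memory,
// so the UNALIGNED access bit opts out of it.
int mock_heap_buffer_wrap_check(uintptr_t data, uint32_t access) {
  if (!(access & MOCK_ACCESS_UNALIGNED) &&
      (data % MOCK_HEAP_ALIGNMENT) != 0) {
    return -1;  // rejects e.g. 16-byte-aligned embedded FlatBuffer data
  }
  return 0;  // aligned, or the caller opted out of the alignment check
}
```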
Defense-in-depth: the synchronous platform functions (iree_hal_platform_fd_pread/pwrite) accepted iree_host_size_t counts but cast directly to DWORD (Windows) or passed unchecked to pread (POSIX). While current callers already cap at INT_MAX, the platform functions should not rely on caller discipline — a future caller passing >4GB would silently truncate on Windows or hit kernel EINVAL on some Linux versions. Co-Authored-By: Claude <noreply@anthropic.com>
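The cap itself can be sketched as a hypothetical helper: clamp the byte count to INT32_MAX so it fits a Windows DWORD and stays below the INT_MAX ceiling some Linux kernels enforce for pread/pwrite. Callers already loop on short reads/writes, so capping only splits one oversized call into two.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical defensive cap applied inside the platform pread/pwrite
// wrappers so they never rely on caller discipline for >4GB counts.
static inline size_t platform_io_cap_count(size_t count) {
  return count > (size_t)INT32_MAX ? (size_t)INT32_MAX : count;
}
```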
Previously, if iree_async_file_import failed (unsupported fd type, platform limitations, etc.), the entire iree_hal_fd_file_from_handle call failed — preventing the file from being used at all. This was overly conservative: async I/O is a performance optimization, not a correctness requirement. Synchronous pread/pwrite is always available as a fallback. Now the async import failure is silently absorbed: the file continues with async_file=NULL and all reads/writes go through the synchronous platform functions. Callers already check supports_synchronous_io() and async_handle() to select the I/O path. Co-Authored-By: Claude <noreply@anthropic.com>
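A sketch of the degradation path with hypothetical stand-in types (the import stub here always fails, to exercise the fallback):

```c
#include <stddef.h>

typedef struct {
  int fd;
  void* async_file;  // NULL => all I/O goes through sync pread/pwrite
} fd_file_t;

// Stand-in for iree_async_file_import; fails here to show the fallback.
static int fake_async_import(int fd, void** out_async_file) {
  (void)fd;
  *out_async_file = NULL;
  return -1;  // e.g. unsupported fd type or platform limitation
}

// File creation succeeds even when async import fails: async I/O is an
// optimization, and synchronous pread/pwrite is always available.
int fd_file_create(int fd, fd_file_t* out_file) {
  out_file->fd = fd;
  if (fake_async_import(fd, &out_file->async_file) != 0) {
    out_file->async_file = NULL;  // absorb the failure; stay sync-only
  }
  return 0;
}
```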
If iree_task_worker_initialize fails partway through (e.g., thread creation failure on worker 5), iree_task_executor_destroy loops over all worker_count workers. Workers beyond the failure point were never initialized — their memory is zero-filled from iree_allocator_malloc but iree_notification_initialize was never called on them. Calling iree_notification_deinitialize on a never-initialized notification invokes pthread_mutex_lock on a never-initialized mutex. On glibc this happens to work (PTHREAD_MUTEX_INITIALIZER is all-zeros) but it is undefined behavior per POSIX and breaks on other implementations. Add an early-out guard checking worker->executor, which is set at the start of iree_task_worker_initialize before notification init. For workers that never entered initialize, it is NULL from the zero-filled allocation. Also update the stale comment at the thread creation failure site to describe the actual safety mechanism. Co-Authored-By: Claude <noreply@anthropic.com>
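The guard pattern, sketched with a hypothetical worker struct where the executor pointer doubles as an "initialization entered" flag for zero-filled allocations:

```c
#include <stddef.h>

typedef struct {
  void* executor;         // NULL until initialize() runs
  int notification_live;  // stands in for iree_notification_t state
} worker_t;

void worker_initialize(worker_t* worker, void* executor) {
  worker->executor = executor;  // set before notification init
  worker->notification_live = 1;
}

// Deinitialize with the early-out guard: workers past the failure point
// are still all-zeros from the allocation and must not be touched.
void worker_deinitialize(worker_t* worker) {
  if (!worker->executor) return;  // never initialized: skip entirely
  worker->notification_live = 0;  // safe: notification was initialized
  worker->executor = NULL;
}
```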
…rflow). The sysfs topology code used fixed-size cpu_set_t (glibc CPU_SETSIZE=1024) to accumulate processor IDs from shared_cpu_list, then converted to group mask. On machines with >1024 logical CPUs (EPYC Bergamo, multi-socket with sparse APIC IDs), CPU_SET(cpu, ...) with cpu >= 1024 is an out-of-bounds write. Replace the intermediate cpu_set_t with a direct group mask build: the parse callback now scans topology groups for each CPU range in the list, setting bits for groups whose processor_index falls in the range. O(ranges_in_list x group_count) per group — both are small. The domain enumeration path (SCATTER distribution) retains its own cpu_set_t-based functions since it needs processor-level masks for domain grouping. Added a CPU_SETSIZE bounds check there as a safety measure — the broader domain enumeration cpu_set_t issue is tracked separately. Co-Authored-By: Claude <noreply@anthropic.com>
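The direct mask build can be sketched as follows, with a hypothetical group type and function names; the shape matches the described O(ranges_in_list x group_count) scan over a "shared_cpu_list" style range string:

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical topology group: |processor_index| is the group's logical
// processor ID as reported by sysfs.
typedef struct {
  uint32_t processor_index;
} topology_group_t;

// Builds a group bitmask directly from a shared_cpu_list range string
// ("0-3,96,2048") with no intermediate cpu_set_t and thus no 1024-CPU
// ceiling: processor IDs of any magnitude only ever compare against
// group indices, never index into a fixed-size set.
uint64_t build_group_mask(const char* cpu_list,
                          const topology_group_t* groups, int group_count) {
  uint64_t mask = 0;
  const char* p = cpu_list;
  while (*p) {
    unsigned long lo = 0, hi = 0;
    int consumed = 0;
    if (sscanf(p, "%lu-%lu%n", &lo, &hi, &consumed) == 2) {
      // parsed a "lo-hi" range
    } else if (sscanf(p, "%lu%n", &lo, &consumed) == 1) {
      hi = lo;  // single CPU entry
    } else {
      break;  // malformed tail; stop parsing
    }
    p += consumed;
    if (*p == ',') ++p;
    for (int i = 0; i < group_count; ++i) {
      if (groups[i].processor_index >= lo &&
          groups[i].processor_index <= hi) {
        mask |= 1ull << i;
      }
    }
  }
  return mask;
}
```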
Both acquire_device_signal and acquire_device_wait acquired device events from the pool before acquiring timepoints. If the timepoint acquisition failed (OOM), IREE_RETURN_AND_END_ZONE_IF_ERROR returned without releasing the already-acquired events back to the pool. Fix: catch the timepoint acquisition error, release all acquired events, then return the error. Co-Authored-By: Claude <noreply@anthropic.com>
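The cleanup-on-failure pattern, modeled with hypothetical counting pools standing in for the event and timepoint pools:

```c
typedef struct {
  int available;
} resource_pool_t;

int pool_acquire(resource_pool_t* pool, int count) {
  if (pool->available < count) return -1;  // e.g. OOM
  pool->available -= count;
  return 0;
}

void pool_release(resource_pool_t* pool, int count) {
  pool->available += count;
}

// Mirrors the fixed acquire_device_signal/wait flow: catch the timepoint
// acquisition error, release the already-acquired events, then return.
int acquire_device_timepoints(resource_pool_t* events,
                              resource_pool_t* timepoints, int count) {
  if (pool_acquire(events, count) != 0) return -1;
  if (pool_acquire(timepoints, count) != 0) {
    pool_release(events, count);  // the fix: no leak on the error path
    return -1;
  }
  return 0;
}
```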
a43c638 to 911f7b4
The cpu_set_t elimination commit referenced iree_task_affinity_set_empty() and iree_task_affinity_set_set_index() which are functions from the affinity widening commit (not yet landed). On main, the affinity set is still a plain uint64_t, so use literal 0 and bitwise-or respectively. Co-Authored-By: Claude <noreply@anthropic.com>
Miscellaneous fixes found while working on the task system rewrite. Each commit is independent and touches a different subsystem.
Atomic slist fast-path empty check
The worker pump loop polls empty slists on every iteration. Previously every poll took the mutex — even when the list was obviously empty. Added a relaxed atomic load of the head pointer as a fast-path: if NULL, return immediately without touching the mutex. Under contention benchmarks this scales perfectly from 1 to 128 threads (~1ns per pop regardless of thread count), whereas the mutex path degrades to ~20us at 128 threads.
Rewrote the slist benchmark as a comprehensive Google Benchmark suite covering single-threaded baselines, contention scaling, producer/consumer patterns, and the empty fast-path under load.
fd_file robustness
Three fixes to file I/O on the synchronous path:
Unaligned memory file import: Memory files wrapping embedded VMFB data (16-byte aligned from FlatBuffer) were rejected by the heap allocator which requires 64-byte alignment. Set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params — alignment is a performance preference, not a correctness requirement for host memory.
Platform function size capping: The Windows ReadFile/WriteFile functions take DWORD (32-bit) count parameters, and POSIX pread/pwrite can fail with EINVAL for counts exceeding INT_MAX on some kernels. Added defensive INT32_MAX capping in the platform functions themselves rather than relying on caller discipline.
Sync-only degradation: When async file import fails (e.g., the platform doesn't support async fd import), the fd_file now silently degrades to synchronous-only mode instead of failing the entire file creation. Async I/O is a performance optimization, not a correctness requirement.
Task system hardening
Worker deinitialize guard: Workers that were never fully initialized (executor creation failed partway through) could crash during cleanup. Added a guard to skip deinitialization of never-initialized workers.
cpu_set_t overflow elimination: The sysfs topology code used
cpu_set_t(limited toCPU_SETSIZE=1024on glibc) to parse processor sharing masks. On machines with >1024 logical CPUs this silently overflowed. Replaced with direct parsing from the CPU list sysfs files into group masks with no processor ID limit.CUDA event leak
In the CUDA timepoint pool, acquire_device_signal and acquire_device_wait acquired device events before timepoints. If the timepoint acquisition failed (OOM), the already-acquired events were leaked. Fixed to release events on failure.