AWoloszyn approved these changes Mar 18, 2026
The slist head pointer is now atomic, enabling a relaxed load before taking the mutex. Empty-list pops return immediately without any futex syscall. This eliminates the mutex acquisition in the common case where workers poll an empty list every pump iteration. The mutex is retained for push/concat/discard (ABA-safe for all users). A task-system-specific lock-free list can replace this for the hot paths where entries are pool-managed and ABA is not a concern. Co-Authored-By: Claude <noreply@anthropic.com>
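A minimal sketch of the fast path, using hypothetical node/list types in place of the real IREE slist:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical stand-ins for the IREE slist types.
typedef struct node_t {
  struct node_t* next;
} node_t;

typedef struct {
  _Atomic(node_t*) head;  // atomic so it can be probed outside the lock
  pthread_mutex_t mutex;  // still guards push/concat/discard (ABA-safe)
} slist_t;

void slist_push(slist_t* list, node_t* node) {
  pthread_mutex_lock(&list->mutex);
  node->next = atomic_load_explicit(&list->head, memory_order_relaxed);
  atomic_store_explicit(&list->head, node, memory_order_relaxed);
  pthread_mutex_unlock(&list->mutex);
}

// Pop with the fast-path empty check: a relaxed load of the head lets an
// empty-list pop return immediately, with no mutex and no futex syscall.
node_t* slist_pop(slist_t* list) {
  if (atomic_load_explicit(&list->head, memory_order_relaxed) == NULL) {
    return NULL;  // common case: a worker polling an empty list
  }
  pthread_mutex_lock(&list->mutex);
  node_t* head = atomic_load_explicit(&list->head, memory_order_relaxed);
  if (head) {
    atomic_store_explicit(&list->head, head->next, memory_order_relaxed);
  }
  pthread_mutex_unlock(&list->mutex);
  return head;
}
```

Note that a relaxed load racing a concurrent push may report empty; the worker simply observes the entry on its next pump iteration, which is why the relaxed ordering is acceptable here.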
Memory files wrapping embedded VMFB data (host allocations from FlatBuffer) were failing to import as HAL buffers because the data was 16-byte aligned but the heap allocator requires 64-byte alignment. The import failure left the memory file with no storage_buffer, causing queue.read to fail with "no storage buffer and no async handle" — breaking all VMFBs with embedded parameters on local-task. Fix: set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params. This bypasses the alignment check in heap_buffer_wrap, which is a performance preference, not a correctness requirement for host memory. The proper fix is to ensure VMFBs align their embedded data to 64 bytes at compile time. Co-Authored-By: Claude <noreply@anthropic.com>
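A hypothetical mock of the alignment check being bypassed — the real IREE_HAL_MEMORY_ACCESS_UNALIGNED bit has a different value and the real check lives in heap_buffer_wrap; the point is only that the 64-byte requirement is skipped when the bit is set:

```c
#include <stdint.h>

// Hypothetical mock flag and alignment constant (not the real IREE values).
#define MOCK_ACCESS_UNALIGNED (1u << 7)
#define MOCK_HEAP_ALIGNMENT 64u

// Returns 0 if |data| may be wrapped as a heap buffer with |access| bits.
// The alignment requirement is a performance preference for host memory,
// so the UNALIGNED access bit opts out of it.
int mock_heap_buffer_wrap_check(uintptr_t data, uint32_t access) {
  if (!(access & MOCK_ACCESS_UNALIGNED) &&
      (data % MOCK_HEAP_ALIGNMENT) != 0) {
    return -1;  // rejects e.g. 16-byte-aligned embedded FlatBuffer data
  }
  return 0;  // aligned, or the caller opted out of the alignment check
}
```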
Defense-in-depth: the synchronous platform functions (iree_hal_platform_fd_pread/pwrite) accepted iree_host_size_t counts but cast directly to DWORD (Windows) or passed unchecked to pread (POSIX). While current callers already cap at INT_MAX, the platform functions should not rely on caller discipline — a future caller passing >4GB would silently truncate on Windows or hit kernel EINVAL on some Linux versions. Co-Authored-By: Claude <noreply@anthropic.com>
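The cap itself can be sketched as a hypothetical helper: clamp the byte count to INT32_MAX so it fits a Windows DWORD and stays below the INT_MAX ceiling some Linux kernels enforce for pread/pwrite. Callers already loop on short reads/writes, so capping only splits one oversized call into two.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical defensive cap applied inside the platform pread/pwrite
// wrappers so they never rely on caller discipline for >4GB counts.
static inline size_t platform_io_cap_count(size_t count) {
  return count > (size_t)INT32_MAX ? (size_t)INT32_MAX : count;
}
```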
Previously, if iree_async_file_import failed (unsupported fd type, platform limitations, etc.), the entire iree_hal_fd_file_from_handle call failed — preventing the file from being used at all. This was overly conservative: async I/O is a performance optimization, not a correctness requirement. Synchronous pread/pwrite is always available as a fallback. Now the async import failure is silently absorbed: the file continues with async_file=NULL and all reads/writes go through the synchronous platform functions. Callers already check supports_synchronous_io() and async_handle() to select the I/O path. Co-Authored-By: Claude <noreply@anthropic.com>
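A sketch of the degradation path with hypothetical stand-in types (the import stub here always fails, to exercise the fallback):

```c
#include <stddef.h>

typedef struct {
  int fd;
  void* async_file;  // NULL => all I/O goes through sync pread/pwrite
} fd_file_t;

// Stand-in for iree_async_file_import; fails here to show the fallback.
static int fake_async_import(int fd, void** out_async_file) {
  (void)fd;
  *out_async_file = NULL;
  return -1;  // e.g. unsupported fd type or platform limitation
}

// File creation succeeds even when async import fails: async I/O is an
// optimization, and synchronous pread/pwrite is always available.
int fd_file_create(int fd, fd_file_t* out_file) {
  out_file->fd = fd;
  if (fake_async_import(fd, &out_file->async_file) != 0) {
    out_file->async_file = NULL;  // absorb the failure; stay sync-only
  }
  return 0;
}
```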
If iree_task_worker_initialize fails partway through (e.g., thread creation failure on worker 5), iree_task_executor_destroy loops over all worker_count workers. Workers beyond the failure point were never initialized — their memory is zero-filled from iree_allocator_malloc but iree_notification_initialize was never called on them. Calling iree_notification_deinitialize on a never-initialized notification invokes pthread_mutex_lock on a never-initialized mutex. On glibc this happens to work (PTHREAD_MUTEX_INITIALIZER is all-zeros) but it is undefined behavior per POSIX and breaks on other implementations. Add an early-out guard checking worker->executor, which is set at the start of iree_task_worker_initialize before notification init. For workers that never entered initialize, it is NULL from the zero-filled allocation. Also update the stale comment at the thread creation failure site to describe the actual safety mechanism. Co-Authored-By: Claude <noreply@anthropic.com>
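The guard pattern, sketched with a hypothetical worker struct where the executor pointer doubles as an "initialization entered" flag for zero-filled allocations:

```c
#include <stddef.h>

typedef struct {
  void* executor;         // NULL until initialize() runs
  int notification_live;  // stands in for iree_notification_t state
} worker_t;

void worker_initialize(worker_t* worker, void* executor) {
  worker->executor = executor;  // set before notification init
  worker->notification_live = 1;
}

// Deinitialize with the early-out guard: workers past the failure point
// are still all-zeros from the allocation and must not be touched.
void worker_deinitialize(worker_t* worker) {
  if (!worker->executor) return;  // never initialized: skip entirely
  worker->notification_live = 0;  // safe: notification was initialized
  worker->executor = NULL;
}
```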
…rflow). The sysfs topology code used fixed-size cpu_set_t (glibc CPU_SETSIZE=1024) to accumulate processor IDs from shared_cpu_list, then converted to group mask. On machines with >1024 logical CPUs (EPYC Bergamo, multi-socket with sparse APIC IDs), CPU_SET(cpu, ...) with cpu >= 1024 is an out-of-bounds write. Replace the intermediate cpu_set_t with a direct group mask build: the parse callback now scans topology groups for each CPU range in the list, setting bits for groups whose processor_index falls in the range. O(ranges_in_list x group_count) per group — both are small. The domain enumeration path (SCATTER distribution) retains its own cpu_set_t-based functions since it needs processor-level masks for domain grouping. Added a CPU_SETSIZE bounds check there as a safety measure — the broader domain enumeration cpu_set_t issue is tracked separately. Co-Authored-By: Claude <noreply@anthropic.com>
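The direct mask build can be sketched as follows, with a hypothetical group type and function names; the shape matches the described O(ranges_in_list x group_count) scan over a "shared_cpu_list" style range string:

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical topology group: |processor_index| is the group's logical
// processor ID as reported by sysfs.
typedef struct {
  uint32_t processor_index;
} topology_group_t;

// Builds a group bitmask directly from a shared_cpu_list range string
// ("0-3,96,2048") with no intermediate cpu_set_t and thus no 1024-CPU
// ceiling: processor IDs of any magnitude only ever compare against
// group indices, never index into a fixed-size set.
uint64_t build_group_mask(const char* cpu_list,
                          const topology_group_t* groups, int group_count) {
  uint64_t mask = 0;
  const char* p = cpu_list;
  while (*p) {
    unsigned long lo = 0, hi = 0;
    int consumed = 0;
    if (sscanf(p, "%lu-%lu%n", &lo, &hi, &consumed) == 2) {
      // parsed a "lo-hi" range
    } else if (sscanf(p, "%lu%n", &lo, &consumed) == 1) {
      hi = lo;  // single CPU entry
    } else {
      break;  // malformed tail; stop parsing
    }
    p += consumed;
    if (*p == ',') ++p;
    for (int i = 0; i < group_count; ++i) {
      if (groups[i].processor_index >= lo &&
          groups[i].processor_index <= hi) {
        mask |= 1ull << i;
      }
    }
  }
  return mask;
}
```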
Both acquire_device_signal and acquire_device_wait acquired device events from the pool before acquiring timepoints. If the timepoint acquisition failed (OOM), IREE_RETURN_AND_END_ZONE_IF_ERROR returned without releasing the already-acquired events back to the pool. Fix: catch the timepoint acquisition error, release all acquired events, then return the error. Co-Authored-By: Claude <noreply@anthropic.com>
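The cleanup-on-failure pattern, modeled with hypothetical counting pools standing in for the event and timepoint pools:

```c
typedef struct {
  int available;
} resource_pool_t;

int pool_acquire(resource_pool_t* pool, int count) {
  if (pool->available < count) return -1;  // e.g. OOM
  pool->available -= count;
  return 0;
}

void pool_release(resource_pool_t* pool, int count) {
  pool->available += count;
}

// Mirrors the fixed acquire_device_signal/wait flow: catch the timepoint
// acquisition error, release the already-acquired events, then return.
int acquire_device_timepoints(resource_pool_t* events,
                              resource_pool_t* timepoints, int count) {
  if (pool_acquire(events, count) != 0) return -1;
  if (pool_acquire(timepoints, count) != 0) {
    pool_release(events, count);  // the fix: no leak on the error path
    return -1;
  }
  return 0;
}
```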
a43c638 to 911f7b4
The cpu_set_t elimination commit referenced iree_task_affinity_set_empty() and iree_task_affinity_set_set_index() which are functions from the affinity widening commit (not yet landed). On main, the affinity set is still a plain uint64_t, so use literal 0 and bitwise-or respectively. Co-Authored-By: Claude <noreply@anthropic.com>
Miscellaneous fixes found while working on the task system rewrite. Each commit is independent and touches a different subsystem.
Atomic slist fast-path empty check
The worker pump loop polls empty slists on every iteration. Previously every poll took the mutex — even when the list was obviously empty. Added a relaxed atomic load of the head pointer as a fast-path: if NULL, return immediately without touching the mutex. Under contention benchmarks this scales perfectly from 1 to 128 threads (~1ns per pop regardless of thread count), whereas the mutex path degrades to ~20us at 128 threads.
Rewrote the slist benchmark as a comprehensive Google Benchmark suite covering single-threaded baselines, contention scaling, producer/consumer patterns, and the empty fast-path under load.
fd_file robustness
Three fixes to file I/O on the synchronous path:
Unaligned memory file import: Memory files wrapping embedded VMFB data (16-byte aligned from FlatBuffer) were rejected by the heap allocator which requires 64-byte alignment. Set IREE_HAL_MEMORY_ACCESS_UNALIGNED on the import params — alignment is a performance preference, not a correctness requirement for host memory.
Platform function size capping: The Windows ReadFile/WriteFile functions take DWORD (32-bit) count parameters, and POSIX pread/pwrite can fail with EINVAL for counts exceeding INT_MAX on some kernels. Added defensive INT32_MAX capping in the platform functions themselves rather than relying on caller discipline.
Sync-only degradation: When async file import fails (e.g., the platform doesn't support async fd import), the fd_file now silently degrades to synchronous-only mode instead of failing the entire file creation. Async I/O is a performance optimization, not a correctness requirement.
Task system hardening
Worker deinitialize guard: Workers that were never fully initialized (executor creation failed partway through) could crash during cleanup. Added a guard to skip deinitialization of never-initialized workers.
cpu_set_t overflow elimination: The sysfs topology code used
cpu_set_t(limited toCPU_SETSIZE=1024on glibc) to parse processor sharing masks. On machines with >1024 logical CPUs this silently overflowed. Replaced with direct parsing from the CPU list sysfs files into group masks with no processor ID limit.CUDA event leak
In the CUDA timepoint pool, acquire_device_signal and acquire_device_wait acquired device events before timepoints. If the timepoint acquisition failed (OOM), the already-acquired events were leaked. Fixed to release events on failure.