Triple buffering: Bind Virtual Resource Budget to Physical Memory Allocation [databricks] #13972
base: main
Conversation
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Force-pushed from 3d40aa4 to b2278cc
Greptile Summary
This PR implements tight coupling between virtual memory budgets and physical memory allocation.
Key Changes
Issues Found (critical): a race condition between calculating the borrow amount and using it, and gaps in the deadlock-detection logic (detailed in the comments below).
Confidence Score: 2/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Executor as ResourceBoundedThreadExecutor
participant Pool as HostMemoryPool
participant Runner as MemoryBoundedAsyncRunner
participant BaseAlloc as Base HostMemoryAllocator
participant Handler as OnCloseHandler
Note over Executor,Pool: Acquisition Phase
Executor->>Pool: acquire(runner, timeout)
Pool->>Pool: Check remaining >= requiredMemory
alt Sufficient memory
Pool->>Pool: remaining -= requiredMemory
Pool->>Runner: onStart(pool)
Runner->>Runner: localPool = requiredMemory<br/>usedMem = 0
Pool-->>Executor: AcquireSuccessful
else Insufficient but numInFlight==0
Pool->>Pool: Over-commit (remaining < 0)
Pool->>Runner: onStart(pool)
Pool-->>Executor: AcquireSuccessful (deadlock prevention)
end
Note over Runner,Handler: Execution Phase - Allocation
Runner->>Runner: allocate(size)
Runner->>Runner: usedMem.updateAndGet(curUsed + size)
alt Local pool sufficient
Runner->>BaseAlloc: allocate(size)
BaseAlloc-->>Runner: HostMemoryBuffer
Runner->>Handler: Register OnCloseHandler
else Local pool insufficient
Runner->>Pool: borrowMemory(deficit)
Pool->>Pool: Block until available or deadlock prevention
Pool->>Pool: remaining -= deficit
Runner->>Runner: localPool += deficit
Runner->>BaseAlloc: allocate(size)
BaseAlloc-->>Runner: HostMemoryBuffer
Runner->>Handler: Register OnCloseHandler
end
Note over Handler,Pool: Buffer Close - Event Driven
Handler->>Handler: onClosed(refCount==0)
Handler->>Runner: withStateLock()
Handler->>Runner: usedMem -= bufferSize
Handler->>Runner: bufCloseCond.signal()
alt Runner not Running && !closeStarted
Handler->>Pool: release(runner, forcefully=false)
Pool->>Runner: tryFree(byForce=false)
Runner->>Runner: freed = localPool - usedMem
Runner->>Runner: localPool -= freed
Pool->>Pool: remaining += freed
Pool->>Pool: Wake borrowers/acquirers
alt localPool == 0
Pool->>Runner: closeRunner()
Runner->>Runner: onClose() - wait for usedMem==0
Runner->>Runner: setState(Closed)
end
end
Note over Runner,Pool: Force Close on Completion
Executor->>Runner: close()
Runner->>Pool: finishUpRunner()
Pool->>Pool: release(runner, forcefully=true)
Pool->>Runner: tryFree(byForce=true)
Runner->>Runner: freed = localPool
Pool->>Pool: remaining += freed
Pool->>Runner: closeRunner()
Runner->>Runner: onClose() - wait for usedMem==0
Runner->>Runner: setState(Closed)
Additional Comments (4)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 454-456 (link) - logic: allocate requires holding the state lock, but it is called from file reader threads that may not hold it
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 507-511 (link) - logic: potential deadlock: OnCloseHandler.onClosed tries to acquire the state lock while already holding bufCloseLock, but onClose() holds the state lock and then acquires bufCloseLock (line 415) - a classic lock ordering violation
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/ResourcePools.scala, lines 265-268 (link) - logic: the deadlock prevention logic uses >= but initializes numRunnerInFlight to 0 - when all runners are borrowing (numOfBorrowWaiters + 1 == numRunnerInFlight.get()), this would allow infinite over-commit
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 458-463 (link) - logic: race condition: memToBrw is calculated inside the updateAndGet closure but used outside - another thread could modify localPool between lines 463 and 470, making memToBrw stale
9 files reviewed, 4 comments
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Additional Comments (5)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 451-482 (link) - logic: Potential deadlock: allocate requires holding stateLock and calls borrowMemory, which acquires poolLock. But OnCloseHandler.onClosed (line 496) calls withStateLock() while potentially being called from a thread already holding poolLock (during release at lines 220-228). This creates a lock ordering issue: Thread A (poolLock → stateLock) vs Thread B (stateLock → poolLock). The issue: when release returns freed memory and signals waiting threads (lines 220-228 under poolLock), those awakened threads might immediately close buffers, triggering OnCloseHandler.onClosed callbacks which need stateLock. Meanwhile, a running task holding stateLock in allocate may try to acquire poolLock via borrowMemory. Suggested fix: consider using tryLock with a timeout in OnCloseHandler, or restructure locking so borrowMemory doesn't require holding stateLock.
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 413-425 (link) - logic: Blocking onClose() with an unbounded wait could hang indefinitely if buffers are never closed (e.g., leaked references). While the 30s timeout logging helps debugging, the loop never gives up. Consider adding a maximum total wait time or a force-close mechanism after a reasonable timeout to prevent indefinite hangs.
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/ResourcePools.scala, lines 260-295 (link) - logic: Edge case in deadlock prevention: if numOfBorrowWaiters + 1 >= numRunnerInFlight.get() (line 266), the borrower proceeds immediately even if remaining < 0. However, numRunnerInFlight is decremented in the post-hook (lines 154-163), which runs AFTER the runner completes. If all runners are blocked in borrowMemory waiting for memory but haven't decremented numRunnerInFlight yet, the condition may never trigger correctly. The timing window where runners finish execution but haven't yet run their post-hooks could cause false negatives for deadlock detection.
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 456-471 (link) - logic: Race condition in memory accounting: usedMem.updateAndGet (line 457) updates atomically, but the subsequent borrowMemory call (line 468) and second usedMem.addAndGet (line 470) are separate operations. If another thread closes a buffer between lines 462-468, the OnCloseHandler will decrement usedMem and potentially call tryFree/release, which could see an inconsistent state where memory was borrowed but not yet added to usedMem. This could lead to over-release of memory back to the pool.
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/ResourcePools.scala, lines 234-237 (link) - logic: Auto-close logic race: checking remain == 0L and closeStarted.compareAndSet(false, true) (line 235) outside the runner's state lock creates a race window. If two threads call release concurrently (one from OnCloseHandler, one from an explicit release), both might see remain == 0 and one wins the CAS, but the other continues with invalid assumptions. Should verify remain == 0 is checked under lock or document thread-safety assumptions more clearly.
9 files reviewed, 5 comments
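For readers following this thread, here is a minimal, self-contained sketch of the over-commit fallback being debated. This is not the PR's HostMemoryPool; the names (remaining, numOfBorrowWaiters, numRunnerInFlight, borrowMemory) only mirror the review's description, and the structure is an assumption for illustration:
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.locks.ReentrantLock

// Sketch of the deadlock-prevention idea discussed above (hypothetical class).
class SketchPool(capacity: Long) {
  private val poolLock = new ReentrantLock()
  private val borrowCondition = poolLock.newCondition()
  private var remaining: Long = capacity
  private var numOfBorrowWaiters: Int = 0
  val numRunnerInFlight = new AtomicInteger(0)

  // Borrow `amount`: block while it does not fit, unless every in-flight
  // runner would be waiting, in which case over-commit to guarantee progress.
  def borrowMemory(amount: Long): Unit = {
    poolLock.lock()
    try {
      while (remaining < amount &&
          numOfBorrowWaiters + 1 < numRunnerInFlight.get()) {
        numOfBorrowWaiters += 1
        try { borrowCondition.await() } finally { numOfBorrowWaiters -= 1 }
      }
      remaining -= amount // may go negative: the over-commit case
    } finally {
      poolLock.unlock()
    }
  }

  // Return `amount` to the pool and wake any blocked borrowers.
  def returnMemory(amount: Long): Unit = {
    poolLock.lock()
    try {
      remaining += amount
      borrowCondition.signalAll()
    } finally {
      poolLock.unlock()
    }
  }
}
The staleness concern raised above maps onto numRunnerInFlight here: if it is decremented outside the pool lock, the escape condition in the while loop can be evaluated against an outdated count.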
build
1 similar comment
build
@greptileai As for the first point, I don't think it will lead to a deadlock issue, given that the lock ordering is controlled: while the runner is in the Running state it holds its stateLock from beginning to end, so the poolLock → stateLock ordering cannot occur.
You make a good point about the lock ordering analysis. Let me examine this more carefully.
You're correct that during the RUNNING state, the runner should hold its stateLock from beginning to end, which would prevent the poolLock → stateLock ordering scenario I described.
However, looking at the code more closely, I see a potential issue with this assumption. In ResourceBoundedThreadExecutor.afterExecute() (lines 265-278), the stateLock is released with releaseAnyway = true, but then mgr.release() is called which can trigger the OnCloseHandler callbacks. Here's the sequence:
1. afterExecute() calls runner.withStateLock(releaseAnyway = true)
2. This releases the stateLock at the end
3. Then mgr.release(rr, forcefully = false) is called
4. This can trigger buffer closures and OnCloseHandler.onClosed() callbacks
5. Those callbacks try to acquire stateLock again via withStateLock()
So the runner might not actually hold stateLock continuously through the entire lifecycle as expected.
But you're right that the specific deadlock scenario I described (poolLock → stateLock vs stateLock → poolLock) seems less likely given the current control flow.
The more realistic concern might be:
- Thread A: In OnCloseHandler.onClosed(), holds stateLock and calls poolPtr.release(), which needs poolLock
- Thread B: In HostMemoryPool.release(), holds poolLock, and the release triggers other buffer closures that need stateLock
But even this seems unlikely given the current implementation structure.
You've convinced me that my initial deadlock concern was overstated. The lock ordering appears to be more controlled than I initially assessed. Thank you for the correction!
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Additional Comments (2)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 482-489 (link) - logic: Potential deadlock: allocate() holds stateLock and calls borrowMemory(), which acquires poolLock, while OnCloseHandler.onClosed() may be triggered from threads holding poolLock and needs stateLock. This creates a lock ordering conflict that could cause deadlocks. Have you verified the lock ordering is consistent between the allocation path and the buffer close event path to prevent deadlocks?
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 475-489 (link) - logic: Race condition in memory accounting: the gap between usedMem.updateAndGet(), borrowMemory(), and the second usedMem.addAndGet() can lead to inconsistent state if concurrent buffer closes occur between these operations.
11 files reviewed, 2 comments
build
build
The lock ordering conflict you pointed out should not happen, because borrowing only occurs while the runner is in the Running state, whereas OnCloseHandler only calls poolPtr.release after checking that the runner is no longer Running.
You're correct that the specific scenario you described (borrow during Running + OnCloseHandler calling poolPtr.release) won't happen because of the state check. However, there's still a potential deadlock risk in the current design.
The issue is that OnCloseHandler.onClosed() always acquires the stateLock first (line 420 in r.withStateLock[Unit]() { _ =>), regardless of the runner state. Meanwhile, allocate() holds the stateLock and then calls borrowMemory() which acquires the poolLock.
Here's the potential deadlock scenario:
Thread A (allocation path):
- Holds stateLock (in allocate())
- Calls borrowMemory() → tries to acquire poolLock
Thread B (buffer close path):
- Holds poolLock (from some pool operation)
- Buffer gets closed → OnCloseHandler.onClosed() → tries to acquire stateLock
This creates a classic lock ordering deadlock: A holds stateLock→wants poolLock, B holds poolLock→wants stateLock.
Suggested fix:
Consider restructuring the lock acquisition order to be consistent. For example:
override def allocate(size: Long, preferPinned: Boolean): HostMemoryBuffer = {
  require(getState == Running, s"Memory allocation is only allowed in Running state: $this")
  require(isHoldingStateLock, s"The caller must hold the state lock: $this")
  // Check and update the used memory atomically
  var memToBrw = 0L
  var newUsed = usedMem.updateAndGet { curUsed: Long =>
    val newUsed = curUsed + size
    memToBrw = newUsed - localPool
    newUsed min localPool
  }
  // If the local pool is insufficient, try to borrow from the global pool
  // Release stateLock before acquiring poolLock to maintain consistent lock ordering
  var borrowedAmount = 0L
  if (memToBrw > 0) {
    logWarning(
      s"[runnerID=$runnerId] LocalMemPool ${bToStr(localPool)}(used ${bToStr(newUsed)}) " +
        s"is NOT enough for the ALLOC(${bToStr(size)}): try to borrow ${bToStr(memToBrw)}")
    // Temporarily release state lock to avoid deadlock
    stateLock.unlock()
    try {
      poolPtr.asInstanceOf[HostMemoryPool].borrowMemory(memToBrw)
      borrowedAmount = memToBrw
    } finally {
      stateLock.lock()
      // Re-verify state after reacquiring lock
      require(getState == Running, s"Runner state changed during borrow: $this")
    }
    localPool += borrowedAmount
    newUsed = usedMem.addAndGet(borrowedAmount)
  }
  if (newUsed > peakUsedMem) {
    peakUsedMem = newUsed
  }
  // Call the base allocator to allocate the actual buffer
  val buf = withRetryNoSplit[HostMemoryBuffer] {
    baseMemoryAllocator.allocate(size, preferPinned)
  }
  // Register a close handler to return the memory back either to the local or global pool
  HostAlloc.addEventHandler(buf, new OnCloseHandler(size, this))
  buf
}
This ensures consistent lock ordering: always acquire poolLock before stateLock when both are needed.
@greptileai I do not think Thread B (buffer close path) will hold poolLock under any circumstance.
You're absolutely right! Looking at the code more carefully, I can see that Thread B (buffer close path) would not hold the poolLock when OnCloseHandler.onClosed() is called.
The OnCloseHandler.onClosed() method is triggered by the buffer's reference counting mechanism when refCount reaches 0, which happens when the buffer is closed/released. This is not called from within any pool operation that would already be holding the poolLock.
The potential deadlock scenario I described is not valid because:
- Thread B starts from buffer deallocation (when refCount hits 0)
- It only tries to acquire the stateLock in OnCloseHandler.onClosed()
- It does not start with the poolLock already held
The lock ordering issue doesn't exist here because there's no case where Thread B would be holding poolLock and then trying to acquire stateLock.
Thank you for the correction! The current implementation is actually safe from this particular deadlock scenario. The OnCloseHandler is triggered purely by buffer lifecycle events (refcount reaching 0), not by pool operations that would already be holding locks.
The design is sound - when buffers are closed, they simply try to acquire the runner's stateLock to update memory accounting, and there's no cross-dependencies with the pool lock in this path.
@greptileai Can you elaborate the point a little bit?
Looking at lines 475-489 in the PR code, I can see the race condition issue in the memory allocation logic. Let me break down what's happening:
// Check and update the used memory atomically
var memToBrw = 0L
var newUsed = usedMem.updateAndGet { curUsed: Long =>
  val newUsed = curUsed + size
  memToBrw = newUsed - localPool
  newUsed min localPool
}
// If the local pool is insufficient, try to borrow from the global pool
if (memToBrw > 0) {
  // ... logging ...
  // Blocking call to borrow memory from the global pool
  poolPtr.asInstanceOf[HostMemoryPool].borrowMemory(memToBrw)
  localPool += memToBrw
  newUsed = usedMem.addAndGet(memToBrw)
}
The Race Condition: The problem is that there are three separate atomic operations happening sequentially:
1. usedMem.updateAndGet computes memToBrw and caps the accounted usage at localPool
2. borrowMemory(memToBrw) blocks until the global pool grants the deficit
3. usedMem.addAndGet(memToBrw) adds the borrowed amount to the accounted usage
Between these operations, other threads can execute buffer close handlers (OnCloseHandler.onClosed), which decrement usedMem and may trigger tryFree/release, so the deficit computed in step 1 can be stale by the time steps 2 and 3 run, leaving the accounting briefly inconsistent.
The fix would be to make the entire allocation decision atomic, or use proper synchronization to prevent buffer closes from interfering with the allocation logic.
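To illustrate the "make the entire allocation decision atomic" option in one place, here is a sketch only, with hypothetical names (reserve, onBufferClosed, borrowFromGlobalPool); it is not the PR's MemoryBoundedAsyncRunner, and it simply folds the usage update and the borrow decision under one lock so a concurrent close cannot interleave:
// Sketch: one critical section covers the accounting and the borrow decision.
class AllocationAccounting(initialBudget: Long, borrowFromGlobalPool: Long => Unit) {
  private var localPool: Long = initialBudget
  private var usedMem: Long = 0L
  private val lock = new Object

  // Reserve `size` bytes of budget, borrowing any deficit while still holding
  // the lock, so a concurrent onBufferClosed cannot slip in between.
  def reserve(size: Long): Unit = lock.synchronized {
    val needed = usedMem + size
    if (needed > localPool) {
      val deficit = needed - localPool
      borrowFromGlobalPool(deficit) // blocking call; see trade-off note below
      localPool += deficit
    }
    usedMem += size
  }

  // Called when a buffer's refCount reaches 0.
  def onBufferClosed(size: Long): Unit = lock.synchronized {
    usedMem -= size
  }
}
The trade-off is that blocking inside the critical section re-raises the lock-ordering questions discussed earlier in this thread, which is one reason the PR instead tolerates occasional slight over-borrowing.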
@greptileai
Thanks for the clarification! You're absolutely correct about
Regarding the over-borrowing issue you mentioned with the TODO, you're right that it's a rare edge case. For completeness though, here are a few lightweight approaches you might consider if/when you revisit this:
The current design is quite elegant - the event-driven release mechanism via the OnCloseHandler ties the budget accounting directly to the actual buffer lifecycle.
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Additional Comments (5)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 470-472 (link) - logic: The allocate method requires holding the state lock but is called from user code that may not hold it. The readPartFile in GpuParquetScan.scala:2806 calls this allocator without acquiring the lock, which will fail the require check. Need to either remove the lock requirement or ensure all callers acquire it first. Check if callers like readPartFile properly acquire stateLock before calling allocations through this allocator. Does the Parquet reading code path acquire the runner's state lock before triggering allocations through the custom allocator?
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, lines 476-480 (link) - logic: Race condition: usedMem.updateAndGet updates atomically, but then memToBrw is calculated and used outside the atomic operation. Between computing memToBrw and calling borrowMemory, another thread could close a buffer (via OnCloseHandler) and reduce usedMem, making the borrow amount incorrect.
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/ResourcePools.scala, lines 266-267 (link) - logic: Deadlock prevention logic checks numOfBorrowWaiters + 1 >= numRunnerInFlight.get(), but numRunnerInFlight is decremented in a post-hook that runs outside the poolLock. This creates a window where the count could be stale, potentially allowing unnecessary over-commits or missing legitimate deadlock scenarios. Consider synchronizing the post-hook decrement with the pool lock or using a more robust deadlock detection mechanism.
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/ResourcePools.scala, lines 288-291 (link) - style: The cascading signal logic from borrowCondition to acquireCondition is clever but fragile. If multiple borrowers are awakened concurrently, each might try to signal acquireCondition, and the numOfBorrowWaiters == 0 check may not accurately reflect pending borrowers still in the process of being awakened. This is a minor edge case but could lead to suboptimal scheduling under high contention.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/io/async/AsyncRunners.scala, line 516 (link) - style: TODO should be addressed - returned memory could satisfy pending borrow requests but doesn't trigger them, leading to suboptimal resource utilization
11 files reviewed, 5 comments
build
Thanks for the clarifications! Let me address each point:
- State lock requirement: Got it - so the allocate method will only be called when the runner is wrapped as a RapidsFutureTask and scheduled by ResourceBoundedThreadExecutor, which ensures the state lock is properly held during execution. The direct calls from readPartFile would use the default allocator, not this custom one. That makes sense.
- Race condition concern: You're right that the atomic variables ensure correctness. Even if there's a race between calculating memToBrw and calling borrowMemory, the worst case is slightly over-borrowing, but the accounting remains consistent since usedMem and localPool updates are atomic.
- Deadlock prevention: Understood - the over-commit behavior is intentional for deadlock prevention, and some staleness in numRunnerInFlight is acceptable since it errs on the side of allowing progress rather than strict accounting.
- Cascading signal logic: This is indeed a subtle edge case. Under high contention with multiple borrowers, the numOfBorrowWaiters == 0 check might not perfectly capture the transitional state of borrowers being awakened. A potential improvement could be to track borrower state more precisely, but as you noted, this is a minor optimization issue.
- TODO comment: Since you mentioned this is trivial, a simple fix would be to check if there are pending borrow requests after returning memory and signal the borrowCondition accordingly. Something like:
// TODO: check if the returned memory satisfies the ongoing borrow requests
usedMem.addAndGet(-bufferSize)
bufCloseCond.signal() // awaken onClose waiting thread if exists
logDebug(s"[OnCloseHandler Closed] bufferSize=${bToStr(bufferSize)} for $r")
// Try to notify borrowers if memory was returned to the pool
if (getState != Running && !closeStarted.get()) {
  val (freed, _) = poolPtr.release(r, forcefully = false)
  if (freed > 0 && poolPtr.isInstanceOf[HostMemoryPool]) {
    // Potential optimization: signal borrowers about newly available memory
  }
}
The design looks solid overall - the tight coupling between virtual budget and physical memory allocation through the MemoryBoundedAsyncRunner serving as both resource holder and allocator is elegant and should provide much better resource management than the previous approach.
Thanks @sperlingxx ! Followup on the
Are there tests or would it be possible to add tests that force the deadlock conditions to occur and exercise those code paths?
Closes #13969
Overview
This PR tightly couples the virtual memory budget with the lifecycle of the actual memory buffer
HostMemoryBuffer used in the runner, by making MemoryBoundedAsyncRunner serve as both the resource holder and the HostMemoryAllocator. This design eliminates the previous disconnect between budget accounting and actual memory usage, enabling more precise resource management and improved concurrency.
Key Features
Performance comparison between V1 and V2
Design
Architecture Diagram
1. LocalPool Initialization
When a MemoryBoundedAsyncRunner is scheduled for execution, it acquires a preliminary memory budget from the global HostMemoryPool (managed by ResourceBoundedExecutor). This budget becomes the runner's LocalPool—a private memory quota that the runner manages independently during its execution lifecycle.
The initial LocalPool size is typically derived from the PartitionedFile split length, representing an upper-bound estimate of the memory required to process the assigned data partition.
2. LocalPool Structure: Used vs Free
The LocalPool is logically divided into two portions:
- Used: budget backing live HostMemoryBuffer instances (tracked by usedMem)
- Free: the remaining budget (localPool - usedMem)
This partitioning allows the runner to track exactly how much of its budget is actively in use versus how much remains available—enabling early release of over-claimed budget.
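A minimal sketch of this bookkeeping, assuming a simplified runner-side view (the real fields in MemoryBoundedAsyncRunner are localPool and an AtomicLong usedMem per the code references below; the class and method names here are hypothetical):
import java.util.concurrent.atomic.AtomicLong

// Sketch of the LocalPool used/free split (hypothetical names).
class LocalPoolView(initialBudget: Long) {
  @volatile var localPool: Long = initialBudget // budget acquired at scheduling time
  val usedMem = new AtomicLong(0L)              // bytes backing live HostMemoryBuffers

  // The portion of the budget not backing any live buffer.
  def freeMem: Long = localPool - usedMem.get()

  // Budget that could be handed back to the global pool once the runner
  // is no longer Running (the tryFree case described later).
  def releasable: Long = freeMem max 0L
}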
3. Allocation Flow: Local-First with Dynamic Borrowing
When a buffer allocation request arrives, the runner follows a local-first strategy:
- Local first: the request is charged against the runner's LocalPool when the free portion is sufficient
- Borrow on deficit: otherwise the runner borrows the shortfall from the global HostMemoryPool (a blocking call)
Borrowing Semantics:
- If the pool cannot satisfy the request and deadlock prevention kicks in, the borrow may leave the HostMemoryPool with a negative remaining balance, but guarantees forward progress
4. Deallocation Flow: Event-Driven Budget Return
Buffer release triggers an automatic cascade of budget management:
Step 1: Return to LocalPool
When a
HostMemoryBufferis closed (refCount reaches 0), the attachedOnCloseHandlerfires and returns the corresponding virtual budget back to the runner's LocalPool (decrementsusedMem).Step 2: Early Release via tryFree
If the runner has completed execution (no longer in
Runningstate), the handler triggerstryFreeto immediately return the free portion of LocalPool back to the globalHostMemoryPool. This releases over-claimed budget as early as possible, improving pool utilization and allowing other runners to be scheduled sooner.Step 3: Auto-Close on Full Drain
When LocalPool drops to zero—meaning all physical buffers have been closed and all budget has been returned—the runner can be safely closed automatically. This simplifies lifecycle management by eliminating explicit close coordination.
Code References
ResourcePools.scalaResourcePools.scalaResourcePools.scalaAsyncRunners.scalalocalPool(capacity) andusedMem(AtomicLong)AsyncRunners.scalaAsyncRunners.scalaAsyncRunners.scalaAsyncRunners.scalaResourceBoundedThreadExecutor.scalaResourceBoundedThreadExecutor.scala