Skip to content

[SPARK-56199][CORE] Read fallback storage blocks asynchronously and multithread#55003

Open
EnricoMi wants to merge 8 commits intoapache:masterfrom
G-Research:fallback-storage-multithread-read
Open

[SPARK-56199][CORE] Read fallback storage blocks asynchronously and multithread#55003
EnricoMi wants to merge 8 commits intoapache:masterfrom
G-Research:fallback-storage-multithread-read

Conversation

@EnricoMi
Copy link
Contributor

What changes were proposed in this pull request?

This changes how ShuffleBlockFetcherIterator fetches blocks from Fallback Storage. It now treats fallback storage blocks separate from local and more like remote blocks. It considers these blocks when updating requestsInFlight and bytesInFlight, so that not too many bytes are hold in memory while iterating over fallback storage blocks. Blocks are read asynchronously and multi-threaded.

Why are the changes needed?

Currently, fallback storage blocks are treated by ShuffleBlockFetcherIterator as local blocks, and only on an exception code path (when the block cannot be found as a local block), it is read from fallback storage. Fallback storage blocks are read one-by-one (single-threaded) and they block reading local blocks (synchronously).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests are updated accordingly.

Was this patch authored or co-authored using generative AI tooling?

No.

Note: This includes #54268

Removes "Fast fail when failed to get fallback storage blocks" as fallback storage blocks
are fetched concurrently and fast fail is not guaranteed any more.
This blocks cleanup() calling fallbackStorageReadPool.shutdownNow() while
futures are locking results to put FailureFetchResults. Otherwise, that put
in catch clauses would be interrupted and that exception kills the executor.

Logging errors only if !isZombie, meaning the iterator is not yet cleaning up.

Further, FailureFetchResult are putFirst to stop iteration as quickly as possible.

Finally, fallbackStorageReadPool.shutdownNow() is only called in cleanup().
@EnricoMi
Copy link
Contributor Author

@attilapiros you might be interested in this given your comment #45228 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant