[Experiment] Evaluate perf impact of striped vs. blocked SLM read/write 1D copy atoms#631
Closed
sanchitintel wants to merge 1 commit into main
BMG & PVC support both blocked & striped SLM <-> register transfers. `store.slm.d32x4.a32` and `store.slm.d64x2.a32` store 128 bits per work-item but cause bank conflicts. This PR switches to striped loads & stores (each work-item transfers one item at a time) to check the performance impact (and any potential breakage). Even if we have BF16 data to move to/from SLM, we can reinterpret-cast it to a dtype whose size equals the width of each lane's bank. Performance characteristics of either approach don't seem to be publicly available.
Closing this PR because it has not been updated for more than 90 days. The project has changed a lot and this PR is no longer applicable; please create a new PR if you still need it.
Description
The SYCL group load/store APIs used for 1D SLM <-> register copies support both blocked & striped copies. For example, `store.slm.d32x4.a32` and `store.slm.d64x2.a32` store 128 bits per work-item (blocked) but cause bank conflicts. Blocked layout is the default data placement for both APIs.
Switching to striped loads & stores (each work-item transfers one item at a time) as an experiment to check the performance impact (and any potential breakage). Striped stores seem to use messages such as `store.slm.d64.a32` or `store.slm.d32x2.a32`; if the bank width of BMG/PVC is 64 bits (the documentation states 32 bits, but that part may not have been updated for Xe12), then they write to SLM without bank conflicts.
Performance characteristics of either type of instruction don't seem to be publicly available. It's even possible that the first type performs better because it needs fewer block messages (each transfers twice as much data as an instruction of the second type), even though it entails bank conflicts.
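To make the bank-conflict argument concrete, here is a minimal host-side model, not the actual copy atom or hardware behaviour. It assumes the usual definitions of the two placements (blocked: work-item `wi` owns a contiguous run of items; striped: consecutive work-items own consecutive items) and a simple "one item per work-item per step" access pattern. The names `Layout`, `element_index` and `max_bank_conflict`, as well as the bank count and bank width passed in, are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Layout { Blocked, Striped };

// Index of the j-th item owned by work-item `wi`.
// Blocked: each work-item owns a contiguous run of `items_per_wi` items.
// Striped: item j of work-item wi sits at j * num_wi + wi.
std::size_t element_index(Layout layout, std::size_t wi, std::size_t j,
                          std::size_t num_wi, std::size_t items_per_wi) {
    return layout == Layout::Blocked ? wi * items_per_wi + j
                                     : j * num_wi + wi;
}

// Worst-case number of work-items that land on the same bank in any step j.
// Simplified model: in step j every work-item accesses its j-th item, and
// bank = (byte_address / bank_bytes) % num_banks. Accesses that fall into the
// same bank word are counted as conflicts too, which real hardware may merge.
std::size_t max_bank_conflict(Layout layout, std::size_t num_wi,
                              std::size_t items_per_wi, std::size_t elem_bytes,
                              std::size_t bank_bytes, std::size_t num_banks) {
    assert(bank_bytes > 0 && num_banks > 0);
    std::size_t worst = 0;
    for (std::size_t j = 0; j < items_per_wi; ++j) {
        std::vector<std::size_t> hits(num_banks, 0);
        for (std::size_t wi = 0; wi < num_wi; ++wi) {
            const std::size_t byte_addr =
                element_index(layout, wi, j, num_wi, items_per_wi) * elem_bytes;
            const std::size_t bank = (byte_addr / bank_bytes) % num_banks;
            worst = std::max(worst, ++hits[bank]);
        }
    }
    return worst;
}
```

With the assumed parameters of 16 work-items, four 32-bit items each, 32-bit banks and 16 banks, `max_bank_conflict(Layout::Blocked, 16, 4, 4, 4, 16)` reports a 4-way conflict in this model, while the striped layout reports 1 (no conflict). Whether the real messages behave this way on BMG/PVC is exactly what the experiment is meant to find out.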
Even if we have BF16 data to move to/from SLM, we could reinterpret cast it to a dtype whose size is equal to the width of each lane's bank.
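The reinterpretation idea can be sketched on the host as follows, assuming a 32-bit bank width and modelling BF16 values as raw `uint16_t` bit patterns (both are assumptions for this example). The helper names `pack_bf16_to_u32` and `unpack_u32_to_bf16` are hypothetical; `std::memcpy` is used instead of a pointer cast to stay within C++ aliasing rules.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Raw BF16 bit patterns, modelled as 16-bit integers (assumption).
using bf16_bits = std::uint16_t;

// View pairs of BF16 values as 32-bit words so that each transferred item
// matches an assumed 32-bit bank width. The bytes are copied unchanged.
std::vector<std::uint32_t> pack_bf16_to_u32(const std::vector<bf16_bits>& src) {
    assert(src.size() % 2 == 0);  // need whole 32-bit words
    std::vector<std::uint32_t> dst(src.size() / 2);
    if (!src.empty()) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(bf16_bits));
    }
    return dst;
}

// Inverse view: recover the BF16 bit patterns from the 32-bit words.
std::vector<bf16_bits> unpack_u32_to_bf16(const std::vector<std::uint32_t>& src) {
    std::vector<bf16_bits> dst(src.size() * 2);
    if (!src.empty()) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(std::uint32_t));
    }
    return dst;
}
```

Because the bytes are never reordered, packing followed by unpacking returns the original BF16 sequence; only the item type seen by the copy changes.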
While 1D copies to/from global memory also support striped loads/stores, bank conflicts aren't an issue there, so I didn't modify the corresponding copy atoms.
Type
Performance
Testing
Performance
cc @pengzhao-intel