fix: FP8 fallback for AIU addons running on CPU #200
andrea-fasoli wants to merge 9 commits into main from
Conversation
Signed-off-by: Andrea Fasoli <andrea.fasoli@ibm.com>
@ani300 need your eyes on this
    # Perform mock FP8xFP8 matmul
    if is_cpu and not is_per_tensor and not SUPPORTS_CPU_PER_CHANNEL_FP8:
        x_dequant = qx.dequantize()
        w_dequant = qweight.dequantize()
Do we expect this to affect the quality significantly?
If anything, it'll improve it on CPU
Makes sense. Since we use these numbers to compare against accelerator results, could this cause wider deviation between those results? Unless the diff is quite small.
we're downcasting back to fp8 anyways, so it shouldn't be too different.
I would also expect a very minimal discrepancy in terms of generation compared to the earlier operation.
There may be some runtime overhead, as this new fallback is likely less performant than calling `torch._scaled_mm`. To clarify: the potential overhead affects CPU validation only, with no impact at all on AIU runtime.
ani300
left a comment
lgtm! the fix makes sense
Is it worth adding a test to check that the combination that was failing before works now and in the future?
Signed-off-by: Andrea Fasoli <andrea.fasoli@ibm.com>
Description of the change
Starting from PyTorch 2.10, `torch._scaled_mm` no longer supports FP8 matmul on CPU for any quantization scheme other than per-tensor. `torch._scaled_mm` (reached through a call to `addmm_float8_unwrapped_inference`) is currently called by the FP8 AIU addons when the model runs on CPU.

This PR implements a fallback for this scenario: we perform a mock FP8 x FP8 matmul on CPU using `torch.nn.functional.linear` between quantized/dequantized activations and dequantized weights. Note that we do not simply dequantize the weights.

Related issues or PRs
[internal issue]
How to verify the PR
Example of a test that should pass, run on a pod with 4 AIUs, in PF mode, in a PyTorch 2.10 env (set up env vars according to your case; AFTU = aiu-fms-testing-utils repo):
Was the PR tested
Checklist for passing CI/CD:
- `git commit --signoff` or equivalent
- `pre-commit`
- `tox -e unit`