fix: FP8 fallback for AIU addons running on CPU #200

Open

andrea-fasoli wants to merge 9 commits into main from fp8_cpu

Conversation

@andrea-fasoli (Collaborator) commented Mar 19, 2026

Description of the change

Starting from PyTorch 2.10, torch._scaled_mm no longer supports FP8 matmul on CPU for any quantization scheme other than per-tensor. The FP8 AIU addons currently call torch._scaled_mm (via addmm_float8_unwrapped_inference) when the model runs on CPU.

This PR implements a fallback for this scenario: we perform a mock FP8 x FP8 matmul on CPU using torch.nn.functional.linear between quantized-then-dequantized activations and dequantized weights. Note that we do not simply dequantize the weights: the activations also go through a quantize/dequantize round trip, so the result reflects FP8 precision loss.
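The fallback can be sketched in plain Python. This is an illustrative, simplified model only: a uniform symmetric quantizer stands in for true FP8 E4M3 rounding, and the helper names (make_scale, quant_dequant, mock_fp8_linear) are hypothetical; the actual PR operates on torch tensors via qx.dequantize(), qweight.dequantize(), and torch.nn.functional.linear.

```python
Q_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def make_scale(values):
    # Per-tensor scale mapping the max magnitude onto the FP8 range.
    # (The PR also handles per-channel scales: one scale per weight row.)
    return max(abs(v) for v in values) / Q_MAX or 1.0

def quant_dequant(values, scale):
    # Snap each value to an FP8-like grid, then map it back to float.
    # A uniform grid is used here for simplicity; real FP8 is non-uniform.
    return [round(v / scale) * scale for v in values]

def mock_fp8_linear(x, w_rows, x_scale, w_scales):
    # Mock FP8 x FP8 matmul: quantize/dequantize the activations (so FP8
    # precision loss is reflected), dequantize the weights per channel,
    # then run an ordinary float linear layer.
    x_qdq = quant_dequant(x, x_scale)
    out = []
    for row, row_scale in zip(w_rows, w_scales):
        w_dq = quant_dequant(row, row_scale)  # dequantized weight row
        out.append(sum(a * b for a, b in zip(x_qdq, w_dq)))
    return out

x = [0.5, -1.25, 2.0]                      # activation vector
w = [[1.0, 0.0, -0.5], [0.25, 0.75, 1.5]]  # two output channels
y = mock_fp8_linear(x, w, make_scale(x), [make_scale(r) for r in w])
```

The key point mirrors the PR: the activations are not used at full precision; they go through a quantize/dequantize round trip first, so the CPU result tracks what an FP8 matmul would produce.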

Related issues or PRs

[internal issue]

How to verify the PR

Example of a test that should pass, run on a pod with 4 AIUs, in PF mode, in a PyTorch 2.10 env (set up env vars according to your case; AFTU = aiu-fms-testing-utils repo):

torchrun --nproc-per-node 4 ${AFTU_PATH}/scripts/drive_paged_programs.py --model_variant ${FP8_MODEL_PATH} --max_new_tokens 128 --timing per-token --dataset_type sharegpt --dataset_path ${DATASET_PATH} --test_type metrics --program_criteria_json_path ${PROGRAMS_FILE} --programs ${SELECTED_PROGRAM} --attention_type paged_fp8 --save_validation_info_outputs --validation_info_outputs_dir ${OUTPUT_DIR} --prefill_chunk_size 1024 --cross_entropy_threshold 2.6 --failure_rate_threshold 0.1 --prioritize_large_batch_sizes --enforce_homogeneous_prompt_programs --distributed

Was the PR tested

  • I have ensured all unit tests pass

Checklist for passing CI/CD:

  • All commits are signed showing "Signed-off-by: Name <email@domain.com>" with git commit --signoff or equivalent
  • PR title and commit messages adhere to Conventional Commits
  • Contribution is formatted with pre-commit
  • Contribution passes all unit tests with tox -e unit

Signed-off-by: Andrea Fasoli <andrea.fasoli@ibm.com>
@andrea-fasoli (Collaborator, Author):

@ani300 need your eyes on this

# Perform mock FP8xFP8 matmul
if is_cpu and not is_per_tensor and not SUPPORTS_CPU_PER_CHANNEL_FP8:
    x_dequant = qx.dequantize()
    w_dequant = qweight.dequantize()
Contributor:

Do we expect this to affect the quality significantly?

Contributor:

If anything, it'll improve it on CPU.

Contributor:

Makes sense. Since we use these numbers to compare against accelerator results, could this cause wider deviation between those results? Unless the diff is quite small.

Contributor:

we're downcasting back to fp8 anyways, so it shouldn't be too different.

Collaborator (Author):

I would also expect a very minimal discrepancy in terms of generation compared to the earlier operation.

There may be some runtime overhead, as this new fallback is likely less performant than calling torch._scaled_mm. To clarify: potential overheads on CPU validation only, no impact at all on AIU runtime.
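To illustrate why the discrepancy should stay small, here is a simplified sketch (again a uniform quantizer standing in for FP8 E4M3 rounding; variable names are illustrative, not from the PR):

```python
Q_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def quant_dequant(values, scale):
    # Snap values to a uniform FP8-like grid, then map them back to float.
    return [round(v / scale) * scale for v in values]

scale = 2.0 / Q_MAX  # per-tensor scale for data with max magnitude 2.0

# Values already on the quantization grid survive the round trip unchanged.
on_grid = [i * scale for i in (-448, -100, 0, 64, 448)]
assert quant_dequant(on_grid, scale) == on_grid

# Off-grid values move by at most half a quantization step, which bounds
# the per-element deviation the fallback can introduce.
off_grid = [0.1234, -1.9876, 0.5555]
err = max(abs(a - b) for a, b in zip(off_grid, quant_dequant(off_grid, scale)))
```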

@ani300 (Contributor) left a review:

lgtm! the fix makes sense

@ani300 (Contributor) commented Mar 19, 2026

Is it worth adding a test to check that the combination that was failing before works now and in the future?
