[Feat] Add FP8 training support#758

Merged
rchardx merged 58 commits intomainfrom
sxj/fp8_train
Dec 31, 2025

Conversation

@fishcrap
Collaborator

@fishcrap fishcrap commented Dec 23, 2025

Description

This PR adds comprehensive FP8 (8-bit floating point) training support to AReaL, enabling memory-efficient training with low precision while maintaining training stability. The implementation includes:

  • FP8 quantization/dequantization utilities: New fp8_utils.py and fp8_kernels.py modules providing blockwise quantization support
  • CLI configuration: Extended TrainEngineConfig with FP8-related options (fp8 mode, recipe, parameter quantization, etc.)
  • Model loading/saving: Updated HuggingFace model loading and saving to handle FP8 weights with proper conversion between PyTorch FP8 and Transformer Engine FP8 formats
  • Megatron engine integration: Enhanced MegatronEngine to support FP8 training with proper configuration propagation
  • Comprehensive test suite: Added extensive tests for FP8 conversion, BF16 comparison, and gradient correctness

The implementation supports the blockwise scheme, with integration into Transformer Engine's FP8 infrastructure for efficient GEMM operations.
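For context, here is a minimal PyTorch sketch of the blockwise idea (function names and the 128x128 block size are illustrative assumptions, not the actual `fp8_utils.py` API; the real implementation relies on Triton kernels in `fp8_kernels.py`, and `torch.float8_e4m3fn` requires PyTorch >= 2.1):

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def blockwise_quantize(x: torch.Tensor, block: int = 128):
    """Quantize a 2D tensor to FP8 with one scale per (block x block) tile."""
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0, "pad to a multiple of the block size first"
    tiles = x.reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax  # per-tile scale into the FP8 range
    q = (tiles * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    inv_scale = (amax / FP8_E4M3_MAX).squeeze(1).squeeze(-1)  # kept for dequantization
    return q.reshape(rows, cols), inv_scale

def blockwise_dequantize(q: torch.Tensor, inv_scale: torch.Tensor, block: int = 128):
    rows, cols = q.shape
    tiles = q.to(torch.float32).reshape(rows // block, block, cols // block, block)
    return (tiles * inv_scale[:, None, :, None]).reshape(rows, cols)
```

In the actual training path, the per-block scales travel alongside the FP8 data into Transformer Engine's GEMMs rather than being materialized as in the dequantize helper above.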

Related Issue

Fixes #(issue)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A - This is a new feature that adds optional FP8 support without breaking existing functionality.

Additional Context

Training Curve

  • reward (fp8 vs bf16): [training-curve image comparing FP8 and BF16 reward]

TODO:

  • Memory profiling
  • Training time reduction

@gemini-code-assist
Contributor

Summary of Changes

Hello @fishcrap, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades AReaL by integrating comprehensive FP8 training support. The primary goal is to enable memory-efficient training with reduced precision without compromising model stability. This is achieved through the introduction of new FP8 quantization and dequantization utilities, extensive configuration options via the CLI, and updates to model loading and saving processes to handle FP8 weights. The core MegatronEngine has been adapted to leverage these FP8 capabilities, and new tests ensure the reliability of these low-precision operations.

Highlights

  • Comprehensive FP8 Training Support: This PR introduces full 8-bit floating point (FP8) training capabilities to AReaL, enabling more memory-efficient training while striving to maintain training stability.
  • FP8 Quantization Utilities: New modules fp8_utils.py and fp8_kernels.py have been added, providing blockwise and per-tensor quantization and dequantization functionalities, including Triton-based kernels for efficient operations.
  • Extended CLI Configuration: The TrainEngineConfig and MegatronEngineConfig have been significantly extended with numerous FP8-related options, allowing users to configure FP8 mode, scaling recipes, parameter quantization, and other precision-related settings via the command-line interface.
  • Enhanced Model Loading and Saving: HuggingFace model loading and saving mechanisms have been updated to correctly handle FP8 weights, including proper conversion between PyTorch FP8 and Transformer Engine FP8 formats, and dequantization when necessary.
  • MegatronEngine Integration: The MegatronEngine has been enhanced to seamlessly support FP8 training, ensuring that FP8 configurations are correctly propagated and applied throughout the training process.
  • New Test Suite: A comprehensive test suite (test_fp8_conversion.py) has been added to verify the correctness of FP8 conversion, compare results with BF16 baselines, and ensure gradient accuracy.
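To illustrate the loading highlight above: FP8 HF checkpoints (DeepSeek-style) store each quantized weight together with a per-block inverse-scale tensor, and recovering a high-precision weight amounts to broadcasting those scales back over their tiles. A hedged sketch with hypothetical names (the actual code path converts into Transformer Engine's FP8 format instead of materializing BF16):

```python
import torch

def dequantize_hf_fp8_weight(weight: torch.Tensor,           # float8_e4m3fn, (rows, cols)
                             weight_scale_inv: torch.Tensor,  # float32, one entry per tile
                             block: int = 128) -> torch.Tensor:
    rows, cols = weight.shape
    # Broadcast each tile's inverse scale over its block x block region.
    scales = weight_scale_inv.repeat_interleave(block, dim=0)
    scales = scales.repeat_interleave(block, dim=1)[:rows, :cols]
    return (weight.to(torch.float32) * scales).to(torch.bfloat16)
```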


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces comprehensive FP8 training support, including new utilities for quantization/dequantization, CLI configurations, and updates to model loading/saving to handle FP8 weights. The changes are extensive and well-structured. I've identified a few areas with TODO or FIXME comments in the new code, particularly in tests and utility functions, that should be addressed to ensure correctness and clarity. The overall implementation seems robust, with good integration into the existing MegatronEngine and the addition of a comprehensive test suite.

Comment thread areal/models/mcore/hf_load.py Outdated
Comment thread areal/engine/megatron_engine.py
Comment thread areal/tests/test_fp8_conversion.py Outdated
Comment thread areal/utils/fp8_utils.py Outdated
Comment thread areal/utils/megatron.py Outdated
Comment thread areal/utils/mcore/pipeline_parallel.py
@fishcrap fishcrap changed the title Sxj/fp8 train [Feat] Add FP8 training support Dec 24, 2025
@fishcrap fishcrap marked this pull request as ready for review December 25, 2025 12:07
@rchardx
Collaborator

rchardx commented Dec 26, 2025

I think end-to-end training testcases should be added to areal/tests/grpo/ or areal/tests/sft/ through new yaml configurations and new test entries.
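A hypothetical shape for such a test entry (the launcher module, entry-point script, config path, and skip condition are assumptions about the AReaL test layout, not its actual harness):

```python
import subprocess

import pytest
import torch

requires_fp8_gpu = pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.get_device_capability()[0] < 9,
    reason="FP8 GEMMs need Hopper-class (sm90+) GPUs",
)

@requires_fp8_gpu
def test_sft_fp8_smoke():
    # Assumed entry point and config path; a real test would point at the
    # new YAML configuration added under areal/tests/sft/.
    result = subprocess.run(
        ["python", "-m", "areal.launcher.local", "examples/sft/train.py",
         "--config", "areal/tests/sft/fp8_smoke.yaml"],
        capture_output=True, text=True, timeout=1800,
    )
    assert result.returncode == 0, result.stderr
```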

@fishcrap fishcrap requested review from garrett4wade and removed request for garrett4wade December 26, 2025 05:36
Comment thread areal/models/mcore/hf_save.py
Comment thread areal/api/cli_args.py Outdated
Comment thread areal/utils/fp8_utils.py Outdated
Comment thread areal/tests/fp8/test_fp8_rmsnorm.py
@garrett4wade
Collaborator

> I think end-to-end training testcases should be added to areal/tests/grpo/ or areal/tests/sft/ through new yaml configurations and new test entries.

@rchardx It would be good, but the tests won't run on the CI A100 nodes. We can just run them offline.

Collaborator

@rchardx rchardx left a comment


The core functionality looks solid. Once these changes (including the other requested ones) are addressed, this PR should be ready to merge.

Comment thread areal/models/mcore/hf_load.py Outdated
Comment thread areal/engine/megatron_engine.py Outdated
Comment thread areal/engine/megatron_engine.py
Comment thread areal/engine/megatron_engine.py Outdated
Comment thread areal/utils/fp8_utils.py Outdated
Comment thread areal/utils/fp8_utils.py Outdated
Comment thread areal/api/cli_args.py Outdated
Comment thread areal/models/mcore/hf_load.py Outdated
Comment thread areal/engine/megatron_engine.py Outdated
Comment thread areal/api/cli_args.py Outdated
@rchardx
Collaborator

rchardx commented Dec 27, 2025

>> I think end-to-end training testcases should be added to areal/tests/grpo/ or areal/tests/sft/ through new yaml configurations and new test entries.
>
> @rchardx It would be good, but the tests won't run on the CI A100 nodes. We can just run them offline.

Agreed.

Comment thread areal/engine/megatron_engine.py Outdated
Comment thread areal/engine/megatron_engine.py Outdated
Comment thread areal/utils/megatron.py Outdated
Comment thread areal/engine/megatron_engine.py Outdated
fishcrap and others added 5 commits December 30, 2025 11:53
- Move fp8 utilities to areal/utils/fp8/ with clearer module separation
- Implement UE8M0 quantization locally, eliminating sglang import
- Extract common utils: areal/utils/math.py, areal/utils/cuda.py
- Improve constants.py organization and naming
- Clarify high_precision_init_val comment for FP8 HF model loading

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Use lazy initialization for DeepGEMM detection to avoid import-time
  CUDA access failures on CPU-only environments
- Add informative error message for UE8M0 block size assertion
- Document FP8 E4M3 max value (448.0) in quantization code
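
The lazy-detection pattern described in the commit above might look like this (illustrative names; the actual helper lives somewhere under `areal/utils/`):

```python
import functools

@functools.lru_cache(maxsize=1)
def deep_gemm_available() -> bool:
    """Probe CUDA and DeepGEMM on first call rather than at import time,
    so CPU-only environments can still import the module."""
    import torch
    if not torch.cuda.is_available():
        return False
    try:
        import deep_gemm  # noqa: F401  (optional dependency)
    except ImportError:
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 9  # DeepGEMM targets Hopper-class GPUs
```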
Collaborator

@rchardx rchardx left a comment


LGTM!

@rchardx rchardx merged commit 89dda13 into main Dec 31, 2025
1 check passed
@rchardx rchardx deleted the sxj/fp8_train branch December 31, 2025 06:33
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026
SathyaGnanakumar pushed a commit to danielkiely/AReaL that referenced this pull request Apr 29, 2026