
fix: Validate fp16.loss_scale is finite and non-negative#7889

Open
nathon-lee wants to merge 2 commits into deepspeedai:master from nathon-lee:fix_issue_7852

Conversation

@nathon-lee (Contributor) commented Mar 6, 2026

Validate fp16.loss_scale is finite and non-negative

Add a Pydantic field validator to DeepSpeedFP16Config that rejects NaN/inf/-inf and negative values for fp16.loss_scale, while still treating 0 as dynamic loss scaling. This prevents invalid configs from silently initializing and causing NaNs during training.

Fixes issue #7852
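The check itself is small. A minimal standalone sketch of the validation logic (illustrative only; the actual PR implements this as a Pydantic field validator on DeepSpeedFP16Config):

```python
import math

def validate_loss_scale(v):
    """Reject bool, NaN/inf/-inf, and negative values; 0 keeps dynamic loss scaling."""
    if isinstance(v, bool):  # bool is an int subclass, so test it before float()
        raise ValueError("fp16.loss_scale must be a number, not bool")
    v = float(v)
    if not math.isfinite(v):
        raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
    if v < 0:
        raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
    return v
```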

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0059a795a


Loss scaling value. Default value of 0 means dynamic loss scaling instead of static loss scale.
"""

@field_validator("loss_scale")


P2: Run loss_scale validator before type coercion

This validator is declared with the default mode="after", so Pydantic will coerce inputs to float first; as a result, the new isinstance(v, bool) guard never triggers because true/false become 1.0/0.0 before _validate_loss_scale runs. In configs that set fp16.loss_scale to a boolean, the value is still silently accepted, which defeats the stated validation goal and can unexpectedly switch to static scaling (true -> 1.0).
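The coercion behavior the comment relies on is plain Python, not Pydantic-specific: bool is a subclass of int, so coercing to float silently maps True/False to 1.0/0.0, which is why an after-mode validator never sees the original bool. A quick illustration:

```python
# bool is a subclass of int in Python, so float() coercion succeeds silently.
print(isinstance(True, int))  # True
print(float(True))            # 1.0 -> would select static loss scaling
print(float(False))           # 0.0 -> would select dynamic loss scaling
```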


Collaborator

I think this comment makes sense. Can you address it? @nathon-lee

Contributor Author

Thanks — I agree this comment makes sense. I’ll address it and push an update shortly. @tohtana

Contributor Author

Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).

@nathon-lee nathon-lee changed the title Validate fp16.loss_scale is finite and non-negative fix: Validate fp16.loss_scale is finite and non-negative Mar 6, 2026
@PKUWZP PKUWZP self-requested a review March 6, 2026 21:17
@PKUWZP (Collaborator) left a comment

Switch to mode="before" and add some tests.

"""

@field_validator("loss_scale")
@classmethod
Collaborator

  • Consider using mode="before" for the entire validator rather than splitting into two validators. A single
    mode="before" validator can handle both the bool check and the finite/negative checks:

        @field_validator("loss_scale", mode="before")
        @classmethod
        def _validate_loss_scale(cls, v):
            if isinstance(v, bool):
                raise ValueError("fp16.loss_scale must be a number, not bool")
            v = float(v)
            if not math.isfinite(v):
                raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
            if v < 0:
                raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
            return v

  • Test coverage: There are no tests included. A few unit tests in tests/unit/runtime/ asserting that invalid loss_scale values (-1, float('inf'), float('nan'), True) raise ValidationError would strengthen this PR and prevent regressions.

The existing pattern in the repo uses DeepSpeedFP16Config(loss_scale=...) directly, which makes such tests straightforward.
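A sketch of what such tests could look like, using stdlib unittest and a stand-in validate_loss_scale function in place of the real DeepSpeedFP16Config (the class and tests/unit/runtime/ path come from the review above; this stand-in is hypothetical, not DeepSpeed code):

```python
import math
import unittest

def validate_loss_scale(v):
    # Stand-in for the DeepSpeedFP16Config validator under discussion.
    if isinstance(v, bool):
        raise ValueError("loss_scale must be a number, not bool")
    v = float(v)
    if not math.isfinite(v) or v < 0:
        raise ValueError("loss_scale must be a finite, non-negative number")
    return v

class TestLossScaleValidation(unittest.TestCase):
    def test_valid_values(self):
        self.assertEqual(validate_loss_scale(0), 0.0)      # 0 = dynamic scaling
        self.assertEqual(validate_loss_scale(4096), 4096.0)

    def test_invalid_values(self):
        for bad in (-1, float("inf"), float("-inf"), float("nan"), True):
            with self.assertRaises(ValueError):
                validate_loss_scale(bad)
```

Against the real config class, each call would instead construct DeepSpeedFP16Config(loss_scale=bad) inside an assertRaises(ValidationError) block.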

Contributor Author

Thanks — good suggestion. I’ll consolidate into a single mode="before" validator and add unit tests (e.g. -1, inf, nan, True -> ValidationError) using DeepSpeedFP16Config(loss_scale=...). I’ll push an update shortly. @PKUWZP


@nathon-lee nathon-lee force-pushed the fix_issue_7852 branch 2 times, most recently from f0059a7 to 3ead20d Compare March 7, 2026 03:20
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>