
fix: Validate fp16.loss_scale is finite and non-negative#7889

Open
nathon-lee wants to merge 2 commits into deepspeedai:master from nathon-lee:fix_issue_7852

Conversation

@nathon-lee (Contributor) commented Mar 6, 2026

Validate fp16.loss_scale is finite and non-negative

Add a Pydantic field validator to DeepSpeedFP16Config that rejects NaN/inf/-inf and negative values for fp16.loss_scale, while still treating 0 as dynamic loss scaling. This prevents invalid configs from silently initializing and causing NaNs during training.

Fixes issue #7852
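The check itself is small. A minimal standalone sketch of the validation logic (illustrative only; the actual PR implements this as a Pydantic field validator on DeepSpeedFP16Config):

```python
import math

def validate_loss_scale(v):
    """Reject bool, NaN/inf/-inf, and negative values; 0 keeps dynamic loss scaling."""
    if isinstance(v, bool):  # bool is an int subclass, so test it before float()
        raise ValueError("fp16.loss_scale must be a number, not bool")
    v = float(v)
    if not math.isfinite(v):
        raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
    if v < 0:
        raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
    return v
```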

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0059a795a


Loss scaling value. Default value of 0 means dynamic loss scaling instead of static loss scale.
"""

@field_validator("loss_scale")


P2: Run loss_scale validator before type coercion

This validator is declared with the default mode="after", so Pydantic will coerce inputs to float first; as a result, the new isinstance(v, bool) guard never triggers because true/false become 1.0/0.0 before _validate_loss_scale runs. In configs that set fp16.loss_scale to a boolean, the value is still silently accepted, which defeats the stated validation goal and can unexpectedly switch to static scaling (true -> 1.0).
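The coercion behavior the comment relies on is plain Python, not Pydantic-specific: bool is a subclass of int, so coercing to float silently maps True/False to 1.0/0.0, which is why an after-mode validator never sees the original bool. A quick illustration:

```python
# bool is a subclass of int in Python, so float() coercion succeeds silently.
print(isinstance(True, int))  # True
print(float(True))            # 1.0 -> would select static loss scaling
print(float(False))           # 0.0 -> would select dynamic loss scaling
```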


Collaborator

I think this comment makes sense. Can you address it? @nathon-lee

Contributor Author

Thanks — I agree this comment makes sense. I’ll address it and push an update shortly. @tohtana

Contributor Author

Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).

@nathon-lee nathon-lee changed the title Validate fp16.loss_scale is finite and non-negative fix: Validate fp16.loss_scale is finite and non-negative Mar 6, 2026
@PKUWZP PKUWZP self-requested a review March 6, 2026 21:17
@PKUWZP (Collaborator) left a comment

Switch to mode="before" and add some tests.

"""

@field_validator("loss_scale")
@classmethod
Collaborator

  • Consider using mode="before" for the entire validator rather than splitting into two validators. A single
    mode="before" validator can handle both the bool check and the finite/negative checks:

        @field_validator("loss_scale", mode="before")
        @classmethod
        def _validate_loss_scale(cls, v):
            if isinstance(v, bool):
                raise ValueError("fp16.loss_scale must be a number, not bool")
            v = float(v)
            if not math.isfinite(v):
                raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
            if v < 0:
                raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
            return v

  • Test coverage: There are no tests included. A few unit tests in tests/unit/runtime/ asserting that invalid loss_scale values (-1, float('inf'), float('nan'), True) raise ValidationError would strengthen this PR and prevent regressions.

The existing pattern in the repo uses DeepSpeedFP16Config(loss_scale=...) directly, which makes such tests straightforward.
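A sketch of what such tests could look like, using stdlib unittest and a stand-in validate_loss_scale function in place of the real DeepSpeedFP16Config (the class and tests/unit/runtime/ path come from the review above; this stand-in is hypothetical, not DeepSpeed code):

```python
import math
import unittest

def validate_loss_scale(v):
    # Stand-in for the DeepSpeedFP16Config validator under discussion.
    if isinstance(v, bool):
        raise ValueError("loss_scale must be a number, not bool")
    v = float(v)
    if not math.isfinite(v) or v < 0:
        raise ValueError("loss_scale must be a finite, non-negative number")
    return v

class TestLossScaleValidation(unittest.TestCase):
    def test_valid_values(self):
        self.assertEqual(validate_loss_scale(0), 0.0)      # 0 = dynamic scaling
        self.assertEqual(validate_loss_scale(4096), 4096.0)

    def test_invalid_values(self):
        for bad in (-1, float("inf"), float("-inf"), float("nan"), True):
            with self.assertRaises(ValueError):
                validate_loss_scale(bad)
```

Against the real config class, each call would instead construct DeepSpeedFP16Config(loss_scale=bad) inside an assertRaises(ValidationError) block.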

Contributor Author

Thanks — good suggestion. I’ll consolidate into a single mode="before" validator and add unit tests (e.g. -1, inf, nan, True -> ValidationError) using DeepSpeedFP16Config(loss_scale=...). I’ll push an update shortly. @PKUWZP


@nathon-lee nathon-lee force-pushed the fix_issue_7852 branch 2 times, most recently from f0059a7 to 3ead20d Compare March 7, 2026 03:20
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>