fix: Validate fp16.loss_scale is finite and non-negative #7889
nathon-lee wants to merge 2 commits into deepspeedai:master
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0059a795a
Quoted diff context:

```python
    Loss scaling value. Default value of 0 means dynamic loss scaling instead of static loss scale.
    """

    @field_validator("loss_scale")
```
Run loss_scale validator before type coercion
This validator is declared with the default mode="after", so Pydantic will coerce inputs to float first; as a result, the new isinstance(v, bool) guard never triggers because true/false become 1.0/0.0 before _validate_loss_scale runs. In configs that set fp16.loss_scale to a boolean, the value is still silently accepted, which defeats the stated validation goal and can unexpectedly switch to static scaling (true -> 1.0).
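The coercion pitfall described above can be shown with a minimal, self-contained sketch. The models below are stand-ins (not the actual DeepSpeedFP16Config); the point is only the difference between the default mode="after" validator and a mode="before" one:

```python
from pydantic import BaseModel, field_validator


class AfterModel(BaseModel):
    """Stand-in with the default mode="after" validator (the buggy pattern)."""
    loss_scale: float = 0.0

    @field_validator("loss_scale")  # default mode="after": runs post-coercion
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):  # never fires: True was already coerced to 1.0
            raise ValueError("fp16.loss_scale must be a number, not bool")
        return v


class BeforeModel(BaseModel):
    """Same field, but the validator sees the raw input before coercion."""
    loss_scale: float = 0.0

    @field_validator("loss_scale", mode="before")
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):
            raise ValueError("fp16.loss_scale must be a number, not bool")
        return v


print(AfterModel(loss_scale=True).loss_scale)  # 1.0 -- bool silently accepted
try:
    BeforeModel(loss_scale=True)
except ValueError as e:  # pydantic's ValidationError subclasses ValueError
    print("rejected:", e)
```

With mode="before", a boolean in the config raises at validation time instead of silently becoming a static loss scale of 1.0.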
I think this comment makes sense. Can you address it? @nathon-lee
Thanks — I agree this comment makes sense. I’ll address it and push an update shortly. @tohtana
Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).
PKUWZP left a comment
Switch to mode="before" and add some tests.
Quoted diff context:

```python
    """

    @field_validator("loss_scale")
    @classmethod
```
- Consider using mode="before" for the entire validator rather than splitting into two validators. A single mode="before" validator can handle both the bool check and the finite/negative checks:

```python
@field_validator("loss_scale", mode="before")
@classmethod
def _validate_loss_scale(cls, v):
    if isinstance(v, bool):
        raise ValueError("fp16.loss_scale must be a number, not bool")
    v = float(v)
    if not math.isfinite(v):
        raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
    if v < 0:
        raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
    return v
```

- Test coverage: There are no tests included. A few unit tests in tests/unit/runtime/ asserting that invalid loss_scale values (-1, float('inf'), float('nan'), True) raise ValidationError would strengthen this PR and prevent regressions.

The existing pattern in the repo uses DeepSpeedFP16Config(loss_scale=...) directly, which makes such tests straightforward.
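A sketch of the suggested coverage, written with plain assertions so it is self-contained (the repo's suite would use pytest and import the real DeepSpeedFP16Config; FP16Config below is a local stand-in mirroring the proposed validator):

```python
import math

from pydantic import BaseModel, ValidationError, field_validator


class FP16Config(BaseModel):
    """Local stand-in mirroring the proposed loss_scale validator."""
    loss_scale: float = 0.0

    @field_validator("loss_scale", mode="before")
    @classmethod
    def _validate_loss_scale(cls, v):
        if isinstance(v, bool):
            raise ValueError("fp16.loss_scale must be a number, not bool")
        v = float(v)
        if not math.isfinite(v):
            raise ValueError("fp16.loss_scale must be a finite number (not inf/-inf/nan)")
        if v < 0:
            raise ValueError("fp16.loss_scale must be >= 0 (0 enables dynamic loss scaling)")
        return v


def check_rejected(value):
    """Return True if the config rejects the value with a ValidationError."""
    try:
        FP16Config(loss_scale=value)
    except ValidationError:
        return True
    return False


# The four invalid inputs called out in the review, plus valid values.
assert all(check_rejected(v) for v in (-1, float("inf"), float("nan"), True))
assert FP16Config(loss_scale=0).loss_scale == 0.0  # 0 keeps dynamic scaling
assert FP16Config(loss_scale=65536).loss_scale == 65536.0
```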
Thanks — good suggestion. I’ll consolidate into a single mode="before" validator and add unit tests (e.g. -1, inf, nan, True -> ValidationError) using DeepSpeedFP16Config(loss_scale=...). I’ll push an update shortly. @PKUWZP
Thanks for the review — addressed: the loss_scale validator now runs with mode="before" (so bools are rejected prior to coercion) and I added unit tests for (-1, inf, nan, True).
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Validate fp16.loss_scale is finite and non-negative
Add a Pydantic field validator to DeepSpeedFP16Config to reject NaN/inf/-inf and negative values for fp16.loss_scale (while keeping 0 as dynamic loss scaling). This prevents invalid configs from silently initializing and causing NaNs during training.
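For context, this is the kind of config section the validator guards (a hypothetical DeepSpeed JSON snippet; per the docstring above, loss_scale of 0 keeps dynamic loss scaling, while values like -1, NaN, or a boolean would now fail at config-validation time instead of producing NaNs mid-training):

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  }
}
```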
Fix issue #7852