[Clean] Fix warnings and abnormal outputs during training #393

lmyybh · 2025-12-26T07:50:45Z

Motivation

During training, I noticed several warnings and abnormal outputs cluttering the logs:

- FSDP deprecation warnings when saving checkpoints using the old state_dict_type() context manager
- NCCL initialization warnings due to incorrect device binding order in distributed setup
- print_on_rank0 not producing any output due to improper logger configuration
- Missing model state reset after evaluation, which could lead to unexpected behavior in subsequent training steps
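To illustrate the print_on_rank0 item, here is a minimal sketch of one common way logger.info() ends up silent, next to a print()-based helper. It assumes the rank is exposed through the RANK environment variable (as launchers such as torchrun do); the helper bodies are illustrative and are not copied from the repository.

```python
import logging
import os


def is_rank0() -> bool:
    # Assumption: the launcher exports RANK; a single-process run counts as rank 0.
    return int(os.environ.get("RANK", "0")) == 0


def print_on_rank0(message: str) -> None:
    # print() does not depend on logger levels or handlers, so the message
    # always reaches stdout on rank 0.
    if is_rank0():
        print(message, flush=True)


if __name__ == "__main__":
    logger = logging.getLogger("unconfigured_demo")
    # With no level or handler configured, this logger inherits the root
    # logger's default level (WARNING), so INFO records are dropped.
    logger.info("this line is normally not shown")
    print_on_rank0("this line is shown on rank 0")
```

The trade-off is that print() bypasses log formatting and filtering, which is acceptable for a small rank-0 status helper.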

Modifications

- Checkpoint Saving: Replace deprecated state_dict_type with get_model_state_dict
- Distributed Initialization: Pass device_id parameter to suppress NCCL warnings
- print_on_rank0: Replace logger.info() with print()
- Evaluation:
  - Call draft_model.train() after evaluation to ensure correct model state
  - Show progress bar only on rank 0
  - Print metrics in record_metrics only during evaluation
- Dependencies: Add tensorboard package
- Optimizer: Replace Optimizer with BF16Optimizer
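The evaluation fix can be sketched as a small context manager that puts the model back into its previous mode once evaluation finishes. This is an alternative formulation for illustration only: the PR itself calls draft_model.train() directly after evaluation, and the name evaluation_mode is hypothetical. The sketch relies only on the training attribute and the eval() / train(mode) methods that torch.nn.Module provides, so it does not import torch.

```python
from contextlib import contextmanager


@contextmanager
def evaluation_mode(model):
    """Switch `model` to eval mode, then restore its previous mode on exit."""
    was_training = model.training  # remember the mode before evaluation
    model.eval()
    try:
        yield model
    finally:
        # Runs even if evaluation raises, so later training steps never
        # execute with dropout and similar layers still in eval mode.
        model.train(was_training)
```

Usage would look like `with evaluation_mode(draft_model): run_eval(...)`, where run_eval stands in for the evaluation loop. For the progress bar item, tqdm accepts a `disable` argument, so passing `disable=(rank != 0)` keeps the bar on rank 0 only.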

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.

- Update FSDP checkpoint API to avoid deprecation warnings
- Fix distributed init order to avoid device binding warnings
- Fix tqdm progress bar display on non-rank-0 processes
- Replace logger with print to avoid logging warnings

FrankLeeeee · 2025-12-26T08:24:39Z

Hi, can you apply pre-commit for your code?

lmyybh · 2025-12-26T08:44:42Z

Hi, can you apply pre-commit for your code?

Done

sleepcoo · 2025-12-29T09:29:03Z

Can you fix the unit test?

  File "/__w/SpecForge/SpecForge/specforge/distributed.py", line 76, in init_distributed
    local_rank = int(os.environ["LOCAL_RANK"])
                     ~~~~~~~~~~^^^^^^^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'LOCAL_RANK'

lmyybh · 2025-12-29T09:53:35Z

Can you fix the unit test?

  File "/__w/SpecForge/SpecForge/specforge/distributed.py", line 76, in init_distributed
    local_rank = int(os.environ["LOCAL_RANK"])
                     ~~~~~~~~~~^^^^^^^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'LOCAL_RANK'

Done. I will keep checking the unit test results and fix any errors that come up.
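For context, the KeyError comes from indexing os.environ directly when the launcher has not set LOCAL_RANK. The commit in this PR fixes it by setting LOCAL_RANK in the unit test script. A defensive alternative, shown here only as a sketch, is to fall back to a default; the function name resolve_local_rank and the default of 0 are assumptions and not part of the repository.

```python
import os
from typing import Mapping, Optional


def resolve_local_rank(env: Optional[Mapping[str, str]] = None, default: int = 0) -> int:
    """Return LOCAL_RANK as an int, or `default` when the variable is unset."""
    env = os.environ if env is None else env
    # .get() avoids the KeyError raised by env["LOCAL_RANK"] in plain
    # single-process runs such as unit tests.
    return int(env.get("LOCAL_RANK", default))
```

Failing loudly on a missing variable can also be the right choice in multi-GPU jobs, where silently defaulting every process to device 0 would hide a launcher misconfiguration.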

lmyybh added 2 commits December 26, 2025 14:38

[Clean] Fix warnings and abnormal outputs

054c8cf

- Update FSDP checkpoint API to avoid deprecation warnings
- Fix distributed init order to avoid device binding warnings
- Fix tqdm progress bar display on non-rank-0 processes
- Replace logger with print to avoid logging warnings
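The two PyTorch-facing changes in this commit can be sketched roughly as follows. This is not the code from the diff: the function names are hypothetical, the option values (full_state_dict, cpu_offload) are assumptions, and it presumes a PyTorch release recent enough to provide the device_id argument of init_process_group and torch.distributed.checkpoint.state_dict. The torch imports sit inside the functions so the module can be imported on machines without PyTorch.

```python
import os


def init_distributed_with_device(backend: str = "nccl") -> int:
    """Bind the CUDA device first, then pass it to init_process_group."""
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
    # Passing device_id tells the process group which device this rank
    # uses, which is the change described as suppressing the NCCL warnings.
    dist.init_process_group(backend=backend, device_id=device)
    return local_rank


def full_model_state_dict(model):
    """Collect a full state dict without the state_dict_type() context manager."""
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions,
        get_model_state_dict,
    )

    # Assumed options: gather the full (unsharded) weights and offload to CPU.
    options = StateDictOptions(full_state_dict=True, cpu_offload=True)
    return get_model_state_dict(model, options=options)
```

In a checkpointing step, the returned dictionary would typically be written with torch.save on rank 0 only.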

[Improve] Add eval logging and fix model state after evaluation

f9ee3c9

lmyybh requested review from FlamingoPg, shuaills and sleepcoo as code owners December 26, 2025 07:50

apply pre-commit

ae79f9b

Add LOCAL_RANK environment variable in the unit test script

56c3ece

lmyybh requested a review from FrankLeeeee as a code owner December 29, 2025 09:49
