@lmyybh

@lmyybh lmyybh commented Dec 26, 2025

Motivation

During training, I noticed several warnings and abnormal outputs cluttering the logs:

  1. FSDP deprecation warnings when saving checkpoints using the old state_dict_type() context manager
  2. NCCL initialization warnings due to incorrect device binding order in distributed setup
  3. print_on_rank0 not producing any output due to improper logger configuration
  4. Missing model state reset after evaluation, which could lead to unexpected behavior in subsequent training steps

Modifications

  1. Checkpoint Saving: Replace deprecated state_dict_type with get_model_state_dict
  2. Distributed Initialization: Pass device_id parameter to suppress NCCL warnings
  3. print_on_rank0: Replace logger.info() with print()
  4. Evaluation:
    • Call draft_model.train() after evaluation to ensure correct model state
    • Show progress bar only on rank 0
    • Print metrics in record_metrics only during evaluation
  5. Dependencies: Add tensorboard package
  6. Replace Optimizer with BF16Optimizer

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

- Update FSDP checkpoint API to avoid deprecation warnings
- Fix distributed init order to avoid device binding warnings
- Fix tqdm progress bar display on non-rank-0 processes
- Replace logger with print to avoid logging warnings
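
The rank-0 gating behind the last two checklist items can be sketched as follows. This is a minimal sketch under one assumption: the process rank is read from the `RANK` environment variable that launchers such as torchrun export. The actual `print_on_rank0` in this repository may determine the rank differently, and `is_rank0` is a hypothetical helper name.

```python
import os


def is_rank0(env=None):
    """Return True when RANK is 0 or unset (treated as a single-process run)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def print_on_rank0(message, env=None):
    """Print `message` on rank 0 only; return whether it was printed."""
    if is_rank0(env):
        # print() does not depend on logger handlers or levels being configured.
        print(message, flush=True)
        return True
    return False
```

The same predicate can hide the progress bar on other ranks through tqdm's `disable` argument, for example `tqdm(loader, disable=not is_rank0())`, where `loader` stands for whatever iterable is being evaluated.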

@FrankLeeeee
Collaborator

Hi, can you apply pre-commit for your code?

@lmyybh
Author

lmyybh commented Dec 26, 2025

Hi, can you apply pre-commit for your code?

Done

@sleepcoo
Collaborator

can u fix unit test?

  File "/__w/SpecForge/SpecForge/specforge/distributed.py", line 76, in init_distributed
    local_rank = int(os.environ["LOCAL_RANK"])
                     ~~~~~~~~~~^^^^^^^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'LOCAL_RANK'
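
The traceback shows `os.environ["LOCAL_RANK"]` raising because the variable is absent from the test environment, presumably because the test process was not started by a launcher that exports it. One possible way to make the lookup tolerant is sketched below; it is not necessarily the fix applied in this PR, and `get_local_rank` is a hypothetical helper name.

```python
import os


def get_local_rank(env=None, default=0):
    """Read LOCAL_RANK, falling back to `default` when it is not set."""
    env = os.environ if env is None else env
    # dict.get avoids the KeyError raised by env["LOCAL_RANK"] on a missing key.
    return int(env.get("LOCAL_RANK", default))
```

Defaulting to 0 assumes that a process without `LOCAL_RANK` is a single-device run; a stricter variant could raise a descriptive error instead.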

@lmyybh lmyybh requested a review from FrankLeeeee as a code owner December 29, 2025 09:49
@lmyybh
Author

lmyybh commented Dec 29, 2025

can u fix unit test?

  File "/__w/SpecForge/SpecForge/specforge/distributed.py", line 76, in init_distributed
    local_rank = int(os.environ["LOCAL_RANK"])
                     ~~~~~~~~~~^^^^^^^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'LOCAL_RANK'

Done. I will keep checking the unit test results and fix any errors.
