fsdp: improve state saving determinism and tree reconstruction #3407

Open
coder-jayp wants to merge 1 commit into ml-explore:main from coder-jayp:fsdp-saving-state-improvement
@coder-jayp commented Apr 13, 2026

Addresses #3401.

Changes:

  • Made parameter grouping deterministic by sorting the keys returned from tree_flatten(). This fixes the non-deterministic grouping that could cause optimizer state to be loaded incorrectly.
  • Added an intermediate step that reconstructs the original parameter tree before calling optimizer.apply_gradients, so optimizer state no longer depends on the number of communication groups or on communication_size.

This change improves the reliability of saving and loading FSDP optimizer states while preserving the performance benefits of grouped reduce-scatter / all-gather operations.
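
For readers following along, the two changes listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `flatten`, `unflatten`, `group_by_size`, and `max_group_size` are hypothetical stand-ins (the real code works on MLX arrays and uses `tree_flatten()`), and lists of floats stand in for gradient arrays.

```python
# Hypothetical stand-ins for illustration only; not the PR's implementation.

def flatten(tree, prefix=""):
    """Flatten a nested dict into (dotted_key, value) pairs in insertion order."""
    items = []
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            items.extend(flatten(value, path))
        else:
            items.append((path, value))
    return items


def unflatten(items):
    """Rebuild a nested dict from (dotted_key, value) pairs."""
    tree = {}
    for path, value in items:
        *parents, leaf = path.split(".")
        node = tree
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return tree


def group_by_size(items, max_group_size):
    """Sort pairs by key, then greedily pack them into communication groups."""
    groups, current, current_size = [], [], 0
    for key, value in sorted(items, key=lambda kv: kv[0]):
        size = len(value)
        if current and current_size + size > max_group_size:
            groups.append(current)
            current, current_size = [], 0
        current.append((key, value))
        current_size += size
    if current:
        groups.append(current)
    return groups


# The same gradients, built with two different key insertion orders.
grads = {"layer2": {"weight": [0.1] * 4, "bias": [0.2] * 2},
         "layer1": {"weight": [0.3] * 4}}
reordered = {"layer1": {"weight": [0.3] * 4},
             "layer2": {"bias": [0.2] * 2, "weight": [0.1] * 4}}

groups_a = group_by_size(flatten(grads), max_group_size=6)
groups_b = group_by_size(flatten(reordered), max_group_size=6)

# Sorting makes the grouping independent of insertion order.
same_grouping = ([[k for k, _ in g] for g in groups_a]
                 == [[k for k, _ in g] for g in groups_b])

# Rebuilding the tree before the optimizer step gives the same structure
# no matter how many groups were used for communication.
rebuilt_small = unflatten([kv for g in groups_a for kv in g])
rebuilt_large = unflatten(
    [kv for g in group_by_size(flatten(grads), max_group_size=100) for kv in g]
)
```

In this sketch the optimizer would receive `rebuilt_small` or `rebuilt_large`, which have the same structure as `grads`, so any per-parameter state keyed on that structure would not change when the group size changes.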

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@nastya236 nastya236 self-requested a review April 13, 2026 18:41
@coder-jayp changed the title from "fsdp: improve state saving determinism and tree reconstruction (#3401)" to "fsdp: improve state saving determinism and tree reconstruction" Apr 13, 2026
@coder-jayp
Author

Hi @nastya236, it would be great to get some feedback. Thanks.

@nastya236
Collaborator

Thanks for your contribution! Will look at it as soon as possible.

@coder-jayp
Author

Hi @nastya236,

Just a quick follow-up on this PR. Happy to make any changes if needed. Thanks.
