
Fix minimax type mismatch#49

Merged
yubofredwang merged 5 commits into main from ywang/fix-minimax-type-mismatch on Mar 20, 2026

Conversation

@yubofredwang
Collaborator

Fix minimax type mismatch

SGLang may load models in float16 (e.g. MiniMax-M2.5) while training
runs in bfloat16. Without an explicit cast, float16 bytes were stored
and later interpreted as bfloat16, silently corrupting training data.

Introduce HIDDEN_STATES_STORAGE_DTYPE as a single source of truth and
cast hidden_states/last_hidden_states/target in EagleMooncakeStore.put()
so both SGLang and vLLM paths are covered.
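The failure mode described above is reproducible without any ML framework: float16 and bfloat16 are both 16 bits wide but split them differently (1 sign / 5 exponent / 10 mantissa vs. 1 / 8 / 7), so reinterpreting the raw bytes changes the value. A minimal stdlib-only sketch (the helper names are illustrative, not from this PR):

```python
import struct

def f16_bits(x: float) -> int:
    """Encode x as IEEE float16 and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def bf16_to_float(bits: int) -> float:
    """Decode a 16-bit pattern as bfloat16, i.e. the top half of a float32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

bits = f16_bits(1.0)           # 0x3C00: float16 encoding of 1.0
wrong = bf16_to_float(bits)    # same bytes read back as bfloat16
print(hex(bits), wrong)        # 0x3c00 0.0078125
```

A stored 1.0 silently becomes 0.0078125 on read, which is exactly the kind of corruption the explicit cast in `put()` prevents.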
@yubofredwang yubofredwang marked this pull request as ready for review March 20, 2026 01:32
Copilot AI review requested due to automatic review settings March 20, 2026 01:32

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: faa46ef742

Contributor

Copilot AI left a comment

Pull request overview

This PR standardizes the dtype used for Eagle/Mooncake hidden-state storage to address dtype mismatches (“minimax type mismatch”) by introducing a canonical storage dtype and applying it across Mooncake put/get metadata in inference engines.

Changes:

  • Introduces HIDDEN_STATES_STORAGE_DTYPE (canonical hidden-state storage dtype) and casts tensors to it on EagleMooncakeStore.put().
  • Updates EagleMooncakeStore.get() defaults to use the canonical dtype.
  • Updates SGL and vLLM engines’ Mooncake metadata dtypes to reference the canonical dtype.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • torchspec/transfer/mooncake/eagle_store.py: adds the canonical dtype constant; casts tensors before writing; uses the canonical dtype as the default for reads.
  • torchspec/inference/engine/vllm_engine.py: uses the canonical dtype constant when reporting Mooncake tensor dtypes.
  • torchspec/inference/engine/sgl_engine.py: uses the canonical dtype constant when reporting Mooncake tensor dtypes.

The previous commit casts hidden states to bfloat16 inside
EagleMooncakeStore.put(), but the vLLM worker extension and HF runner
still reported the original pre-cast dtype in their metadata dicts.
Since the training-side data fetcher trusts that metadata to decode
Mooncake bytes, the mismatch would silently corrupt reads.

Both emitters now report HIDDEN_STATES_STORAGE_DTYPE so metadata and
stored bytes agree.

Make put() the single source of truth for both shapes and dtypes by
returning {"shapes": ..., "dtypes": ...} from the post-cast tensors.
Callers now use the store's return value instead of reading dtypes from
their own pre-cast local variables.

This eliminates the class of bugs where a producer emits metadata with
the wrong dtype because put() silently cast under the hood.
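The contract above can be sketched in a few lines. This is a simplified stand-in, not the real store: the tensor class below is a dummy in place of torch.Tensor, and the actual write of bytes to Mooncake is omitted. It only illustrates how deriving metadata from the post-cast tensors makes a dtype disagreement impossible:

```python
from dataclasses import dataclass

# Single source of truth for the stored dtype (name taken from the PR;
# the value is represented as a string here instead of a torch.dtype).
HIDDEN_STATES_STORAGE_DTYPE = "bfloat16"

@dataclass
class FakeTensor:
    """Stand-in for torch.Tensor: just a shape and a dtype."""
    shape: tuple
    dtype: str

    def to(self, dtype: str) -> "FakeTensor":
        return FakeTensor(self.shape, dtype)

class EagleMooncakeStore:
    def put(self, tensors: dict) -> dict:
        # Cast every tensor to the canonical storage dtype first...
        cast = {k: t.to(HIDDEN_STATES_STORAGE_DTYPE) for k, t in tensors.items()}
        # ...then derive metadata from the *post-cast* tensors, so the
        # metadata the caller emits can never disagree with the stored bytes.
        return {
            "shapes": {k: t.shape for k, t in cast.items()},
            "dtypes": {k: t.dtype for k, t in cast.items()},
        }

meta = EagleMooncakeStore().put(
    {"hidden_states": FakeTensor((4, 8), "float16")}
)
print(meta["dtypes"])  # {'hidden_states': 'bfloat16'}
```

Callers that previously built metadata from their own pre-cast locals would have reported "float16" here; using the return value instead removes that failure mode by construction.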
Copilot AI review requested due to automatic review settings March 20, 2026 01:42
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.



@yubofredwang yubofredwang merged commit 985f90d into main Mar 20, 2026
5 checks passed
@yubofredwang yubofredwang deleted the ywang/fix-minimax-type-mismatch branch March 20, 2026 01:51
zhubohao911 pushed a commit to zhubohao911/TorchSpec that referenced this pull request Mar 23, 2026