
@tanmaysachan tanmaysachan commented Jan 16, 2026

Addresses #865

  • Model outline ported from PyTorch -> JAX
  • Parity checks
  • Tests
  • Address drift
  • Train end to end

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the JAX implementation for the DeepseekV3 model. The implementation is comprehensive and covers the model's unique features like Multi-Head Latent Attention and Mixture of Experts with shared experts. The code is well-structured.

My review focuses on a critical bug that will prevent the model from running, along with some suggestions to improve maintainability by reducing code duplication and avoiding magic numbers. Addressing these points will make the implementation more robust and easier to maintain.

Comment on lines 527 to 543
# Precompute RoPE frequencies
# qk_rope_head_dim = config.qk_rope_head_dim
# original_seq_len = getattr(config, "original_seq_len", config.max_position_embeddings)
# rope_factor = getattr(config, "rope_factor", 1.0)
# beta_fast = getattr(config, "beta_fast", 32)
# beta_slow = getattr(config, "beta_slow", 1)

# TODO: Swap out like llama's rope?
# self.freqs_cis = precompute_freqs_cis(
#     dim=qk_rope_head_dim,
#     max_seq_len=config.max_position_embeddings,
#     rope_theta=config.rope_theta,
#     original_seq_len=original_seq_len,
#     rope_factor=rope_factor,
#     beta_fast=beta_fast,
#     beta_slow=beta_slow,
# )
Contributor

critical

This block for precomputing RoPE frequencies is commented out, but self.freqs_cis is used in DeepseekV3Model.__call__ at line 571. This will raise an AttributeError at runtime.

Looking at the DeepseekV3MLA implementation, the freqs_cis parameter is not used. Instead, apply_rope is called, which computes the frequencies on the fly.

To fix this, you should remove the freqs_cis parameter from the entire call chain, as it appears to be unused. This involves:

  1. Removing freqs_cis: jax.Array from the signature of DeepseekV3MLA.__call__.
  2. Removing freqs_cis: jax.Array from the signature of DeepseekV3DecoderLayer.__call__.
  3. Removing the freqs_cis=self.freqs_cis argument from the layer() call within DeepseekV3Model.__call__.

This will resolve the crash and align the code with the current apply_rope implementation. You can then address the TODO about swapping the RoPE implementation in a separate change.
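The failure mode and the suggested fix can be sketched in plain Python. This is only an illustration under assumptions: the class names below are hypothetical stand-ins, not the classes in this PR, and the arithmetic is a placeholder for the real attention math.

```python
class LayerWithStaleArg:
    """Hypothetical layer whose signature still asks for freqs_cis."""

    def __call__(self, x, positions, freqs_cis):
        # freqs_cis is accepted but never read; rotation terms would be
        # derived from positions inside the layer (placeholder math here).
        return [xi + pi for xi, pi in zip(x, positions)]


class BrokenModel:
    def __init__(self):
        self.layer = LayerWithStaleArg()
        # self.freqs_cis = precompute(...)  # left commented out, as in the diff

    def __call__(self, x, positions):
        # Evaluating self.freqs_cis fails before the layer is even called.
        return self.layer(x, positions, freqs_cis=self.freqs_cis)


class Layer:
    """Same layer with the unused parameter removed from the signature."""

    def __call__(self, x, positions):
        return [xi + pi for xi, pi in zip(x, positions)]


class FixedModel:
    def __init__(self):
        self.layer = Layer()

    def __call__(self, x, positions):
        return self.layer(x, positions)


try:
    BrokenModel()([1.0, 2.0], [0, 1])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

fixed_output = FixedModel()([1.0, 2.0], [0, 1])
```

The point of the sketch is that the attribute lookup fails at the call site, so dropping the argument from every signature in the chain removes the crash without touching how the rotary terms are computed.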

)

# Bias only for specific model sizes (7168 hidden_size in original)
self.use_bias = config.hidden_size == 7168
Contributor

medium

Hardcoding the magic number 7168 to determine self.use_bias makes the code brittle and less maintainable. If a new model variant is introduced that also requires this bias, this line would need to be updated. A better approach would be to introduce a dedicated boolean flag in the DeepseekV3Config, such as use_router_bias, to control this behavior explicitly.
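One possible shape for such a flag is sketched below. Everything here is an assumption for illustration: RouterConfig is a hypothetical stand-in for DeepseekV3Config, use_router_bias is the name proposed in this comment, and the fallback to the size check is an added design choice so that existing configs keep their current behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouterConfig:
    """Hypothetical stand-in for the relevant part of DeepseekV3Config."""

    hidden_size: int
    # None means "not specified", which triggers the legacy size-based rule.
    use_router_bias: Optional[bool] = None


def resolve_use_bias(config: RouterConfig) -> bool:
    """Prefer the explicit flag; fall back to the hidden_size check."""
    if config.use_router_bias is not None:
        return config.use_router_bias
    # Legacy rule from the diff: bias only when hidden_size is 7168.
    return config.hidden_size == 7168


legacy_large = resolve_use_bias(RouterConfig(hidden_size=7168))
legacy_small = resolve_use_bias(RouterConfig(hidden_size=2048))
explicit_on = resolve_use_bias(RouterConfig(hidden_size=2048, use_router_bias=True))
explicit_off = resolve_use_bias(RouterConfig(hidden_size=7168, use_router_bias=False))
```

With this layout a new variant only has to set the flag in its config, and the size comparison lives in a single helper that can be deleted once every config sets the flag explicitly.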

# Bias only for specific model sizes (7168 hidden_size in original)
self.use_bias = config.hidden_size == 7168
if self.use_bias:
    from tx.layers.util import Param
Contributor

medium

This import statement sits inside a conditional block within __init__. PEP 8 recommends placing imports at the top of the file, which keeps the module's dependencies visible in one place. A local import is usually reserved for breaking a circular import or deferring an expensive one. Unless one of those applies here, please move from tx.layers.util import Param to the top of the file with the other imports.

Comment on lines 413 to 415
class DeepseekV3SharedMLP(nnx.Module):
    """Always active shared experts."""

Contributor

medium

The DeepseekV3SharedMLP class is nearly identical to DeepseekV3MLP, with the only significant difference being the intermediate_size. This creates code duplication, which can make maintenance harder.

To improve this, consider refactoring them into a single, more generic MLP class (e.g., SwiGLU) that accepts intermediate_size as a parameter in its __init__ method. You could then instantiate this class with config.intermediate_size for the standard MLP and with the calculated shared_inter_dim for the shared MLP part.
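A rough sketch of the idea follows, written as plain JAX functions rather than nnx modules to keep it self-contained. The function names, weight names, and sizes are assumptions for illustration and are not taken from the PR.

```python
import jax
import jax.numpy as jnp


def init_gated_mlp(key, hidden_size, intermediate_size, scale=0.02):
    """Create gate/up/down weights for one SwiGLU-style MLP."""
    k_gate, k_up, k_down = jax.random.split(key, 3)
    return {
        "gate_proj": scale * jax.random.normal(k_gate, (hidden_size, intermediate_size)),
        "up_proj": scale * jax.random.normal(k_up, (hidden_size, intermediate_size)),
        "down_proj": scale * jax.random.normal(k_down, (intermediate_size, hidden_size)),
    }


def gated_mlp(params, x):
    """Compute down(silu(gate(x)) * up(x)) for any intermediate width."""
    gate = jax.nn.silu(x @ params["gate_proj"])
    up = x @ params["up_proj"]
    return (gate * up) @ params["down_proj"]


# Hypothetical sizes, chosen only to keep the example small.
hidden_size = 8
intermediate_size = 32  # stands in for config.intermediate_size
shared_inter_dim = 16   # stands in for the computed shared-expert width

key_dense, key_shared, key_x = jax.random.split(jax.random.PRNGKey(0), 3)
dense_params = init_gated_mlp(key_dense, hidden_size, intermediate_size)
shared_params = init_gated_mlp(key_shared, hidden_size, shared_inter_dim)

x = jax.random.normal(key_x, (2, 4, hidden_size))  # (batch, seq, hidden)
dense_out = gated_mlp(dense_params, x)
shared_out = gated_mlp(shared_params, x)
```

Both MLPs go through the same code path and differ only in the width passed at construction, which is the property the refactor is after. In the PR itself the same parameterization would presumably live in the __init__ of a single nnx.Module.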

@pcmoritz pcmoritz added the tx label Jan 17, 2026
@tanmaysachan
Author

tanmaysachan commented Jan 18, 2026

@pcmoritz The PR is open for reviews now

I've added a TODO in the first test case: there seems to be some drift that requires an absolute tolerance of about 6e-3 for the tests to pass. I'll investigate a little more; nothing has caught my eye so far.

@tanmaysachan
Author

Fixed the source of the drift: there was a mismatch in the default config.
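For anyone chasing a similar drift, a field-by-field comparison of the two configs is a cheap first check. The sketch below is hypothetical throughout: the thread does not say which field was mismatched, and the keys and values are made up for illustration.

```python
def diff_configs(reference, candidate):
    """Map each differing key to (reference_value, candidate_value).

    A key missing from one side shows up with None on that side.
    """
    mismatches = {}
    for key in sorted(set(reference) | set(candidate)):
        ref_value = reference.get(key)
        cand_value = candidate.get(key)
        if ref_value != cand_value:
            mismatches[key] = (ref_value, cand_value)
    return mismatches


# Hypothetical values, not the actual configs from this PR.
reference_config = {
    "hidden_size": 64,
    "rope_theta": 10000.0,
    "max_position_embeddings": 4096,
}
candidate_config = {
    "hidden_size": 64,
    "rope_theta": 1000000.0,
    "max_position_embeddings": 4096,
}

mismatches = diff_configs(reference_config, candidate_config)
```

Running a check like this against the reference model's config before comparing activations can rule out default mismatches early.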

@pcmoritz
Collaborator

This is awesome! Have you already gotten some end-to-end training working with it? It would be great to add one to https://github.com/NovaSky-AI/SkyRL/blob/main/skyrl-tx/README.md. If you haven't, I'm also more than happy to help with it :)

@tanmaysachan
Author

tanmaysachan commented Jan 21, 2026

Looks like the tests are failing, but I can't reproduce the failures on my machine.
Having a look.

Some Qwen tests also seem to be failing. Is this expected?

I haven't been able to train end-to-end yet; I'll give it a shot over the weekend, along with any further fixes required. Added a task for it in the PR description.

@tanmaysachan
Author

tanmaysachan commented Jan 23, 2026

Root cause of the failing tests: Hugging Face outputs are not consistent between macOS and Ubuntu, where PyTorch uses different BLAS backends (Accelerate vs MKL).

Linux

OS: Linux 6.8.0-90-generic
Machine: x86_64
Python: 3.12.12
PyTorch: 2.10.0+cu128
PyTorch BLAS: mkl
CUDA available: False
Transformers: 4.57.6

DEEPSEEK V3 TEST

HF hidden_states[-1] first 10 values (sample 0, pos 0):
[-0.05490041896700859, -0.6639361381530762, -0.4137983024120331, 0.19858041405677795, 0.4002900719642639, -1.8006019592285156, -0.7636783123016357, -0.6883448958396912, 0.39694416522979736, 2.5040738582611084]

macOS

OS: Darwin 25.2.0
Machine: arm64
PyTorch BLAS: accelerate
Transformers: 4.57.6

DEEPSEEK V3 TEST
HF hidden_states[-1] first 10 values (sample 0, pos 0):
[-0.0496, -0.6667, -0.4240, 0.1903, 0.4095, -1.8056, -0.7479, -0.6778, 0.3872, 2.5022]

Bumping the test thresholds to account for this.
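To put a number on the gap, the two lists posted above can be compared directly. This is a small stdlib-only sketch: the values are copied from this comment, the macOS list is already rounded to four decimals (so the differences carry up to about 5e-5 of rounding error), and within_atol is a hypothetical helper rather than anything in the test suite.

```python
linux_mkl = [
    -0.05490041896700859, -0.6639361381530762, -0.4137983024120331,
    0.19858041405677795, 0.4002900719642639, -1.8006019592285156,
    -0.7636783123016357, -0.6883448958396912, 0.39694416522979736,
    2.5040738582611084,
]
macos_accelerate = [
    -0.0496, -0.6667, -0.4240, 0.1903, 0.4095,
    -1.8056, -0.7479, -0.6778, 0.3872, 2.5022,
]

# Element-wise absolute difference between the two platforms.
abs_diffs = [abs(a - b) for a, b in zip(linux_mkl, macos_accelerate)]
max_diff = max(abs_diffs)
worst_index = abs_diffs.index(max_diff)


def within_atol(a, b, atol):
    """True if every pair of elements differs by at most atol."""
    return all(abs(x - y) <= atol for x, y in zip(a, b))


passes_at_6e3 = within_atol(linux_mkl, macos_accelerate, 6e-3)
passes_at_2e2 = within_atol(linux_mkl, macos_accelerate, 2e-2)
print(f"max abs diff {max_diff:.4f} at index {worst_index}")
```

On these ten values the largest gap is roughly 1.6e-2, so a reference captured on one platform cannot be matched on the other at the earlier 6e-3 tolerance. Only the first ten entries of one position are shown, so the gap over the full tensor may be larger.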

@vercel

vercel bot commented Jan 25, 2026

@tanmaysachan is attempting to deploy a commit to the Tyler's projects Team on Vercel.

A member of the Team first needs to authorize it.

- Add LogitsProcessorMixin to DeepseekV3ForCausalLM
- Add get_lm_head() method for logits computation
- Fix broken compute_positions import
- Fix init_lora_adapter to handle n_routed_experts attribute
- Add test_deepseekv3_lora_training.py with MoE rank normalization tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tanmaysachan
Author

tanmaysachan commented Jan 25, 2026

Accidental deployment attempt due to rebase ^

@tanmaysachan
Author

tanmaysachan commented Jan 25, 2026

End-to-end training was successful on an A100.

/api/v1/healthz -> {"status":"ok"}

Added GPU tests (they need Anyscale credentials to run).

GPU tests on A100:

(skyrl-tx) (main) root@C.30490604:/workspace/SkyRL/skyrl-tx$ uv run --extra gpu python -m pytest tests/models/test_deepseekv3_lora_training.py -v
======================================================================================= test session starts ========================================================================================
platform linux -- Python 3.11.14, pytest-9.0.2, pluggy-1.6.0 -- /workspace/SkyRL/skyrl-tx/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/SkyRL/skyrl-tx
configfile: pyproject.toml
plugins: anyio-4.12.1
collected 2 items

tests/models/test_deepseekv3_lora_training.py::test_lora_training_moe_rank_normalized PASSED [ 50%]
tests/models/test_deepseekv3_lora_training.py::test_lora_training_high_rank PASSED [100%]
