Jk moe #172

Merged
kuba-krj merged 16 commits into main from jk_moe on Apr 14, 2026
Conversation

@kuba-krj
Contributor

@kuba-krj kuba-krj commented Mar 26, 2026

Adding MoE to our codebase, written with the assistance of Codex.

MFU

MFU calculated on 1 GPU is not great: ~8% with the following settings:

  dmodel: 1024
  dff: 2816
  dhead: 64
  n_blocks: 16
  q_heads: 16
  kv_heads: 16

  num_experts: 16
  num_experts_per_tok: 1
  capacity_factor: 1.25

with batch size=32 (the largest that fits on 1 GPU) and seq_len=1024. This is ~2x slower than a dense model with the same number of active params, trained with batch size=64 (also the largest that fits). MFU may be better on multi-GPU thanks to the larger batch size we can use, but those experiments are still waiting in the queue; I will update when they finish.
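For context, a rough sketch of how an MFU number like this can be estimated from the settings above, using the common 6 * active_params * tokens approximation for training FLOPs. The tokens/s and peak-FLOPs figures below are placeholders, not measurements from this PR, and the param count ignores embeddings and norms:

```python
# Rough MFU estimate: achieved training FLOPs per second vs. GPU peak.
# Placeholder throughput/peak numbers; only the parameter arithmetic
# comes from the settings in this PR.

def transformer_active_params(dmodel, dff, n_blocks, q_heads, kv_heads, dhead,
                              experts_per_tok=1):
    # Attention: Q and output projections sized by q_heads, K/V by kv_heads.
    attn = n_blocks * dmodel * dhead * (2 * q_heads + 2 * kv_heads)
    # SwiGLU FF: three dmodel x dff matrices; in the MoE case only
    # experts_per_tok experts are active per token.
    ff = n_blocks * 3 * dmodel * dff * experts_per_tok
    return attn + ff

def mfu(tokens_per_sec, active_params, peak_flops):
    return 6 * active_params * tokens_per_sec / peak_flops

params = transformer_active_params(
    dmodel=1024, dff=2816, n_blocks=16, q_heads=16, kv_heads=16, dhead=64)
# e.g. with an assumed 10k tokens/s on a 312 TFLOP/s (bf16) A100:
print(f"active params: {params / 1e6:.0f}M, MFU: {mfu(10_000, params, 312e12):.1%}")
```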

Correctness

I compared the dense model (settings as above) with MoE for E = {1, 2, 4, 16}. The results look reasonable: E=1 matches dense, and models improve with more experts.

Link to verification experiments: wandb project

@kuba-krj kuba-krj marked this pull request as ready for review March 28, 2026 19:14
@kuba-krj kuba-krj requested a review from j321m March 28, 2026 19:14
Collaborator

@j321m j321m left a comment


Claude-assisted review.

I'd like to see a multi-GPU run (smoke test).

Please address the comments.

Collaborator


remove this file from PR

Collaborator


remove / rename

Contributor Author


Renamed

Comment thread src/core/moe.py Outdated
moe_router_z_loss_factor: float = 0.0,
activation_function: str = "swiglu",
init_scale: float = 1.0,
**_ignored_kwargs,
Collaborator


is **_ignored_kwargs necessary?

Contributor Author


As far as I understand our config system, it's necessary in order to keep MoE configs similar to how it's done in small_moe.yaml, where we just set

ff_layer_fn:
        _target_: src.core.moe.MoE

to use MoE (because we keep ff_layer_fn from the base config and only replace `_target_` in it). Please let me know if you'd prefer to change the config structure to something like `override /ff_layer@model.encoder.block_fn.ff_layer_fn: moe` - we could then get rid of **_ignored_kwargs.

Collaborator


I think it's better to have separate base yamls for dense and MoE than **kwargs.

Collaborator


override /ff_layer@model.encoder.block_fn.ff_layer_fn: moe
is also a good idea (even better, but may need some config refactoring)
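To make the two options concrete, a hypothetical sketch of what the override-based layout could look like (file paths and config-group names here are assumptions about the repo's hydra layout, not its actual structure):

```yaml
# configs/ff_layer/moe.yaml (hypothetical path)
_target_: src.core.moe.MoE
num_experts: 16
num_experts_per_tok: 1
capacity_factor: 1.25

# experiment config: swap the dense FF layer for MoE via a group override,
# so MoE no longer needs **_ignored_kwargs to swallow dense-only params
defaults:
  - override /ff_layer@model.encoder.block_fn.ff_layer_fn: moe
```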

Comment thread src/core/moe.py
router_logits = router_logits.to(dtype=torch.float32)
router_probs = F.softmax(router_logits, dim=-1)
# For each token, keep only the top-k experts and their routing probabilities
topk_probs, selected_experts = torch.topk(
Collaborator


question: should the routing weights sum to 1, when num_experts_per_tok > 1?

Contributor Author


I added the option to normalize
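The question above comes down to a numeric fact: after a softmax over all experts, the top-k probabilities alone do not sum to 1, so some implementations renormalize them. A pure-Python sketch of the idea (the actual module uses torch.softmax/torch.topk; this is just an illustration, not the repo's code):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_route(logits, k, normalize=False):
    # Pick the k most probable experts and their routing weights;
    # optionally renormalize the weights so they sum to 1.
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    experts = ranked[:k]
    weights = [probs[i] for i in experts]
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return experts, weights

_, w = topk_route([2.0, 1.0, 0.5, -1.0], k=2)
print(sum(w))   # less than 1 without normalization
_, w = topk_route([2.0, 1.0, 0.5, -1.0], k=2, normalize=True)
print(sum(w))   # sums to 1 after normalization
```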

Comment thread src/core/trainer.py Outdated
@@ -228,11 +312,19 @@ def _update_processed_tokens(self, batch):

def log_metrics(self, loss, grad_norm):
Collaborator


The log_metrics function ignores its loss argument. Also, I'm not sure I like self._last_reported_loss, self._last_moe_router_z_loss, etc.; it makes the code more error-prone in my opinion, but I'm open to discussion.

Contributor Author


Refactored

Comment thread src/core/trainer.py Outdated
self.metric_logger.set_tokens(self.processed_tokens)
self.metric_logger.log("train/loss", loss.item())
self.metric_logger.log("train/loss", self._last_reported_loss.item())
self.metric_logger.log(
Collaborator


MoE metrics will get logged even for dense models, do we want that?

Contributor Author


Removed MoE metric logging for dense.

Comment thread src/core/trainer.py Outdated
Collaborator


The eval function calls self.calculate_loss(batch), which overwrites self._last_reported_loss, resulting in the same eval and train loss.

It only works now because log_metrics gets called after eval.

Contributor Author


Refactored

Comment thread src/core/moe.py Outdated
self.moe_load_balancing_loss_factor = moe_load_balancing_loss_factor
self.moe_router_z_loss_factor = moe_router_z_loss_factor
self.is_moe = True
self.aux_loss = None
Collaborator


what is aux_loss for? it is set to the same value as load_balancing_loss

Contributor Author


removed
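For reference on the two auxiliary losses discussed in this thread, these are the standard formulations (Switch-Transformer-style load balancing, ST-MoE-style router z-loss). The pure-Python sketch below is a generic illustration, not necessarily identical to what src/core/moe.py computes:

```python
import math

def load_balancing_loss(router_probs, selected_experts, num_experts):
    # num_experts * sum_e f_e * P_e, where f_e is the fraction of tokens
    # routed to expert e and P_e is the mean router probability for e.
    # Minimized (value 1.0) when routing is perfectly uniform.
    n_tokens = len(router_probs)
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for probs, expert in zip(router_probs, selected_experts):
        f[expert] += 1.0 / n_tokens
        for e in range(num_experts):
            p[e] += probs[e] / n_tokens
    return num_experts * sum(fe * pe for fe, pe in zip(f, p))

def router_z_loss(router_logits):
    # Mean over tokens of logsumexp(logits)^2; discourages the router
    # from producing very large logits.
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse * lse
    return total / len(router_logits)
```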

@kuba-krj
Contributor Author

kuba-krj commented Apr 2, 2026

I'd like to see multigpu run (smoke test)

Link to the multi-GPU run: wandb

Contributor Author

@kuba-krj kuba-krj left a comment


Added the changes and ran a test identical to the previous 2-GPU run to check that the results are unchanged: wandb link. Please let me know if the PR looks good now or if additional changes are needed.

@j321m
Collaborator

j321m commented Apr 10, 2026

Removing **kwargs is very important; separate dense and MoE config lines should solve the problem.

Collaborator

@j321m j321m left a comment


awesome

@kuba-krj kuba-krj merged commit d12c8f0 into main Apr 14, 2026
1 check passed