Conversation

@gitlost-murali (Contributor) commented Nov 30, 2025

Summary

When tensor parallelism is enabled, the reference model's logits are sharded across GPUs on the vocabulary dimension. Previously, we called full_tensor() to gather the complete vocab on each GPU before computing log probabilities.

This PR adds compute_logprobs_parallel(), which computes log probabilities in a distributed fashion, using the log-sum-exp trick across the vocab shards.

Memory savings (measured)

| Scenario | Memory per GPU |
| --- | --- |
| Old (full_tensor + compute_logprobs) | 58 GB |
| New (parallel logprobs) | 34 GB |
| Saved | 24 GB (~41%) |

Old state usage:

[WandB memory chart: current]

New state (parallel logprobs based) usage:

[WandB memory chart: optimized]

Tested with batch=4, seq_len=9k (1024 prompt tokens + 8192 response tokens), vocab=150k, TP=2

Changes

  • New: src/forge/util/parallel_logprobs.py - distributed log-prob computation for vocab-sharded DTensors
  • New: tests/unit_tests/util/test_parallel_logprobs.py - correctness tests against sequential implementation
  • Modified: src/forge/actors/reference_model.py - uses parallel version when TP is enabled

Implementation

Uses a distributed log-softmax without gathering the full vocabulary (see the sketch after this list):

  1. All-reduce MAX for numerical stability
  2. All-reduce SUM of local exp(x - max)
  3. Each rank gathers logits only for tokens in its shard
  4. All-reduce SUM to combine (only owning rank contributes)
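
For illustration, here is a minimal sketch of these four steps using raw collectives. It assumes local_logits is this rank's vocab shard of shape [batch, seq, vocab_local], with equal-sized shards laid out in rank order, and target_ids holds global token ids replicated on every rank; the names and process-group handling are illustrative, not the exact code in parallel_logprobs.py.

import torch
import torch.distributed as dist

def compute_logprobs_parallel_sketch(local_logits, target_ids, tp_group=None):
    rank = dist.get_rank(tp_group)
    vocab_local = local_logits.size(-1)
    vocab_start = rank * vocab_local  # assumes equal shards in rank order

    # 1. All-reduce MAX for numerical stability
    global_max = local_logits.amax(dim=-1)
    dist.all_reduce(global_max, op=dist.ReduceOp.MAX, group=tp_group)

    # 2. All-reduce SUM of local exp(x - max) -> softmax denominator
    sum_exp = torch.exp(local_logits - global_max.unsqueeze(-1)).sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=tp_group)

    # 3. Each rank picks out logits only for target tokens that fall in its shard
    in_shard = (target_ids >= vocab_start) & (target_ids < vocab_start + vocab_local)
    local_ids = (target_ids - vocab_start).clamp(0, vocab_local - 1)
    target_logits = torch.gather(local_logits, -1, local_ids.unsqueeze(-1)).squeeze(-1)
    target_logits = torch.where(in_shard, target_logits, torch.zeros_like(target_logits))

    # 4. All-reduce SUM so every rank ends up with the target logit
    #    (only the owning rank contributed a non-zero value)
    dist.all_reduce(target_logits, op=dist.ReduceOp.SUM, group=tp_group)

    # log p(target) = x_target - max - log(sum exp(x - max))
    return target_logits - global_max - torch.log(sum_exp)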

Testing

  • Verified results match compute_logprobs() within 1e-5 tolerance
  • Tested temperature scaling, alignment modes, numerical stability with extreme values
  • Tested 2-way vocab sharded config

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Nov 30, 2025.
@gitlost-murali changed the title from "feat: Distributed log-prob computation for vocab-sharded reference model" to "feat: Optimize reference model GPU usage by distributed log-prob computation on vocab-sharded logits" on Nov 30, 2025.
@gitlost-murali changed the title to "feat: Reduce reference model memory usage with distributed log-probs comp" on Nov 30, 2025.
@gitlost-murali changed the title to "feat: Reduce reference model memory with parallel logprob computation" on Nov 30, 2025.
@joecummings (Member) left a comment:

I like this idea!

Could I ask for a few things?

  1. WandB logs that show the memory saved. This is always helpful as a part of verifying the correctness.
  2. Combine the parallel_logprobs and regular logprobs in the same file. No need to split that out just yet.
  3. Look for ways that this code could be simplified and/or factored out. Claude can be very verbose :)

Looking forward to getting this landed!

@gitlost-murali (Contributor, Author) commented:

Thanks for the review, @joecummings!

I refactored the code as per feedback. Less Claude footprint now :). Let me know if the code needs to be further simplified/refactored.

I attached wandb chart images in the description. Also attaching them here:

Old state usage:

[WandB memory chart: current]

New state (parallel logprobs based) usage:

[WandB memory chart: optimized]

@gitlost-murali (Contributor, Author) commented:

Hi @felipemello1,

The unit tests were failing because pytz was missing from the CI env. I rebased on main now; it looks like #618 (easy - remove pytz) takes care of this.

Ran the tests locally. All pass. Can you trigger the tests again please?

Thanks!

@felipemello1 (Contributor) commented Dec 3, 2025

@gitlost-murali, thanks for opening the PR. Great results!

Can you try to run the non-sharded version but compile F.cross_entropy? e.g.

@torch.compile()
def compute_logprobs(...):
    ...

I think that simply compiling it greatly reduces the memory, since it never materializes the intermediate activations. Maybe something to do in addition to your work and not in place of your work. I am skeptical about using the log-sum-exp directly and not F.cross_entropy, since the compiled version is highly optimized.
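
For reference, here is a minimal sketch of the compiled non-sharded path, assuming the logprob helper boils down to a cross-entropy over the full [batch, seq, vocab] logits (the actual forge helper may differ); as suggested above, compiling lets torch.compile fuse the log-softmax and gather so the full-size intermediates are not materialized the way they are in eager mode.

import torch
import torch.nn.functional as F

@torch.compile()
def compute_logprobs_sketch(logits, target_ids, temperature=1.0):
    # logits: full (non-sharded) [batch, seq, vocab] tensor; target_ids: [batch, seq]
    ce = F.cross_entropy(
        (logits / temperature).flatten(0, 1),  # [batch*seq, vocab]
        target_ids.flatten(),                  # [batch*seq]
        reduction="none",
    )
    # log p(target) = -cross_entropy, reshaped back to [batch, seq]
    return (-ce).view_as(target_ids)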

Also, you might be interested in checking Nathan's old PRs in torchtune: meta-pytorch/torchtune#2782

@gitlost-murali force-pushed the optimize-ref-model-usage branch from 78b4913 to da54044 on December 5, 2025, 11:25.
@gitlost-murali (Contributor, Author) commented Dec 5, 2025

Hi @felipemello1,

Thanks for the suggestion! torch.compile greatly reduced the memory usage.

  • Non-sharded version: 58 GB → 27 GB
  • Sharded version: 34 GB → 7 GB

[Screenshot: WandB memory comparison]

> I am skeptical about using the log-sum-exp directly and not F.cross_entropy, since the compiled version is highly optimized.

As the compiled version doesn't spike the memory, I agree with the skepticism.

Currently, the reference model handles around 9k seq-len. For a multi-turn setup, the seq-len would increase further. This is where we can benefit from the sharded version, as it avoids the all-gather (.full_tensor()). But if you think the sharded version (log-sum-exp) is overkill or mixes levels of abstraction, I am happy to reduce this PR to just adding the decorator on the current non-sharded version.

@felipemello1 (Contributor) commented Dec 5, 2025

Oh wow, better than I expected! Thanks for doing it.

> I am happy to reduce this PR to just adding the decorator on the current non-sharded version.

Here is what I am thinking; let me know if you agree:
(1) Yes, we should have a PR where we enable torch.compile on this function. A decorator is easy, but there is no way to disable it if for some reason we need to. The reference model already has a flag for compile. We can do this in post_init:

self.compute_log_probs = compute_logprobs
if compile:
    self.compute_log_probs = torch.compile(self.compute_log_probs)

Would you like to take this one?

(2) Regarding loss parallel, I am not super familiar with it, but it seems that distributed already has APIs for it for TP and context parallelism.

TP: https://github.com/meta-pytorch/torchtune/blob/67ab86b94de9e7ac7dd9850113ebe69e2bbd307c/torchtune/training/_distributed.py#L894

CP: https://github.com/meta-pytorch/torchtune/blob/67ab86b94de9e7ac7dd9850113ebe69e2bbd307c/torchtune/training/_distributed.py#L844

I think that using those would be more robust. Perhaps check if TorchTitan already does it. I will ask internally and get back to you.

@felipemello1 (Contributor) left a review comment:

comments above

@felipemello1 (Contributor) commented:

This is what I was told:

> Now in torchtitan, when TP is enabled, we keep the output sharded and use loss_parallel context to calculate loss using sharded logits: https://docs.pytorch.org/docs/stable/distributed.tensor.parallel.html#torch.distributed.tensor.parallel.loss_parallel

So it seems that it's easier than we thought :). Just don't call .full_tensor() before the F.cross_entropy.
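
A minimal sketch of that approach, assuming logits stays a DTensor sharded on the vocab (last) dimension and target_ids is a plain tensor of token ids; names are illustrative, not the exact reference_model.py code.

import torch
import torch.nn.functional as F
from torch.distributed.tensor.parallel import loss_parallel

def compute_logprobs_loss_parallel_sketch(logits, target_ids, temperature=1.0):
    # logits: DTensor [batch, seq, vocab] sharded on the vocab dim -- no .full_tensor()
    # target_ids: [batch, seq] token ids
    with loss_parallel():
        # under loss_parallel, cross_entropy consumes the vocab-sharded DTensor
        # directly, so the full vocabulary is never gathered onto a single GPU
        ce = F.cross_entropy(
            (logits / temperature).flatten(0, 1),  # [batch*seq, vocab], class dim sharded
            target_ids.flatten(),                  # [batch*seq]
            reduction="none",
        )
    # log p(target) = -cross_entropy, reshaped back to [batch, seq]
    return (-ce).view(target_ids.shape)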

@felipemello1 (Contributor) commented Dec 5, 2025

Another reply I got:

> CP loss computation is naturally parallelized, along seq dim. Similar for DP.
>
> For TP you can parallelize along vocab dimension, using loss_parallel context https://github.com/pytorch/torchtitan/blob/1168f9e4d58bbd91c07b08c382d1ca3ae4b2e02c/torchtitan/distributed/utils.py#L

I think we just need to call the loss under the loss_parallel context.

@gitlost-murali (Contributor, Author) commented Dec 6, 2025

Amazing! Thanks a lot for the pointers! This simplifies the code a lot, and I like that compile is configurable.

I updated the code to use the loss_parallel context manager. Here are the numbers for sharded computation with loss_parallel():

  • 39 GB with no-compile vs 7 GB with compile

[Screenshot: WandB memory comparison]

Btw, should we leave the compile.enable flag as false in the yml files, or change it to true?

  compile:
    enable: false

Overall summary (no-compile vs compile):

  • Non-sharded version: 58 GB vs 27 GB
  • Sharded version (custom sharding logic): 34 GB vs 7 GB
  • Sharded version (current version, with the loss_parallel() ctx manager): 39 GB vs 7 GB

@gitlost-murali (Contributor, Author) left a review comment:

Changes addressed

@felipemello1 (Contributor) commented:

Awesome, thanks @gitlost-murali. I will try to get to it between today and tomorrow.

@gitlost-murali force-pushed the optimize-ref-model-usage branch from 5fa2fc6 to f92b503 on December 8, 2025, 22:00.