[QUESTION] KL Divergence Computation in DSA Uses a Mathematically Non-Equivalent Approximation #4055

@umiswing

Description

Your question
Ask a clear and concise question about Megatron-LM. Tag @mcore-oncall to get the oncall's attention on this issue.

Hello, thank you for the excellent work on this project.

I noticed that in https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/experimental_attention_variant/dsa.py#L242, an epsilon of 1e-10 is added to both P and Q when computing the KL divergence. I understand this likely serves as a safeguard against numerical instability (e.g., log(0) yielding -inf, which can propagate to NaN), which is a reasonable concern.
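
For context, the pattern in question is roughly the following (my own paraphrase for illustration, not the actual dsa.py code; the variable names are placeholders):

```python
import torch

eps = 1e-10
p = torch.softmax(torch.randn(4), dim=-1)  # placeholder for the "P" distribution
q = torch.softmax(torch.randn(4), dim=-1)  # placeholder for the "Q" distribution

# Elementwise KL(P || Q) term, with eps added inside both logarithms:
kl_term = p * (torch.log(p + eps) - torch.log(q + eps))
```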

However, I wanted to point out that log(P + ε) - log(Q + ε) is not mathematically equivalent to log(P) - log(Q). Specifically:

$$\log\frac{P + \varepsilon}{Q + \varepsilon} \neq \log\frac{P}{Q}$$
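
To make the discrepancy concrete, here is a small numerical illustration (values chosen purely for demonstration): whenever P or Q is comparable to or smaller than ε, the shifted log-ratio can differ from the exact one by a large margin, while values well above ε are essentially unaffected.

```python
import torch

eps = 1e-10
p = torch.tensor([1e-11, 0.5])
q = torch.tensor([1e-9, 0.25])

exact = torch.log(p) - torch.log(q)                # log(P/Q)
shifted = torch.log(p + eps) - torch.log(q + eps)  # log((P+eps)/(Q+eps))

print(exact)    # ≈ tensor([-4.6052, 0.6931])
print(shifted)  # ≈ tensor([-2.3026, 0.6931])  -- the small-probability term is off by ~2x
```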

I would be grateful if the maintainers could share any insight on the following:

  1. Has the impact of this approximation on training dynamics and convergence been evaluated?
  2. Were alternative approaches considered, such as clamping P and Q via torch.clamp prior to the logarithm? (A sketch of what I have in mind follows below.)
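
For reference, question 2 has in mind something like the following (a minimal sketch under my own naming, not a proposed patch): clamping only floors values below `min_prob`, so log(P) - log(Q) stays exact whenever both probabilities exceed the floor, while log(0) is still avoided.

```python
import torch

def kl_term_clamped(p: torch.Tensor, q: torch.Tensor, min_prob: float = 1e-10) -> torch.Tensor:
    """Elementwise contribution to KL(P || Q); sum over the distribution
    dimension to obtain the full divergence."""
    p_safe = torch.clamp(p, min=min_prob)
    q_safe = torch.clamp(q, min=min_prob)
    return p * (torch.log(p_safe) - torch.log(q_safe))
```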
