[QUESTION] KL Divergence Computation in DSA Uses a Mathematically Non-Equivalent Approximation #4055

@umiswing

Description

Your question
Ask a clear and concise question about Megatron-LM. Tag @mcore-oncall to get the oncall's attention on this issue.

Hello, thank you for the excellent work on this project.

I noticed that in https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/experimental_attention_variant/dsa.py#L242, an epsilon of 1e-10 is added to both P and Q when computing the KL divergence. I understand this likely serves as a safeguard against numerical instability (e.g., log(0) yielding -inf, which can propagate to NaN), which is a reasonable concern.
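
For context, the pattern in question is roughly the following (my own paraphrase for illustration, not the actual dsa.py code; the variable names are placeholders):

```python
import torch

eps = 1e-10
p = torch.softmax(torch.randn(4), dim=-1)  # placeholder for the "P" distribution
q = torch.softmax(torch.randn(4), dim=-1)  # placeholder for the "Q" distribution

# Elementwise KL(P || Q) term, with eps added inside both logarithms:
kl_term = p * (torch.log(p + eps) - torch.log(q + eps))
```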

However, I wanted to point out that log(P + ε) - log(Q + ε) is not mathematically equivalent to log(P) - log(Q). Specifically:

$$\log\frac{P + \varepsilon}{Q + \varepsilon} \neq \log\frac{P}{Q}$$
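
To make the discrepancy concrete, here is a small numerical illustration (values chosen purely for demonstration): whenever P or Q is comparable to or smaller than ε, the shifted log-ratio can differ from the exact one by a large margin, while values well above ε are essentially unaffected.

```python
import torch

eps = 1e-10
p = torch.tensor([1e-11, 0.5])
q = torch.tensor([1e-9, 0.25])

exact = torch.log(p) - torch.log(q)                # log(P/Q)
shifted = torch.log(p + eps) - torch.log(q + eps)  # log((P+eps)/(Q+eps))

print(exact)    # ≈ tensor([-4.6052, 0.6931])
print(shifted)  # ≈ tensor([-2.3026, 0.6931])  -- the small-probability term is off by ~2x
```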

I would be grateful if the maintainers could share any insight on the following:

  1. Has the impact of this approximation on training dynamics and convergence been evaluated?
  2. Were alternative approaches considered, such as clamping P and Q via torch.clamp prior to the logarithm? (A sketch of what I have in mind follows below.)
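
For reference, question 2 has in mind something like the following (a minimal sketch under my own naming, not a proposed patch): clamping only floors values below `min_prob`, so log(P) - log(Q) stays exact whenever both probabilities exceed the floor, while log(0) is still avoided.

```python
import torch

def kl_term_clamped(p: torch.Tensor, q: torch.Tensor, min_prob: float = 1e-10) -> torch.Tensor:
    """Elementwise contribution to KL(P || Q); sum over the distribution
    dimension to obtain the full divergence."""
    p_safe = torch.clamp(p, min=min_prob)
    q_safe = torch.clamp(q, min=min_prob)
    return p * (torch.log(p_safe) - torch.log(q_safe))
```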
