Your question
Ask a clear and concise question about Megatron-LM. Tag @mcore-oncall
to bring this issue to the oncall's attention.
Hello, thank you for the excellent work on this project.
I noticed that in https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/experimental_attention_variant/dsa.py#L242, an epsilon value of 1e-10 is added to both P and Q when computing the KL divergence. I understand this likely serves as a safeguard against NaN values arising from numerical instability, which is a reasonable concern.
However, I wanted to point out that log(P + ε) - log(Q + ε) is not mathematically equivalent to log(P) - log(Q). Specifically:
$$\log\frac{P + \varepsilon}{Q + \varepsilon} \neq \log\frac{P}{Q}$$
I would be grateful if the maintainers could share any insight on the following:
- Has the impact of this approximation on training dynamics and convergence been evaluated?
- Were alternative approaches considered, such as clamping P and Q via torch.clamp before taking the logarithm?
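To illustrate the difference between the two safeguards, here is a minimal sketch. It uses NumPy and toy probability vectors rather than the actual Megatron-LM code, and the variable names are my own; it only demonstrates that adding ε to both P and Q yields a slightly different value than clamping them away from zero:

```python
import numpy as np

eps = 1e-10

# Toy distributions; Q has a near-zero entry that would otherwise
# make log(Q) overflow to -inf.
P = np.array([0.5, 0.5])
Q = np.array([1.0 - 1e-12, 1e-12])

# Variant 1: add epsilon to both P and Q inside the logs,
# i.e. sum P * (log(P + eps) - log(Q + eps)).
kl_eps = np.sum(P * (np.log(P + eps) - np.log(Q + eps)))

# Variant 2: clamp P and Q to a floor of eps, leaving entries
# far from zero untouched, so their log-ratio stays exact.
P_c = np.clip(P, eps, None)
Q_c = np.clip(Q, eps, None)
kl_clamp = np.sum(P * (np.log(P_c) - np.log(Q_c)))

# Both variants are finite, but they disagree wherever Q is
# comparable to eps, since log((P+eps)/(Q+eps)) != log(P/Q).
print(kl_eps, kl_clamp)
```

The clamped variant changes only the entries that actually fall below the floor, while the additive-ε variant perturbs the log-ratio of every entry, which is what the non-equivalence above is pointing at.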