Hi, thanks for your great work.
I noticed that in the `MixedAttention` function, the following code first computes the self-attention of each query (`q`) within its corresponding chunk:
```python
# self attn
_, _, _, _, self_attn_out_sh, self_attn_lse_hs, _, _ = (
    _flash_attn_varlen_forward(
        q=q,
        k=k,
        v=v,
        cu_seqlens_q=self_attn_cu_seqlen,
        cu_seqlens_k=self_attn_cu_seqlen,
        max_seqlen_q=max_seqlen,
        max_seqlen_k=max_seqlen,
        softmax_scale=softmax_scale,
        causal=True,
        dropout_p=0.0,
    )
)
```
However, `max_seqlen` is clearly larger than the longest chunk described by `self_attn_cu_seqlen` (i.e., the maximum difference between consecutive cumulative offsets).
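To make the concern concrete, here is a minimal plain-Python sketch (with hypothetical chunk sizes, not taken from MoBA) of the difference between the tight per-chunk bound and a looser total-length bound. As far as I understand, flash-attn uses `max_seqlen` mainly to size the kernel launch grid, so a loose bound should cost scheduling overhead rather than correctness, but I'd like to confirm:

```python
# Hypothetical cu_seqlens describing three chunks of lengths 3, 5, and 2,
# in the cumulative-offset format that flash-attn varlen kernels expect.
cu_seqlens = [0, 3, 8, 10]

# The tightest valid max_seqlen is the largest gap between consecutive
# offsets (here 5), not the final cumulative value.
tight_max_seqlen = max(b - a for a, b in zip(cu_seqlens, cu_seqlens[1:]))

# A looser bound such as the total sequence length (here 10) still covers
# every chunk, so it should remain numerically valid; the question is
# whether it wastes work in the kernel.
loose_max_seqlen = cu_seqlens[-1]
```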
I would like to know whether this leads to any potential issues, such as reduced computational efficiency or unintended behavior in the attention computation.
@hewr2010 @whitelez @xptree
Code reference: `MoBA/moba/moba_efficient.py`, line 96 at `b5d5836`.