Summary
Request implementation of Dual Chunk Attention (DCA), the technique used by Qwen 2/2.5 for efficient long-context training and inference on 100K+ token sequences.
Motivation
Training on sequences longer than 100K tokens is essential for document understanding, code generation, and video models, and Qwen 2.5 demonstrates that DCA enables 128K context with sub-quadratic memory. Existing options fall short:
- Quadratic attention cost - standard attention scales O(n²), making 128K+ sequences impractical even with FlashAttention
- CP doesn't reduce attention complexity - Context Parallelism distributes the sequence across devices but still computes full attention
- YaRN/ABF are position-encoding only - they extend the positional range but don't address attention memory
DCA reduces memory from O(n²) to O(n·c), where c is the chunk size, by combining local intra-chunk attention with global inter-chunk attention.
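To make the intra-/inter-chunk split concrete, here is a minimal numpy sketch (not Qwen's or Megatron's implementation; function and variable names are illustrative). Each query chunk computes a causally masked local score block against its own chunk plus a global score block against all earlier chunks, so the peak score-matrix footprint is O(c·n) per step rather than one O(n²) matrix:

```python
import numpy as np

def dual_chunk_attention(q, k, v, chunk):
    """Sketch of chunked causal attention: intra-chunk (local, masked)
    plus inter-chunk (global, all earlier chunks) score blocks.
    Illustrative only -- single head, no position remapping."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty_like(v)
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        qc = q[s:e]
        # intra-chunk: causal attention within the current chunk
        intra = qc @ k[s:e].T * scale                   # (c, c)
        intra[np.triu(np.ones((e - s, e - s), bool), 1)] = -np.inf
        # inter-chunk: every query attends to all earlier chunks
        inter = qc @ k[:s].T * scale                    # (c, s)
        scores = np.concatenate([inter, intra], axis=1) # (c, e)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        out[s:e] = w @ v[:e]
    return out
```

Because the chunkwise computation is exact, the output matches full causal attention; the saving is in peak activation memory, not in the result.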
Current State
Megatron has strong long-context parallelism (CP with p2p, a2a, allgather, hierarchical CP, YaRN, FlashAttention) but lacks algorithmic attention optimizations like chunked or sparse attention patterns.
Ask
- New attention module - DualChunkAttention with intra-chunk (local) + inter-chunk (global) attention
- Integration with Context Parallelism, FlashAttention, and GQA/MQA
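On the GQA/MQA integration point: one simple (if memory-inefficient) way to reuse a single-head-layout kernel is to repeat each KV head across its query group. A hedged sketch, with hypothetical names:

```python
import numpy as np

def expand_kv_heads(kv, num_query_heads):
    """GQA/MQA compatibility sketch: repeat each KV head so every
    query head has a matching key/value head. Illustrative only;
    a real kernel would index groups instead of materializing copies."""
    num_kv_heads, seq_len, head_dim = kv.shape
    assert num_query_heads % num_kv_heads == 0
    group = num_query_heads // num_kv_heads
    # repeat consecutively: heads [h0, h1] -> [h0, h0, ..., h1, h1, ...]
    return np.repeat(kv, group, axis=0)
```

A production integration would instead keep KV heads unexpanded and let the attention kernel broadcast over query groups, as FlashAttention's GQA support does.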
References