
Feature Request: Dual Chunk Attention for Long Context #2797

@sbhavani

Description

Summary

Request implementation of Dual Chunk Attention (DCA), the technique used by Qwen 2/2.5 for efficient long-context training and inference on 100K+ token sequences.

Motivation

Training on sequences of more than 100K tokens is essential for document understanding, code generation, and video models. Qwen 2.5 demonstrates that DCA enables 128K context with sub-quadratic memory, while the existing options each fall short:

  • Quadratic memory growth - Standard attention scales as O(n²), making 128K+ sequences impractical even with FlashAttention
  • CP doesn't reduce attention complexity - Context Parallelism splits the sequence across GPUs, but the full O(n²) attention is still computed in aggregate
  • YaRN/ABF are position-encoding only - They extend the positional range but don't address attention memory

DCA reduces attention memory from O(n²) to O(n·c), where c is the chunk size, by combining local intra-chunk attention with global inter-chunk attention.
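To make the intended pattern concrete, here is a minimal, unoptimized PyTorch sketch of chunked causal attention: each query chunk attends to its own chunk (intra-chunk) and streams over earlier key chunks (inter-chunk) with an online softmax, so only one chunk-sized score block is materialized at a time. All names here are illustrative, not Megatron APIs, and full DCA additionally remaps the relative position ids used for the intra-, inter-, and successive-chunk components, which is omitted in this sketch.

```python
# Illustrative reference only -- not Megatron code and not the full DCA algorithm.
import torch


def chunked_causal_attention(q, k, v, chunk_size):
    """q, k, v: [batch, heads, seq, head_dim]; seq must be a multiple of chunk_size."""
    b, h, n, d = q.shape
    scale = d ** -0.5
    out = torch.empty_like(q)
    num_chunks = n // chunk_size
    for i in range(num_chunks):
        qs = slice(i * chunk_size, (i + 1) * chunk_size)
        qi = q[:, :, qs] * scale
        # Running softmax statistics (online softmax, as in FlashAttention),
        # so only one chunk-by-chunk score block is live at a time.
        m = torch.full((b, h, chunk_size, 1), float("-inf"), device=q.device, dtype=q.dtype)
        l = torch.zeros_like(m)
        acc = torch.zeros_like(qi)
        for j in range(i + 1):  # causal: current chunk plus earlier key chunks
            ks = slice(j * chunk_size, (j + 1) * chunk_size)
            scores = qi @ k[:, :, ks].transpose(-2, -1)  # [b, h, c, c]
            if j == i:
                # Intra-chunk step: apply the causal mask inside the chunk.
                mask = torch.ones(chunk_size, chunk_size, device=q.device).triu(1).bool()
                scores = scores.masked_fill(mask, float("-inf"))
            # Online softmax update over this key chunk.
            m_new = torch.maximum(m, scores.amax(dim=-1, keepdim=True))
            p = torch.exp(scores - m_new)
            correction = torch.exp(m - m_new)
            l = l * correction + p.sum(dim=-1, keepdim=True)
            acc = acc * correction + p @ v[:, :, ks]
            m = m_new
        out[:, :, qs] = acc / l
    return out
```

A production version would of course use fused kernels (FlashAttention-style) rather than a Python loop; the sketch only shows the memory pattern the request is asking for.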

Current State

Megatron already has strong long-context parallelism (CP with p2p, a2a, and allgather communication, hierarchical CP, YaRN, FlashAttention) but lacks algorithmic attention optimizations such as chunked or sparse attention patterns.

Ask

  1. New attention module - a DualChunkAttention layer combining intra-chunk (local) and inter-chunk (global) attention (a rough interface sketch follows this list)
  2. Integration with Context Parallelism, FlashAttention, and GQA/MQA
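As a very rough sketch of what item 1 could look like, the module below wraps the chunked attention reference from the earlier sketch in a GQA-aware self-attention layer. Every class and argument name here (DualChunkSelfAttention, chunk_size, num_query_groups, etc.) is hypothetical and does not correspond to an existing Megatron-LM API; in a real integration the reference kernel would be replaced by FlashAttention and composed with CP sequence sharding.

```python
# Hypothetical interface sketch -- names do not exist in Megatron-LM today.
import torch
from torch import nn


class DualChunkSelfAttention(nn.Module):
    """Sketch: QKV projection with GQA head repetition, followed by the
    chunked_causal_attention reference defined above."""

    def __init__(self, hidden_size, num_heads, num_query_groups, chunk_size):
        super().__init__()
        self.num_heads = num_heads
        self.num_query_groups = num_query_groups  # GQA: number of KV heads
        self.head_dim = hidden_size // num_heads
        self.chunk_size = chunk_size
        kv_size = num_query_groups * self.head_dim
        self.qkv = nn.Linear(hidden_size, hidden_size + 2 * kv_size)
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).split(
            [self.num_heads * self.head_dim,
             self.num_query_groups * self.head_dim,
             self.num_query_groups * self.head_dim], dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_query_groups, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_query_groups, self.head_dim).transpose(1, 2)
        # GQA/MQA: repeat KV heads so each query head has a matching KV head.
        rep = self.num_heads // self.num_query_groups
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = chunked_causal_attention(q, k, v, self.chunk_size)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.proj(out)
```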
