
Feature Request: Dual Chunk Attention for Long Context #2797

@sbhavani

Description

Summary

Request implementation of Dual Chunk Attention (DCA), the technique used by Qwen 2/2.5 for efficient long-context training and inference on 100K+ token sequences.

Motivation

Training on sequences of more than 100K tokens is essential for document understanding, code generation, and video models. Qwen 2.5 demonstrates that DCA enables 128K context with sub-quadratic memory, while the existing options each fall short:

  • Quadratic memory growth - Standard attention scales as O(n²), making 128K+ sequences impractical even with FlashAttention
  • CP doesn't reduce attention complexity - Context Parallelism splits the sequence across GPUs, but the full O(n²) attention is still computed in aggregate
  • YaRN/ABF are position-encoding only - They extend the positional range but don't address attention memory

DCA reduces attention memory from O(n²) to O(n·c), where c is the chunk size, by combining local intra-chunk attention with global inter-chunk attention.
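To make the intended pattern concrete, here is a minimal, unoptimized PyTorch sketch of chunked causal attention: each query chunk attends to its own chunk (intra-chunk) and streams over earlier key chunks (inter-chunk) with an online softmax, so only one chunk-sized score block is materialized at a time. All names here are illustrative, not Megatron APIs, and full DCA additionally remaps the relative position ids used for the intra-, inter-, and successive-chunk components, which is omitted in this sketch.

```python
# Illustrative reference only -- not Megatron code and not the full DCA algorithm.
import torch


def chunked_causal_attention(q, k, v, chunk_size):
    """q, k, v: [batch, heads, seq, head_dim]; seq must be a multiple of chunk_size."""
    b, h, n, d = q.shape
    scale = d ** -0.5
    out = torch.empty_like(q)
    num_chunks = n // chunk_size
    for i in range(num_chunks):
        qs = slice(i * chunk_size, (i + 1) * chunk_size)
        qi = q[:, :, qs] * scale
        # Running softmax statistics (online softmax, as in FlashAttention),
        # so only one chunk-by-chunk score block is live at a time.
        m = torch.full((b, h, chunk_size, 1), float("-inf"), device=q.device, dtype=q.dtype)
        l = torch.zeros_like(m)
        acc = torch.zeros_like(qi)
        for j in range(i + 1):  # causal: current chunk plus earlier key chunks
            ks = slice(j * chunk_size, (j + 1) * chunk_size)
            scores = qi @ k[:, :, ks].transpose(-2, -1)  # [b, h, c, c]
            if j == i:
                # Intra-chunk step: apply the causal mask inside the chunk.
                mask = torch.ones(chunk_size, chunk_size, device=q.device).triu(1).bool()
                scores = scores.masked_fill(mask, float("-inf"))
            # Online softmax update over this key chunk.
            m_new = torch.maximum(m, scores.amax(dim=-1, keepdim=True))
            p = torch.exp(scores - m_new)
            correction = torch.exp(m - m_new)
            l = l * correction + p.sum(dim=-1, keepdim=True)
            acc = acc * correction + p @ v[:, :, ks]
            m = m_new
        out[:, :, qs] = acc / l
    return out
```

A production version would of course use fused kernels (FlashAttention-style) rather than a Python loop; the sketch only shows the memory pattern the request is asking for.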

Current State

Megatron already has strong long-context parallelism (CP with p2p, a2a, and allgather communication, hierarchical CP, YaRN, FlashAttention) but lacks algorithmic attention optimizations such as chunked or sparse attention patterns.

Ask

  1. New attention module - a DualChunkAttention layer combining intra-chunk (local) and inter-chunk (global) attention (a rough interface sketch follows this list)
  2. Integration with Context Parallelism, FlashAttention, and GQA/MQA
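As a very rough sketch of what item 1 could look like, the module below wraps the chunked attention reference from the earlier sketch in a GQA-aware self-attention layer. Every class and argument name here (DualChunkSelfAttention, chunk_size, num_query_groups, etc.) is hypothetical and does not correspond to an existing Megatron-LM API; in a real integration the reference kernel would be replaced by FlashAttention and composed with CP sequence sharding.

```python
# Hypothetical interface sketch -- names do not exist in Megatron-LM today.
import torch
from torch import nn


class DualChunkSelfAttention(nn.Module):
    """Sketch: QKV projection with GQA head repetition, followed by the
    chunked_causal_attention reference defined above."""

    def __init__(self, hidden_size, num_heads, num_query_groups, chunk_size):
        super().__init__()
        self.num_heads = num_heads
        self.num_query_groups = num_query_groups  # GQA: number of KV heads
        self.head_dim = hidden_size // num_heads
        self.chunk_size = chunk_size
        kv_size = num_query_groups * self.head_dim
        self.qkv = nn.Linear(hidden_size, hidden_size + 2 * kv_size)
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).split(
            [self.num_heads * self.head_dim,
             self.num_query_groups * self.head_dim,
             self.num_query_groups * self.head_dim], dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_query_groups, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_query_groups, self.head_dim).transpose(1, 2)
        # GQA/MQA: repeat KV heads so each query head has a matching KV head.
        rep = self.num_heads // self.num_query_groups
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = chunked_causal_attention(q, k, v, self.chunk_size)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.proj(out)
```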
