Add Dual Chunk Attention (DCA) for long-context training #4048

Draft
Ternura143 wants to merge 5 commits into NVIDIA:main from Ternura143:feature/dual-chunk-attention

Conversation


@Ternura143 Ternura143 commented Mar 29, 2026

What does this PR do ?

Implement Dual Chunk Attention (DCA) for efficient long-context training on 100K+ token sequences with sub-quadratic memory complexity.

Resolves #2797.

Changes

  1. New: megatron/core/transformer/experimental_attention_variant/dca.py

    • DualChunkAttention module with three attention components:
      • Intra-chunk: standard causal attention within each chunk
      • Successive-chunk: locality-preserving attention to the immediately preceding chunk
      • Inter-chunk: fixed-distance attention to all earlier chunks
    • LSE-based output merging for correct softmax renormalization across chunks
    • FlashAttention backend with native GQA support (auto-fallback to unfused on CPU)
    • YARN mscale integration for RoPE concentration factor
    • Seamless fallback to standard attention for sequences shorter than chunk_len
  2. Modified: transformer_config.py — Add dca_chunk_size (default: 8192) and dca_local_size (default: 1024) config parameters with validation

  3. Modified: attention.py — DCA integration: skip standard RoPE and pass rotary_pos_emb to DCA core_attention

  4. Modified: experimental_attention_variant_module_specs.py — Add get_dca_module_spec() and register "dca" in the experimental attention variant framework

  5. New: tests/unit_tests/transformer/test_attention_variant_dca.py — Unit tests for output shape, short-sequence equivalence, GQA, gradient flow, multi-chunk, causality, YARN mscale, FlashAttention
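The LSE-based merging mentioned above can be sketched as follows. This is a minimal NumPy illustration with a hypothetical helper name, not the PR's actual implementation (which presumably operates on batched GPU tensors): because each component's softmax normalizes only over its own key set, the per-query log-sum-exp (LSE) values carry exactly the information needed to renormalize across components.

```python
import numpy as np

def merge_attention_outputs(outs, lses):
    """Combine partial attention outputs (one per DCA component) so the
    result equals softmax attention over the union of their key sets.

    outs: list of (seq, head_dim) partial outputs
    lses: list of (seq,) log-sum-exp values of the raw scores per query
    """
    lse = np.stack(lses)                 # (parts, seq)
    lse_max = lse.max(axis=0)            # per-query max, for numerical stability
    w = np.exp(lse - lse_max)            # relative normalizer of each component
    out = sum(w_i[:, None] * o_i for w_i, o_i in zip(w, outs))
    return out / w.sum(axis=0)[:, None]  # renormalize across components
```

This is the same trick FlashAttention uses internally to merge key blocks, which is why the FlashAttention backend can return the LSE values the merge needs.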

Usage

config = TransformerConfig(
    experimental_attention_variant="dca",
    dca_chunk_size=8192,
    dca_local_size=1024,
)
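The PR states that the two new parameters are validated and that sequences shorter than the chunk length fall back to standard attention. A plausible sketch of both checks, assuming names and exact conditions that are not shown in the PR description:

```python
def validate_dca_config(dca_chunk_size: int, dca_local_size: int) -> None:
    """Hypothetical validation mirroring what transformer_config.py
    might check for the defaults dca_chunk_size=8192, dca_local_size=1024."""
    if dca_chunk_size <= 0:
        raise ValueError("dca_chunk_size must be positive")
    if not 0 < dca_local_size < dca_chunk_size:
        raise ValueError("dca_local_size must lie in (0, dca_chunk_size)")

def takes_dca_path(seq_len: int, dca_chunk_size: int) -> bool:
    """Short-sequence fallback: with at most one chunk there is nothing
    to decompose, so standard attention is used."""
    return seq_len > dca_chunk_size
```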

Status: Draft

Planned next steps:

  • FlashAttention integration for memory-efficient chunk attention (landed in a follow-up commit on this PR)
  • YARN mscale integration (landed in a follow-up commit on this PR)
  • Context Parallelism support
  • Packed sequence support
  • Functional tests with end-to-end training

References

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Implement DCA as experimental_attention_variant='dca' for efficient
training on 100K+ token sequences with sub-quadratic memory complexity.

Key changes:
- Add DualChunkAttention module with intra-chunk, successive-chunk,
  and inter-chunk attention using modified RoPE position encodings
- Add dca_chunk_size and dca_local_size to TransformerConfig
- Integrate DCA into SelfAttention with RoPE bypass
- Add DCA module spec to experimental attention variant framework
- Add comprehensive unit tests
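The intra/successive/inter decomposition can be illustrated with boolean masks. This is purely illustrative (function name assumed; real kernels would never materialize O(n²) masks): the point is that the three components exactly partition the ordinary causal mask, so merging them recovers full causal attention.

```python
import numpy as np

def dca_component_masks(seq_len: int, chunk: int):
    """Return boolean masks for intra-, successive-, and inter-chunk
    attention. Their disjoint union is the standard causal mask."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    qc, kc = i // chunk, j // chunk   # chunk index of each position
    intra = causal & (qc == kc)       # same chunk
    succ = causal & (qc == kc + 1)    # immediately preceding chunk
    inter = causal & (qc >= kc + 2)   # all chunks before that
    return intra, succ, inter
```

Each component then applies its own (modified) RoPE positions, and the three partial outputs are renormalized with the LSE-based merge.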
@Ternura143 Ternura143 requested review from a team as code owners March 29, 2026 17:06
@copy-pr-bot

copy-pr-bot bot commented Mar 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft March 29, 2026 17:06
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

- Add FlashAttention path with native GQA support and LSE-based merging
- Fix missing YARN mscale in RoPE application (was defaulting to 1.0)
- Auto-dispatch between FlashAttention (CUDA) and unfused (CPU) backends
- Add tests for mscale, FlashAttention availability, and FA vs unfused equivalence
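The auto-dispatch described in this commit might look like the following sketch. The function name is assumed and the PR's actual dispatch logic is not shown here; the idea is simply to prefer FlashAttention only when running on CUDA with the flash_attn package importable, and otherwise take the unfused path:

```python
import importlib.util

def select_attention_backend(device_type: str) -> str:
    """Hypothetical backend selector: FlashAttention on CUDA when
    available, unfused (pure-framework) attention everywhere else."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if device_type == "cuda" and has_flash:
        return "flash"
    return "unfused"
```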
@Ternura143
Author

Hi @ko3n1g , this is a draft PR implementing Dual Chunk Attention. Would appreciate any early feedback on the architecture direction before I proceed with Context Parallelism integration. Thank you!



Development

Successfully merging this pull request may close these issues.

Feature Request: Dual Chunk Attention for Long Context
