Summary
Request to add support for Attention Residuals (AttnRes) to replace fixed-weight residual connections with learned, content-aware cross-layer attention, improving scaling efficiency by ~1.25x.
Motivation
Standard residual connections accumulate all layer outputs with fixed unit weights, causing PreNorm dilution — as depth increases, individual layer contributions are diminished and hidden-state magnitudes grow unbounded.
AttnRes addresses this by computing selective softmax-attention over preceding layer outputs using a learned pseudo-query per layer, giving each layer content-aware access to earlier representations. The memory-efficient Block AttnRes variant partitions layers into ~8 blocks, applying attention only at block boundaries (O(Nd) memory for N blocks vs O(Ld) for L layers), and matches the performance of a baseline trained with 25% more compute. See the paper for full results.
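The per-layer mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the key projection, scaling, and the exact shape of the pseudo-query are assumptions; only "learned pseudo-query, softmax over preceding layer outputs" comes from the description above.

```python
import torch

class AttnResSketch(torch.nn.Module):
    """Hypothetical sketch of an Attention Residual for one layer:
    a learned pseudo-query attends over the outputs of all preceding
    layers, replacing the fixed unit-weight residual sum."""

    def __init__(self, d_model):
        super().__init__()
        # Learned pseudo-query for this layer (assumed: one vector per layer).
        self.query = torch.nn.Parameter(torch.randn(d_model))
        # Assumed: keys are a linear projection of preceding layer outputs.
        self.key_proj = torch.nn.Linear(d_model, d_model, bias=False)

    def forward(self, prev_outputs):
        # prev_outputs: list of (batch, seq, d_model) tensors from earlier layers.
        H = torch.stack(prev_outputs, dim=0)              # (L_prev, B, S, D)
        keys = self.key_proj(H)                           # (L_prev, B, S, D)
        # Score each preceding layer's output against the pseudo-query.
        scores = torch.einsum('d,lbsd->lbs', self.query, keys) / H.shape[-1] ** 0.5
        # Softmax over the layer axis: content-aware mixing weights.
        weights = torch.softmax(scores, dim=0)            # (L_prev, B, S)
        # Weighted sum replaces the fixed-weight residual accumulation.
        return torch.einsum('lbs,lbsd->bsd', weights, H)  # (B, S, D)
```

Because the softmax weights sum to 1 over preceding layers, the residual stream's magnitude stays bounded regardless of depth, in contrast to the unbounded growth noted in the Motivation.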
Requested Features
- Block Attention Residuals - Learnable cross-layer attention mechanism with configurable block partitioning, replacing fixed residual connections in TransformerLayer
- MoE Testing - Validation and integration with existing MoE architectures (paper uses Kimi Linear 48B)
References