Summary
Request to add support for Attention Residuals (AttnRes) to replace fixed-weight residual connections with learned, content-aware cross-layer attention, improving scaling efficiency by ~1.25x.
Motivation
Standard residual connections accumulate all layer outputs with fixed unit weights, causing PreNorm dilution — as depth increases, individual layer contributions are diminished and hidden-state magnitudes grow unbounded.
AttnRes addresses this by computing selective softmax-attention over preceding layer outputs using a learned pseudo-query per layer, giving each layer content-aware access to earlier representations. The memory-efficient Block AttnRes variant partitions layers into ~8 blocks, applying attention only at block boundaries (O(Nd) memory for N blocks vs O(Ld) for L layers), and matches the performance of a baseline trained with 25% more compute. See the paper for full results.
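The per-layer mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the key projection, scaling, and the exact shape of the pseudo-query are assumptions; only "learned pseudo-query, softmax over preceding layer outputs" comes from the description above.

```python
import torch

class AttnResSketch(torch.nn.Module):
    """Hypothetical sketch of an Attention Residual for one layer:
    a learned pseudo-query attends over the outputs of all preceding
    layers, replacing the fixed unit-weight residual sum."""

    def __init__(self, d_model):
        super().__init__()
        # Learned pseudo-query for this layer (assumed: one vector per layer).
        self.query = torch.nn.Parameter(torch.randn(d_model))
        # Assumed: keys are a linear projection of preceding layer outputs.
        self.key_proj = torch.nn.Linear(d_model, d_model, bias=False)

    def forward(self, prev_outputs):
        # prev_outputs: list of (batch, seq, d_model) tensors from earlier layers.
        H = torch.stack(prev_outputs, dim=0)              # (L_prev, B, S, D)
        keys = self.key_proj(H)                           # (L_prev, B, S, D)
        # Score each preceding layer's output against the pseudo-query.
        scores = torch.einsum('d,lbsd->lbs', self.query, keys) / H.shape[-1] ** 0.5
        # Softmax over the layer axis: content-aware mixing weights.
        weights = torch.softmax(scores, dim=0)            # (L_prev, B, S)
        # Weighted sum replaces the fixed-weight residual accumulation.
        return torch.einsum('lbs,lbsd->bsd', weights, H)  # (B, S, D)
```

Because the softmax weights sum to 1 over preceding layers, the residual stream's magnitude stays bounded regardless of depth, in contrast to the unbounded growth noted in the Motivation.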
Requested Features
- Block Attention Residuals - Learnable cross-layer attention mechanism with configurable block partitioning, replacing fixed residual connections in TransformerLayer
- MoE Testing - Validation and integration with existing MoE architectures (paper uses Kimi Linear 48B)
References