
Feature Request: Attention Residuals #4016

@sbhavani

Description

Summary

Request to add support for Attention Residuals (AttnRes) to replace fixed-weight residual connections with learned, content-aware cross-layer attention, improving scaling efficiency by ~1.25x.

Motivation

Standard residual connections accumulate all layer outputs with fixed unit weights, causing PreNorm dilution — as depth increases, individual layer contributions are diminished and hidden-state magnitudes grow unbounded.

AttnRes addresses this by computing a selective softmax attention over preceding layer outputs, using a learned pseudo-query per layer, which gives each layer content-aware access to earlier representations. The memory-efficient Block AttnRes variant partitions the L layers into N (~8) blocks and applies attention only at block boundaries, reducing the cached-state cost from O(Ld) to O(Nd), while matching the performance of a baseline trained with 25% more compute. See the paper for full results.
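To make the request concrete, here is a minimal PyTorch sketch of the block-boundary mechanism described above. This is an illustration, not Transformer Engine's API: the module name `BlockAttnRes`, the per-boundary pseudo-query, and the key projection are assumptions based on the issue text (a learned pseudo-query attends over cached block outputs, and the softmax-weighted sum replaces the fixed unit-weight residual accumulation).

```python
import torch
import torch.nn as nn


class BlockAttnRes(nn.Module):
    """Sketch of a Block Attention Residual (hypothetical, not TE's API).

    At a block boundary, a learned pseudo-query attends over the N cached
    block outputs; the softmax-weighted combination replaces the standard
    fixed unit-weight residual sum. Memory is O(N d) since only one
    output per block is cached, not one per layer.
    """

    def __init__(self, d_model: int):
        super().__init__()
        # Learned pseudo-query for this block boundary (assumption:
        # one query vector per boundary, shared across positions).
        self.query = nn.Parameter(torch.randn(d_model) / d_model**0.5)
        self.key_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, block_outputs: list[torch.Tensor]) -> torch.Tensor:
        # block_outputs: N tensors of shape (batch, seq, d_model),
        # one cached output per preceding block.
        h = torch.stack(block_outputs, dim=0)            # (N, B, S, D)
        keys = self.key_proj(h)                          # (N, B, S, D)
        # Score each block against the pseudo-query, per position.
        scores = torch.einsum("d,nbsd->nbs", self.query, keys)
        weights = torch.softmax(scores, dim=0)           # attend over N blocks
        # Content-aware residual: weighted sum of block outputs.
        return torch.einsum("nbs,nbsd->bsd", weights, h)
```

In an integration, this module would sit at each block boundary inside the layer stack, with each block's final hidden state appended to the cache before the next block runs.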

Requested Features

  1. Block Attention Residuals: a learnable cross-layer attention mechanism with configurable block partitioning, replacing the fixed residual connections in TransformerLayer
  2. MoE Testing: validation and integration with existing MoE architectures (the paper uses Kimi Linear 48B)

Metadata

Labels

enhancement (New feature or request)