Moonshot AI’s Dynamic Layer Attention | Generated by AI

2026.03

Question: What is a comprehensive introduction to Kimi’s Attention Residual (from Moonshot AI)?

Answer:

Attention Residuals (AttnRes) is a novel architectural innovation introduced by Moonshot AI (the team behind Kimi) in March 2026. It replaces the traditional fixed residual connections in transformer models with a learned, input-dependent attention mechanism over previous layers.

This change addresses long-standing limitations in how deep networks aggregate information across depth.

Traditional Residual Connections – The Baseline

In a standard transformer (post-LN or pre-LN), each layer updates the hidden state like this:

x_l = x_{l-1} + F_l(x_{l-1})

where F_l is the l-th sub-layer (attention or MLP).

(or in pre-LN: LayerNorm → sub-layer → add)
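In code, the pre-LN update is just a fixed addition of the sub-layer output onto the running hidden state. A minimal pure-Python sketch (with a toy `layer_norm` and `sublayer` standing in for attention or the MLP; names here are illustrative, not from the paper):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_update(x_prev, sublayer):
    """Standard pre-LN residual: x_l = x_{l-1} + F(LN(x_{l-1}))."""
    f_out = sublayer(layer_norm(x_prev))
    return [a + b for a, b in zip(x_prev, f_out)]

# Toy sub-layer: doubles each coordinate.
double = lambda x: [2.0 * v for v in x]
x1 = residual_update([1.0, 2.0, 3.0], double)
```

Note that the combination rule is hard-wired: every layer receives exactly `x_{l-1}` plus its own output, regardless of the input.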

Core Idea of Attention Residuals

Instead of blindly adding the output of the immediately preceding layer, each layer attends over all (or many) of the previous layers' outputs using softmax attention.

Simplified mathematical view:

Let the outputs of the previous layers be {h₁, h₂, …, h_{l-1}}.

For layer l, the pre-sublayer input is a learned convex combination of those outputs:

x_l^{pre} = ∑_{i=1}^{l-1} α_{l,i} ⋅ h_i,   where α_{l,i} = softmax_i(s_{l,i}) so that ∑_i α_{l,i} = 1

The scores s_{l,i} are input-dependent; the exact scoring function is given in the paper.

Key insight: depth is treated analogously to time in sequence models → attention over depth instead of fixed recurrence.
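The weighted sum over depth can be sketched directly. In this pure-Python illustration, `scores` is a placeholder for one learned scalar per previous layer (an assumption for clarity; the paper's actual scoring function may differ):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_residual(prev_layers, scores):
    """x_l^{pre} = sum_i alpha_{l,i} * h_i, with alpha = softmax(scores).

    prev_layers: hidden-state vectors h_1 .. h_{l-1}
    scores:      one score per previous layer (assumed scalar form)
    """
    alphas = softmax(scores)
    dim = len(prev_layers[0])
    x_pre = [0.0] * dim
    for alpha, h in zip(alphas, prev_layers):
        for d in range(dim):
            x_pre[d] += alpha * h[d]
    return x_pre, alphas

# Three earlier layers, with the most recent one scored highest.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_pre, alphas = attention_residual(h, scores=[0.0, 0.0, 2.0])
```

Because the weights are a softmax, they sum to 1, and the model can learn to route most of the mass to whichever depths are useful for the current input.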

Main Benefits

Because the mixing weights are learned and input-dependent, each layer can draw on whichever earlier representations are most useful, rather than being locked to the immediately preceding one. This directly targets the depth-aggregation limitations noted above.

Practical Implementation: Block Attention Residuals

Attending over every previous layer makes each layer's residual cost grow linearly with depth, so the total cost across the network is quadratic in depth (very expensive at 100+ layers).

Moonshot uses a block-wise grouping of earlier layers to keep this cost manageable; the exact scheme is detailed in the paper.
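The source does not spell out the grouping scheme, but one common way to make depth attention sub-quadratic is to summarize each fixed-size block of earlier layers (here by mean-pooling, a hypothetical choice) and attend over the block summaries instead of individual layers:

```python
def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def block_summaries(prev_layers, block_size):
    """Group earlier layer outputs into blocks and mean-pool each block,
    so a layer attends over ~L/block_size summaries instead of L layers.
    (Illustrative scheme only; Moonshot's actual grouping is in their paper.)"""
    return [mean_pool(prev_layers[i:i + block_size])
            for i in range(0, len(prev_layers), block_size)]

# 8 earlier layers with 2-dim states, blocks of 4 -> 2 summaries.
layers = [[float(i), 1.0] for i in range(8)]
summaries = block_summaries(layers, block_size=4)
```

Any scheme of this shape trades some resolution over depth for a cost that grows with the number of blocks rather than the number of layers.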

Reported results for this configuration are given in the paper.

Integration Example: Kimi Linear

Moonshot integrated Attention Residuals into their Kimi Linear model (a hybrid linear-attention + MLA architecture with 48B total / 3B active parameters, trained on 1.4T tokens).

It helps mitigate known Pre-LN issues (such as later layers contributing progressively less to the growing residual stream) and contributes to strong scaling behavior.

Summary – Why It Matters

Attention Residuals is one of the cleanest and most elegant updates to the core transformer block since the original “Attention is All You Need” paper (2017).

It completes the “attention-ification” transition: the 2017 transformer replaced recurrence over the sequence dimension with attention, and Attention Residuals now replaces the fixed residual recurrence over the depth dimension with attention as well.

Many researchers view it as a natural evolution that was surprisingly overlooked for almost a decade.

For the full technical details, refer to the official paper released by Moonshot AI.
