Model Dimension Split Across Heads | Generated by AI

2026.07

Question: Is it correct that in multi-head attention the model dimension is split across attention heads, and it is not the batch size or token length that is split?

Answer:

Yes, your understanding is correct.

In multi-head attention (MHA), the hidden dimension (the model dimension, \(d_{model}\)) is split across the heads. The batch size and the token length are not split.

1. What gets split?

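Assume: the input has shape \((B, T, d_{model})\), where \(B\) is the batch size, \(T\) is the token length, and \(d_{model}\) is the model dimension.

In multi-head attention: with \(h\) heads, only the last axis is divided, so each head gets \(d_{head} = d_{model} / h\) features.

So each head works on a smaller vector per token: \(d_{head}\) features instead of \(d_{model}\).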
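
As a rough illustration of the split, the NumPy sketch below reshapes a random input so that only the feature axis is divided among the heads. The sizes (B=2, T=5, d_model=16, h=4) and the variable names are arbitrary assumptions made for this example, with d_head = d_model / h.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
B, T, d_model, h = 2, 5, 16, 4
assert d_model % h == 0, "d_model must be divisible by the number of heads"
d_head = d_model // h  # 4 features per head

x = np.random.default_rng(0).standard_normal((B, T, d_model))

# Split only the last axis: (B, T, d_model) -> (B, T, h, d_head)
x_split = x.reshape(B, T, h, d_head)

# Move the head axis forward so each head holds a (T, d_head) matrix
# per example: (B, T, h, d_head) -> (B, h, T, d_head)
x_heads = x_split.transpose(0, 2, 1, 3)

print(x.shape)        # (2, 5, 16)
print(x_heads.shape)  # (2, 4, 5, 4)

# Head 1 holds features 4..7 of every token; B and T are untouched.
assert np.array_equal(x_heads[:, 1], x[:, :, d_head:2 * d_head])
```

In PyTorch the same two steps are typically written with `view` (or `reshape`) followed by `transpose`.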


2. What does NOT get split?

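The batch size \(B\) and the token length \(T\) are not split: every head sees all \(B\) sequences and all \(T\) tokens.

Each token still attends over all tokens in the sequence.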
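
One way to check this is to look at the shape of the attention weights. The sketch below uses random Q and K with made-up sizes and assumes no attention mask. Each head produces a full T by T weight matrix, so every token has a weight for every other token.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
B, h, T, d_head = 2, 4, 5, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((B, h, T, d_head))
K = rng.standard_normal((B, h, T, d_head))

# Scaled dot-product scores: one (T, T) matrix per example and per head.
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)

# Numerically stable softmax over the key axis (the last axis).
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

print(weights.shape)  # (2, 4, 5, 5)

# Each query token's weights over all T key tokens sum to 1.
assert np.allclose(weights.sum(axis=-1), 1.0)
```

The head count changes the width of the vectors being compared (d_head), not the number of tokens being compared.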


3. What actually happens internally?

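For each head:

  1. Project the input into Q, K, V. Per head, each has shape \((B, T, d_{head})\). In practice, one projection of width \(d_{model}\) is usually computed and then reshaped into \(h\) heads.
  2. Each head computes attention independently: \( \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d_{head}}} \right) V \)
  3. Concatenate all heads along the feature axis: \( (B, T, h \cdot d_{head}) = (B, T, d_{model}) \)
  4. A final linear projection mixes information across the heads.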
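
The four steps can be sketched end to end in NumPy. This is a minimal illustration, not a reference implementation: it assumes no biases, no mask, and no dropout. The function name `multi_head_attention`, the weight names `W_q`, `W_k`, `W_v`, `W_o`, and the sizes are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """Sketch of MHA without biases, masking, or dropout."""
    B, T, d_model = x.shape
    d_head = d_model // h

    def split_heads(t):
        # (B, T, d_model) -> (B, T, h, d_head) -> (B, h, T, d_head)
        return t.reshape(B, T, h, d_head).transpose(0, 2, 1, 3)

    # Step 1: project, then split the feature axis into heads.
    Q = split_heads(x @ W_q)
    K = split_heads(x @ W_k)
    V = split_heads(x @ W_v)

    # Step 2: each head attends independently over all T tokens.
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # (B, h, T, T)
    out = softmax(scores, axis=-1) @ V                       # (B, h, T, d_head)

    # Step 3: concatenate heads: (B, h, T, d_head) -> (B, T, d_model)
    out = out.transpose(0, 2, 1, 3).reshape(B, T, d_model)

    # Step 4: final linear projection mixes the heads.
    return out @ W_o

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
B, T, d_model, h = 2, 5, 16, 4
x = rng.standard_normal((B, T, d_model))
W_q, W_k, W_v, W_o = (0.1 * rng.standard_normal((d_model, d_model)) for _ in range(4))

y = multi_head_attention(x, W_q, W_k, W_v, W_o, h)
print(y.shape)  # (2, 5, 16)
```

The output has the same shape as the input, so B and T pass through unchanged while d_model is split into heads and then reassembled.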

4. Intuition

Think of it like:


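5. Key takeaway

Multi-head attention divides \(d_{model}\) into \(h\) slices of size \(d_{head}\); \(B\) and \(T\) keep their full size in every head.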


If you want, I can also draw the exact tensor shape flow (step-by-step like PyTorch view/reshape/transpose) so you can see how it is implemented in code.

