Understanding PyTorch Broadcasting in NanoGPT
Great question! This is a very common point of confusion when first looking at the NanoGPT or GPT-2 code.
Let’s look at the shapes carefully:
tok_emb = self.transformer.wte(idx) # Shape: (B, T, n_embd)
pos_emb = self.transformer.wpe(pos) # Shape: (T, n_embd) ← 2D, no batch dimension!
How can you add (B, T, n_embd) + (T, n_embd)?
Answer: PyTorch automatically broadcasts the position embeddings across the batch dimension.
When PyTorch sees:
tok_emb + pos_emb
it treats pos_emb (shape (T, n_embd)) as if it were (1, T, n_embd) and then broadcasts it to match (B, T, n_embd).
So this addition becomes:
  (B, T, n_embd)
+ (1, T, n_embd)   ← added automatically by broadcasting
= (B, T, n_embd)
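If you want to check the rule itself, PyTorch's torch.broadcast_shapes utility computes the broadcast result without allocating any tensors. This is just a quick sanity check, not something NanoGPT needs to call:
import torch
# Dimensions are aligned from the right; a missing leading dimension is
# treated as size 1 and then expanded to match the other shape.
print(torch.broadcast_shapes((2, 3, 768), (3, 768)))  # → torch.Size([2, 3, 768])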
This is exactly how the original GPT-2 (and any model that uses learned absolute position embeddings) works: the position embeddings are shared across all sequences in the batch.
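To make "shared across the batch" concrete, here is a toy sketch of a position-embedding lookup. The wpe name mirrors NanoGPT, but the sizes are made up for illustration:
import torch
import torch.nn as nn
# Toy position-embedding table: one learned row per position.
block_size, n_embd = 8, 16
wpe = nn.Embedding(block_size, n_embd)
T = 5
pos = torch.arange(T)    # positions 0, 1, ..., T-1, identical for every sequence
pos_emb = wpe(pos)       # (T, n_embd): one row per position, no batch dimension
print(pos_emb.shape)     # → torch.Size([5, 16])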
You can verify this works in Python:
import torch
B, T, C = 2, 3, 768
tok_emb = torch.randn(B, T, C)
pos_emb = torch.randn(T, C)
result = tok_emb + pos_emb
print(result.shape) # → torch.Size([2, 3, 768])
No error, works perfectly.
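And to confirm that the implicit (1, T, n_embd) step produces exactly the same values as spelling it out yourself, you can continue the snippet above:
# Identical to inserting the leading batch dimension by hand:
explicit = tok_emb + pos_emb.unsqueeze(0)   # (1, T, C) broadcast to (B, T, C)
print(torch.equal(result, explicit))        # → True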
So in NanoGPT (and the original GPT-2), this line is 100% correct and intentional:
x = self.transformer.drop(tok_emb + pos_emb)
It’s relying on PyTorch’s broadcasting — clean, efficient, and standard practice.
No need for pos_emb.unsqueeze(0) or anything else. Broadcasting does the job elegantly.
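For completeness, here is a minimal, self-contained sketch of just the embedding stage. It mirrors NanoGPT's attribute names (wte, wpe, drop) and uses GPT-2-sized defaults for illustration, but it omits the transformer blocks, so treat it as a sketch rather than the actual implementation:
import torch
import torch.nn as nn

class EmbeddingStage(nn.Module):
    def __init__(self, vocab_size=50257, block_size=1024, n_embd=768, dropout=0.1):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)   # token embedding table
        self.wpe = nn.Embedding(block_size, n_embd)   # position embedding table
        self.drop = nn.Dropout(dropout)

    def forward(self, idx):                           # idx: (B, T) token ids
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)      # (T,)
        tok_emb = self.wte(idx)                       # (B, T, n_embd)
        pos_emb = self.wpe(pos)                       # (T, n_embd)
        return self.drop(tok_emb + pos_emb)           # broadcast add → (B, T, n_embd)

stage = EmbeddingStage()
x = stage(torch.randint(0, 50257, (2, 16)))
print(x.shape)                                        # → torch.Size([2, 16, 768])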