NanoGPT Architecture and Token Processing | Generated by AI


Yes, you’re on the right track with understanding the NanoGPT architecture (as implemented in Andrej Karpathy’s popular repository), but let’s clarify and structure it properly — especially the relationship between sequence length (512 tokens), heads, embeddings, and the transformer blocks.


NanoGPT Overview (Typical Config)

NanoGPT is a from-scratch GPT-2-style model, typically trained on tiny datasets such as Shakespeare. Its reference config is roughly GPT-2 small, scaled down for the smaller demos.

Parameter                           Typical value in NanoGPT
block_size (max sequence length)    512
n_embd (embedding dimension)        768 (for the 124M model)
n_head                              12
n_layer                             12

But for smaller demos it uses even tinier configs (models with tens of millions of parameters or fewer).


Your Question Breakdown:

“for every 512 tokens, they have GPT model”

No.
The entire input sequence is 512 tokens, and one GPT model processes all 512 tokens at once (in parallel during training, autoregressively during inference).

So: there is one model. The 512 is block_size, the maximum context window the model attends over; it is not a count of models or of model copies.
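
To make “autoregressively during inference” concrete, here is a minimal sketch of the generation loop, assuming a model callable that returns logits of shape [B, T, vocab_size]. NanoGPT’s own generate method samples with a temperature instead of the greedy pick used here, and its forward returns a (logits, loss) tuple, so treat this as an illustration only:

import torch

block_size = 512
vocab_size = 50257

# Dummy stand-in for a trained model: returns random logits [B, T, vocab_size]
model = lambda idx: torch.randn(idx.size(0), idx.size(1), vocab_size)

def generate(idx, max_new_tokens):
    # idx: [B, T] token ids; one model call per generated token
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                           # keep at most the last 512 tokens
        logits = model(idx_cond)                                  # [B, T, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick (NanoGPT samples instead)
        idx = torch.cat([idx, next_id], dim=1)                    # append the new token
    return idx

out = generate(torch.zeros(1, 1, dtype=torch.long), max_new_tokens=5)
print(out.shape)    # torch.Size([1, 6])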


“512 will be like 8 head 64 tokens”

Close, but not quite.

Let’s clarify multi-head attention. With n_embd = 768 and n_head = 12, each head gets 768 / 12 = 64 dimensions of the embedding, not 64 tokens. So the split is 12 heads × 64 dimensions, and every head still attends over all 512 token positions.

So yes, each head processes 512 tokens with 64-dimensional queries, keys, and values:

Input: [512 tokens] → each token has 768-dim embedding
       ↓ split into 12 heads
       → each head: 512 × 64 matrix (Q, K, V)
       → self-attention over 512 positions
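
If it helps to see the shapes, here is a minimal PyTorch sketch of that head split (the combined QKV projection and the tensor names are illustrative, in the style of GPT-2 attention, rather than NanoGPT’s exact code):

import torch
import torch.nn as nn

B, T, C = 2, 512, 768            # batch, sequence length, embedding dim
n_head = 12
head_dim = C // n_head           # 768 / 12 = 64

x = torch.randn(B, T, C)         # token embeddings after wte + wpe

# One combined projection to queries, keys, and values
qkv = nn.Linear(C, 3 * C)(x)     # [B, 512, 2304]
q, k, v = qkv.split(C, dim=2)    # each [B, 512, 768]

# Split the 768-dim vectors into 12 heads of 64 dims each
q = q.view(B, T, n_head, head_dim).transpose(1, 2)   # [B, 12, 512, 64]
k = k.view(B, T, n_head, head_dim).transpose(1, 2)   # [B, 12, 512, 64]
v = v.view(B, T, n_head, head_dim).transpose(1, 2)   # [B, 12, 512, 64]

print(q.shape)                   # torch.Size([2, 12, 512, 64])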

Inside the Transformer Block (Forward Pass)

Each transformer block does:

def forward(self, x):
    # x: [batch, seq_len=512, n_embd=768]

    # 1. LayerNorm + Multi-Head Self-Attention
    attn_out = self.attn(self.ln1(x))   # 12 heads, each 64-dim
    x = x + attn_out                    # residual

    # 2. LayerNorm + Feed-Forward (MLP)
    ff_out = self.mlp(self.ln2(x))      # usually 4× expansion (3072 dim)
    x = x + ff_out                      # residual

    return x

This block is repeated n_layer times (e.g., 12).
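
For reference, a self-contained sketch of such a block is below. It assumes PyTorch and uses nn.MultiheadAttention with an explicit causal mask as a stand-in for NanoGPT’s CausalSelfAttention (which splits the heads manually); class and variable names are illustrative:

import torch
import torch.nn as nn

class MLP(nn.Module):
    """Position-wise feed-forward with the usual 4x expansion (768 -> 3072 -> 768)."""
    def __init__(self, n_embd):
        super().__init__()
        self.fc = nn.Linear(n_embd, 4 * n_embd)
        self.proj = nn.Linear(4 * n_embd, n_embd)
        self.act = nn.GELU()

    def forward(self, x):
        return self.proj(self.act(self.fc(x)))

class Block(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        # Causal mask: True above the diagonal blocks attention to future positions
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                 # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x

x = torch.randn(2, 512, 768)
print(Block()(x).shape)                  # torch.Size([2, 512, 768])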


Full Forward Pass Summary

Input tokens                              [B, 512]
  → Token + Position Embeddings           [B, 512, 768]
  → For each of 12 layers:
        Multi-Head Attention (12 heads × 64 dim) over 512 tokens
        Feed-Forward (768 → 3072 → 768)
  → Final LayerNorm
  → Linear head → logits                  [B, 512, vocab_size]
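
To make the shapes concrete, here is a runnable sketch of that pipeline with the 12 transformer blocks elided (layer names mirror the NanoGPT snippet further below; the sizes are the 124M-style values used in this post):

import torch
import torch.nn as nn

B, T, C, V = 2, 512, 768, 50257      # batch, block_size, n_embd, vocab_size

wte = nn.Embedding(V, C)             # token embedding table
wpe = nn.Embedding(T, C)             # learned position embeddings
ln_f = nn.LayerNorm(C)               # final LayerNorm
lm_head = nn.Linear(C, V, bias=False)

idx = torch.randint(0, V, (B, T))    # [B, 512] token ids
pos = torch.arange(T)                # [512] positions 0..511

x = wte(idx) + wpe(pos)              # [B, 512, 768]

# ... the 12 transformer blocks would run here, each keeping the [B, 512, 768] shape ...

logits = lm_head(ln_f(x))            # [B, 512, 50257]
print(logits.shape)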

Key Points

Concept                  Value
Max sequence length      512 tokens
Model processes          all 512 tokens at once
Attention heads          12 (for the 124M model)
Per-head dimension       64
Attention scores         softmax(QK^T / √64) over a 512 × 512 matrix per head
Causal masking           ensures each token attends only to itself and earlier positions
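
A minimal sketch of that attention computation for a single 64-dim head, following the table above (this mirrors the “manual” masked-softmax path rather than any fused attention kernel):

import math
import torch
import torch.nn.functional as F

B, T, head_dim = 2, 512, 64
q = torch.randn(B, T, head_dim)      # queries for one head
k = torch.randn(B, T, head_dim)      # keys
v = torch.randn(B, T, head_dim)      # values

# Scores: QK^T / sqrt(64) -> a 512 x 512 matrix per batch element
att = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)

# Causal mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float('-inf'))

att = F.softmax(att, dim=-1)         # each row sums to 1 over the allowed positions
out = att @ v                        # [B, 512, 64] output of this head
print(out.shape)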

Visual

[Token 1] → emb → [768]
[Token 2] → emb → [768]
...
[Token 512] → emb → [768]
      ↓
[512 × 768] → 12 Transformer Blocks
      ↓
[512 × 768] → Linear → [512 × 50257] logits

Each block has 12 attention heads, each doing attention over all 512 positions.


Reference (NanoGPT code)

From model.py (lightly simplified):

self.transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(vocab_size, n_embd),
    wpe = nn.Embedding(block_size, n_embd),
    h = nn.ModuleList([Block(n_embd, n_head) for _ in range(n_layer)]),
    ln_f = nn.LayerNorm(n_embd),
))

And in Block:

self.attn = CausalSelfAttention(n_embd, n_head)  # n_head=12, head_size=64
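
NanoGPT gathers these hyperparameters into a GPTConfig dataclass. The sketch below uses the repo’s field names, but the default values shown are the ones discussed in this post rather than the repo’s own defaults, so treat them as an assumption:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 512        # max sequence length used in this discussion
    vocab_size: int = 50257      # GPT-2 BPE vocabulary
    n_layer: int = 12            # number of transformer blocks
    n_head: int = 12             # attention heads per block
    n_embd: int = 768            # embedding dim; 768 / 12 heads = 64 dims per head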

Final Answer to Your Question

Yes, in NanoGPT: one GPT model processes the entire 512-token sequence; attention inside each block is split into 12 heads of 64 dimensions each; and that block is stacked n_layer = 12 times.

So your intuition is mostly correct. Just remember: one model, 512 tokens, 12 heads × 64 dims, repeated across layers.

Let me know if you want the exact config for 124M vs smaller models!

