Token and Positional Embeddings Explained | Generated by AI
Explanation of wte and wpe in the GPT Model
In this GPT model implementation (based on the NanoGPT style from Andrej Karpathy’s work), the transformer module contains several key components. The two you’re asking about, wte and wpe, are both instances of nn.Embedding layers. These are used to convert discrete inputs (like tokens and positions) into dense vector representations called embeddings. Embeddings are a core part of transformer models, allowing the network to learn meaningful numerical representations for categorical data.
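To make the lookup-table idea concrete, here is a tiny, hypothetical sketch (not taken from the model code): an nn.Embedding layer is just a learnable weight matrix, and looking up an integer ID returns the corresponding row of that matrix.

```python
import torch
import torch.nn as nn

# A tiny embedding table: 10 possible IDs, each mapped to a 4-dimensional vector.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

ids = torch.tensor([3, 7])   # two integer IDs
vectors = emb(ids)           # shape: (2, 4)

# The lookup is literally row indexing into the weight matrix.
print(torch.equal(vectors[0], emb.weight[3]))  # True
print(emb.weight.shape)                        # torch.Size([10, 4])
```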
What is wte?
- Full name: Token Embedding (sometimes called “Word Token Embedding”).
- Purpose: It maps each unique token from the vocabulary (e.g., words, subwords, or characters) to a fixed-size vector of dimension config.n_embd (the model's embedding size, often 768 for GPT-2-sized models). The vocabulary size is config.vocab_size (e.g., 50,257 for the GPT-2 BPE tokenizer). A minimal lookup sketch follows this list.
- Input: An integer token ID (0 to vocab_size - 1).
- Output: A learned vector representing the “meaning” of that token.
- Why it’s needed: Raw token IDs are just integers with no semantic info. Embeddings turn them into vectors that capture relationships (e.g., “king” and “queen” might end up with similar vectors after training).
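As promised above, here is a minimal sketch of wte in isolation; the vocab_size and n_embd values are hypothetical (GPT-2-like), not read from your config.

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 50257, 768          # hypothetical GPT-2-like sizes
wte = nn.Embedding(vocab_size, n_embd)   # token embedding table

# A batch of 2 sequences, each 5 tokens long (arbitrary IDs for illustration).
token_ids = torch.randint(0, vocab_size, (2, 5))

token_emb = wte(token_ids)
print(token_emb.shape)  # torch.Size([2, 5, 768]) -> (batch_size, seq_len, n_embd)
```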
What is wpe?
- Full name: Positional Embedding.
- Purpose: It maps each position in the input sequence (from 0 to config.block_size - 1, where block_size is the maximum sequence length, e.g., 1024) to a fixed-size vector of the same dimension, config.n_embd. A small sketch follows this list.
- Input: An integer position index (0 to block_size - 1).
- Output: A learned vector that encodes the position’s location in the sequence.
- Why it’s needed: Transformers process sequences in parallel and don’t have built-in order awareness (unlike RNNs). Positional embeddings inject information about the relative or absolute position of tokens, so the model knows “cat” at position 1 is different from “cat” at position 10.
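Here is the sketch mentioned in the list above, again with hypothetical sizes: the same token ID yields the same wte vector at every position, but adding the wpe vector for its position makes the combined embedding position-dependent.

```python
import torch
import torch.nn as nn

block_size, n_embd = 1024, 768            # hypothetical max sequence length and embedding size
vocab_size = 50257
wte = nn.Embedding(vocab_size, n_embd)    # token embeddings
wpe = nn.Embedding(block_size, n_embd)    # positional embeddings

seq_len = 12
token_ids = torch.full((1, seq_len), 42, dtype=torch.long)  # the same token at every position
positions = torch.arange(seq_len)                           # 0, 1, ..., seq_len - 1

combined = wte(token_ids) + wpe(positions)  # (1, seq_len, n_embd) + (seq_len, n_embd), broadcast

# Identical tokens, but different positions -> different combined vectors.
print(torch.allclose(combined[0, 1], combined[0, 10]))  # False
```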
How Embeddings Work in Training
Yes, you’re exactly right—these are learnable parameters in the neural network:
- Initialization: PyTorch’s
nn.Embeddinginitializes the embedding matrix randomly (by default, using a uniform distribution between -√(1/dim) and +√(1/dim), where dim isn_embd). This gives them starting values that aren’t zero or random noise. - Training: During model training (e.g., via backpropagation on a language modeling loss), the gradients update these embedding weights just like any other parameter (e.g., in the transformer blocks). Over time, they learn to represent tokens and positions in a way that minimizes the loss—capturing semantics, syntax, and positional dependencies.
- Usage in Forward Pass: (Not shown in your code snippet, but typically) for an input sequence of token IDs x, the model computes:

  ```python
  token_emb = wte(x)                      # shape: (batch_size, seq_len, n_embd)
  pos_emb   = wpe(torch.arange(seq_len))  # shape: (seq_len, n_embd), broadcast to match token_emb
  input_emb = token_emb + pos_emb         # add them element-wise
  ```

  Then input_emb (after dropout) feeds into the transformer blocks (h).
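For completeness, here is a self-contained, runnable sketch of that embedding stage (the sizes, dropout rate, and dummy loss are hypothetical, chosen only for illustration). It also backs up the training bullet above: after a backward pass, the embedding weights receive gradients like any other parameter, and only the wte rows for token IDs that actually appear in the batch get non-zero gradients.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, block_size, n_embd, dropout_p = 1000, 64, 32, 0.1

wte  = nn.Embedding(vocab_size, n_embd)   # token embeddings
wpe  = nn.Embedding(block_size, n_embd)   # positional embeddings
drop = nn.Dropout(dropout_p)

# GPT-2-style re-initialization (NanoGPT does something similar in _init_weights).
nn.init.normal_(wte.weight, mean=0.0, std=0.02)
nn.init.normal_(wpe.weight, mean=0.0, std=0.02)

# A dummy batch: 4 sequences of 16 token IDs each.
batch_size, seq_len = 4, 16
x = torch.randint(0, vocab_size, (batch_size, seq_len))

token_emb = wte(x)                       # (batch_size, seq_len, n_embd)
pos_emb   = wpe(torch.arange(seq_len))   # (seq_len, n_embd), broadcast over the batch
input_emb = drop(token_emb + pos_emb)    # what would feed into the transformer blocks (h)

# A made-up scalar loss, just to drive a backward pass for demonstration.
loss = input_emb.pow(2).mean()
loss.backward()

print(input_emb.shape)        # torch.Size([4, 16, 32])
print(wte.weight.grad.shape)  # torch.Size([1000, 32]) -- embeddings are trainable parameters

# Only rows of wte for token IDs present in the batch receive non-zero gradients.
used_ids = torch.unique(x)
rows_with_grad = torch.count_nonzero(wte.weight.grad.abs().sum(dim=1))
print(rows_with_grad == used_ids.numel())  # True
```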
This setup is efficient and standard for GPT-like models. If you have the full forward pass or more code, I can dive deeper!