
Andrej Karpathy’s Explanation of Activation Functions in His “Let’s Build GPT” Video

In his YouTube video “Let’s build GPT: from scratch, in code, spelled out.” (the relevant discussion begins at roughly the 2:51:40 mark), Andrej Karpathy discusses activation functions in the context of the Multi-Layer Perceptron (MLP) layer within GPT-2’s Transformer architecture. He focuses on why certain activations are chosen for language models, emphasizing gradient flow, neuron health, and empirical performance. The discussion starts with the MLP’s role (the “map” operation after attention’s “reduce”), where he explains GELU as the key activation used in GPT-2, contrasts it with alternatives like ReLU and tanh, and briefly touches on Swish for modern context.

Here’s a breakdown of his explanations for GELU and tanh specifically, drawn directly from that section. (He doesn’t dwell heavily on tanh alone but uses it as a foil to highlight GELU’s advantages.)

GELU (Gaussian Error Linear Unit)

Karpathy describes GELU as the activation function powering the non-linearity in GPT-2’s MLP (between two linear layers). He stresses its role in ensuring smooth, reliable training in large language models like GPT-2 and BERT.

He implements it simply in code as part of the MLP forward pass, showing how it is applied to each token’s representation independently, after attention has pooled information across the sequence.
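
For concreteness, here is a minimal PyTorch sketch of a GPT-2-style MLP block with GELU between the two linear layers. The structure mirrors what Karpathy codes up, but the class and variable names here are illustrative rather than copied from the video.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """GPT-2-style feed-forward block: expand 4x, apply GELU, project back."""
    def __init__(self, n_embd: int):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)    # expansion
        self.gelu = nn.GELU(approximate="tanh")      # GPT-2 uses the tanh approximation of GELU
        self.c_proj = nn.Linear(4 * n_embd, n_embd)  # projection back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied to every token position independently (the "map" step after attention)
        return self.c_proj(self.gelu(self.c_fc(x)))

# Example: a batch of 2 sequences, 8 tokens each, embedding width 32
x = torch.randn(2, 8, 32)
print(MLP(32)(x).shape)  # torch.Size([2, 8, 32])
```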

tanh (Hyperbolic Tangent)

Karpathy doesn’t use tanh in the GPT-2 build but references it as an older, flawed alternative to illustrate why modern activations like GELU win out. He frames it as a classic example of what not to rely on in deep nets.
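
To see the failure mode he is pointing at, here is a small standalone check (my own illustration, not code from the video): the gradient of tanh is \( 1 - \tanh^2(x) \), so it collapses toward zero once inputs saturate, which starves earlier layers of gradient signal.

```python
import torch

# Gradient of tanh is 1 - tanh(x)^2: near 1 at x = 0, near 0 once |x| is large
for val in [0.0, 2.0, 5.0]:
    x = torch.tensor(val, requires_grad=True)
    torch.tanh(x).backward()
    print(f"x = {val:>4}: tanh(x) = {torch.tanh(x).item():+.4f}, "
          f"d/dx tanh(x) = {x.grad.item():.6f}")
```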

Quick Comparison (as Karpathy Frames It)

Karpathy contrasts these in the video to justify GELU:

| Activation | Output Range | Key Strength | Main Weakness | Used in GPT-2? |
| --- | --- | --- | --- | --- |
| GELU | \( \approx [-0.17, \infty) \) | Smooth, non-zero gradients almost everywhere; neurons stay “alive” | Slightly more compute than ReLU | Yes (tanh approximation) |
| tanh | \( (-1, 1) \) | Bounded outputs (useful for some normalization) | Vanishing gradients; saturates early | No |

He also nods to ReLU (\( \max(0, x) \), prone to dead neurons) as what GELU directly improves on, and to Swish (\( x \cdot \sigma(x) \), whose gated variant SwiGLU appears in later models such as Llama) as a close cousin of GELU, but the core focus is GELU’s practicality for GPT.
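
For reference, a side-by-side sketch of the four activations mentioned in this section, written out with their standard definitions (a summary for comparison, not code taken from the video):

```python
import torch

def relu(x):  return torch.clamp(x, min=0.0)                    # max(0, x); zero gradient for x < 0
def gelu(x):  return x * 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))  # exact GELU: x * Phi(x)
def swish(x): return x * torch.sigmoid(x)                       # a.k.a. SiLU; Swish with beta = 1
def tanh(x):  return torch.tanh(x)                              # bounded to (-1, 1)

x = torch.linspace(-3, 3, 7)
for name, fn in [("relu", relu), ("gelu", gelu), ("swish", swish), ("tanh", tanh)]:
    print(f"{name:>5}: {[round(v, 3) for v in fn(x).tolist()]}")
```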

This segment (~5-10 minutes) ties into his hands-on code walkthrough, showing how these choices make the model train reliably on Shakespeare text. For the full nuance, jump to the timestamp in the video!

