GELU Prevents Linear Collapse in FFN

Generated by AI

2026.06

Question: Why is GELU necessary in the FFN layer of nanoGPT?

Answer:

Without GELU (or any non-linearity), your two linear transformations compose into a single linear transformation:

x_out = W_proj @ (W_fc @ x + b_fc) + b_proj
      = (W_proj @ W_fc) @ x + (W_proj @ b_fc + b_proj)

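This is still a single affine map, so the FFN collapses to one matrix multiply plus a bias. The hidden expansion adds no expressive power: however wide the hidden layer is, the FFN can only represent linear relationships. The only non-linearities left in the block would be the attention softmax and LayerNorm. GELU breaks this linearity and lets the FFN learn non-linear patterns.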
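The collapse can be checked numerically. The following is a minimal, dependency-free sketch with toy sizes (4 and 16 instead of real model widths); the helper names `matvec`, `matmul`, and `add` are illustrative and not part of nanoGPT. It merges the two layers into one and compares outputs, then repeats the comparison with GELU in between.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def gelu(v):
    """Exact GELU, x * Phi(x), applied elementwise."""
    return [0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0))) for t in v]

random.seed(0)
n_embd, n_hidden = 4, 16  # toy sizes, chosen only for illustration

W_fc = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(n_hidden)]
b_fc = [random.gauss(0, 1) for _ in range(n_hidden)]
W_proj = [[random.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_embd)]
b_proj = [random.gauss(0, 1) for _ in range(n_embd)]
x = [random.gauss(0, 1) for _ in range(n_embd)]

# Single merged affine layer: W_merged @ x + b_merged
W_merged = matmul(W_proj, W_fc)              # n_embd x n_embd
b_merged = add(matvec(W_proj, b_fc), b_proj)
merged_out = add(matvec(W_merged, x), b_merged)

# Two stacked layers with no activation
hidden = add(matvec(W_fc, x), b_fc)
linear_out = add(matvec(W_proj, hidden), b_proj)

# Two stacked layers with GELU in between
gelu_out = add(matvec(W_proj, gelu(hidden)), b_proj)

gap_without_gelu = max(abs(a - b) for a, b in zip(linear_out, merged_out))
gap_with_gelu = max(abs(a - b) for a, b in zip(gelu_out, merged_out))

print("max gap without GELU:", gap_without_gelu)  # floating-point noise only
print("max gap with GELU:   ", gap_with_gelu)     # clearly non-zero
```

Without the activation, the two-layer stack and the merged layer agree up to rounding error. With GELU in place, no single merged matrix reproduces the output.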

Why GELU over ReLU?

GELU is defined as:

\[\text{GELU}(x) = x \cdot \Phi(x)\]

where $\Phi(x)$ is the cumulative distribution function of a standard normal distribution. In practice, it’s often approximated as:

\[\text{GELU}(x) \approx 0.5 \cdot x \cdot (1 + \tanh(\sqrt{2/\pi}(x + 0.044715 \cdot x^3)))\]
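To see how close the two forms are, here is a small sketch that evaluates both formulas with Python's `math` module. The function names `gelu_exact` and `gelu_tanh` and the evaluation grid are choices made for this example only.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi written in terms of erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Evaluate on a grid from -5 to 5 in steps of 0.01
xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)

print("largest absolute difference on the grid:", max_gap)
```

On this grid the two curves differ by less than 1e-3 everywhere, which is why the approximation is usually treated as interchangeable with the exact form.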

The key insight: GELU is smooth and differentiable everywhere, unlike ReLU, which has a kink at zero and outputs exactly zero for every negative input. This matters because:

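Adaptive gating effect: GELU scales each input by $\Phi(x)$, the probability that a standard normal variable falls below it. Large positive inputs pass through almost unchanged, large negative inputs are pushed to nearly zero, and inputs near zero are roughly halved rather than cut off.
Gradient flow: The gradient is non-zero for moderately negative inputs, so a unit that lands slightly below zero still receives updates. ReLU's gradient is exactly zero for every negative input, which can leave units permanently inactive.
Empirical performance: GELU was introduced by Hendrycks and Gimpel (2016), who reported gains over ReLU and ELU in their experiments. BERT (Devlin et al., 2018) and the GPT models adopted it, and nanoGPT follows GPT-2 in using it. The improvement over ReLU is typically modest rather than dramatic.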

Here’s the comparison in code:

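import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 100)

relu = F.relu(x)
gelu = F.gelu(x)                           # exact erf-based form (the default)
gelu_tanh = F.gelu(x, approximate="tanh")  # tanh approximation
silu = F.silu(x)                           # x * sigmoid(x); also smooth, used in LLaMA's SwiGLU FFN

# GELU is smooth at the origin and dips slightly below zero for x < 0.
# ReLU has a corner at 0 and zero gradient for every x < 0,
# so a unit stuck in the negative range stops receiving updates.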
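To put numbers on the gating and gradient behaviour, here is a dependency-free sketch using the closed-form derivative GELU'(x) = Φ(x) + x·φ(x), where φ is the standard normal density. The helper names (`normal_cdf`, `normal_pdf`, `gelu_grad`, `relu_grad`) are illustrative, not library functions.

```python
import math

def normal_cdf(x):
    """Phi(x): CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    """phi(x): density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * normal_cdf(x)

def gelu_grad(x):
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x)."""
    return normal_cdf(x) + x * normal_pdf(x)

def relu_grad(x):
    """ReLU derivative, taking the value 0 at x = 0 by convention."""
    return 1.0 if x > 0 else 0.0

print(f"{'x':>6} {'gate Phi(x)':>12} {'GELU grad':>10} {'ReLU grad':>10}")
for x in (-3.0, -1.0, -0.1, 0.1, 1.0, 3.0):
    print(f"{x:>6.1f} {normal_cdf(x):>12.4f} {gelu_grad(x):>10.4f} {relu_grad(x):>10.1f}")
```

The gate value rises smoothly from near 0 to near 1 and equals 0.5 at x = 0. For negative inputs the GELU gradient is small but non-zero (it is slightly negative around x = -1, since GELU is not monotonic), whereas the ReLU gradient is exactly zero.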

In transformer FFNs specifically

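The FFN expands each token from 1600 to 6400 dimensions and then projects back to 1600. GELU sits between the two projections and acts as both:

Non-linear gate: Passes strongly positive hidden activations and damps negative ones, so each of the 6400 hidden units responds only to some inputs.
Soft suppressor: Shrinks weakly negative activations toward zero instead of clipping them, so small signals are reduced without being discarded.

Without it, the 4x expansion is wasted: the two dense matrices multiply out to a single 1600 × 1600 map with no learned selectivity.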
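The cost of that waste can be worked out with simple arithmetic. The sketch below assumes both linear layers carry a bias term (nanoGPT makes this configurable) and uses the 1600 and 6400 widths quoted above; the variable names are illustrative.

```python
n_embd = 1600
n_hidden = 4 * n_embd  # 6400

# Parameters in the two-layer FFN: weights plus biases for each layer
ffn_params = (n_embd * n_hidden + n_hidden) + (n_hidden * n_embd + n_embd)

# Parameters in the single affine map it collapses to without an activation
collapsed_params = n_embd * n_embd + n_embd

# The product W_proj @ W_fc can never exceed this rank
max_rank = min(n_embd, n_hidden)

print(ffn_params)        # 20488000
print(collapsed_params)  # 2561600
print(max_rank)          # 1600
print(round(ffn_params / collapsed_params, 2))  # 8.0
```

Under these assumptions, an activation-free FFN would spend roughly eight times the parameters of the single 1600 × 1600 layer it is equivalent to.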

Bottom line: GELU (or some other non-linearity) is what lets the FFN learn anything beyond a linear map. It’s not ornamental.
