Why Sigmoid Isn’t Used in GPT/LLM Models
Large language models (LLMs) like GPT rely on transformer architectures, where activation functions play a key role: they introduce non-linearity in the feed-forward (FFN) layers that process attention outputs. While the sigmoid function—defined as \(\sigma(x) = \frac{1}{1 + e^{-x}}\), which squashes inputs to (0, 1)—was popular in early neural networks (e.g., for binary classification), it’s largely avoided in modern LLMs for internal activations. Here’s why, broken down by key issues:
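To make the squashing concrete, here is a minimal NumPy sketch of the sigmoid defined above (an illustrative example with toy inputs, not code from any LLM):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)) squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))  # ~[4.5e-05, 0.119, 0.5, 0.881, 0.99995]: all strictly inside (0, 1)
```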
1. Vanishing Gradients Problem
- Sigmoid saturates at extremes: For large positive \(x\), \(\sigma(x) \approx 1\); for large negative \(x\), \(\sigma(x) \approx 0\).
- Its derivative is \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), which approaches 0 in these regions. During backpropagation, this causes gradients to “vanish” (become tiny), stalling learning in deep layers.
- Transformer LLMs are extremely deep (e.g., GPT-3 stacks 96 decoder layers), so this hampers training efficiency. Alternatives like ReLU (\(f(x) = \max(0, x)\)) or GELU (which we discussed earlier) don’t saturate for positive inputs, allowing better gradient flow; the short sketch after this list illustrates the gradient gap.
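To illustrate that gap, here is a small NumPy sketch (toy inputs, my own example rather than anything from a real model) comparing the sigmoid derivative with ReLU’s in the saturated regions described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma'(x) = sigma(x) * (1 - sigma(x)), peaks at 0.25

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(x))  # ~[4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]: shrinks toward 0 at both extremes
print(relu_grad(x))     # [0., 0., 0., 1., 1.]: stays exactly 1 for every positive input

# Crude illustration (ignoring weight matrices): chaining even the best-case
# sigmoid gradient of 0.25 across ~96 layers multiplies to a vanishingly small number.
print(0.25 ** 96)       # ~1.6e-58
```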
2. Non-Zero-Centered Outputs
- Sigmoid always outputs positive values in (0, 1), so the gradients flowing into a layer’s weights tend to share the same sign. This leads to “zig-zag” gradient descent paths, slowing convergence compared to zero-centered functions like tanh or GELU (see the sketch after this list).
- In transformers, FFN layers process high-dimensional embeddings, and zero-centered activations help maintain stable signal propagation across residual connections.
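The zero-centering point is easy to see with a short NumPy sketch (random zero-mean inputs, purely illustrative): sigmoid pushes every activation above zero, while tanh stays roughly centered.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # zero-mean "pre-activations"

sigmoid_out = 1.0 / (1.0 + np.exp(-x))
tanh_out = np.tanh(x)

print(sigmoid_out.mean())  # ~0.5: always positive, biasing weight updates in one direction
print(tanh_out.mean())     # ~0.0: zero-centered, keeping updates balanced
```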
3. Empirical Underperformance
- Extensive experiments show sigmoid lags behind modern activations in NLP tasks. The original Transformer used ReLU in its FFN layers for speed and simplicity, and the GPT series (GPT-1 through GPT-3) adopted GELU for its smooth, probabilistic behavior, which empirically boosts accuracy and convergence without sigmoid’s drawbacks.
- Variants like SiLU (Sigmoid Linear Unit, \(f(x) = x \cdot \sigma(x)\), also called Swish) do appear in some LLMs (e.g., the SwiGLU feed-forward layers in LLaMA), but this isn’t plain sigmoid: multiplying by \(x\) restores roughly zero-centered outputs and reduces saturation (see the sketch after this list).
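Here is a minimal NumPy sketch of SiLU/Swish (illustrative values only) showing how the extra factor of \(x\) lets small negative values through instead of clamping everything into (0, 1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU / Swish: f(x) = x * sigma(x)
    return x * sigmoid(x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(silu(x))  # ~[-0.072, -0.269, 0.0, 0.731, 3.928]: small negative outputs survive, unlike plain sigmoid
```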
4. Output Layer Considerations
- For next-token prediction in LLMs, the final layer applies softmax over the logits (the multi-class generalization of sigmoid), not sigmoid itself. Sigmoid works for binary decisions but fails for large vocabularies (e.g., 50k+ tokens): element-wise sigmoid outputs don’t sum to 1, so they don’t form a valid probability distribution (see the sketch after this list).
- Softmax ensures a valid distribution, which is crucial for cross-entropy loss.
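A tiny NumPy sketch with a hypothetical 5-token vocabulary makes the difference explicit: element-wise sigmoid produces independent scores that don’t sum to 1, while softmax yields a proper distribution.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])  # hypothetical next-token logits for a toy 5-token vocabulary

sigmoid_scores = 1.0 / (1.0 + np.exp(-logits))
softmax_probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
softmax_probs /= softmax_probs.sum()

print(sigmoid_scores.sum())  # ~2.55: independent per-token scores, not a probability distribution
print(softmax_probs.sum())   # 1.0: a valid distribution, ready for cross-entropy loss
```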
When Is Sigmoid Still Used?
- Rarely in the hidden layers of LLMs, but occasionally in specialized spots like binary gates, and in older RNNs/LSTMs from the pre-transformer era (a toy gate sketch follows this list).
- In vision models (e.g., some CNNs and autoencoders), it still appears in output layers to squash predictions into (0, 1), such as per-pixel intensities or per-label probabilities, but not inside text-focused LLMs.
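For completeness, here is a toy NumPy sketch of the gating role mentioned above (hypothetical values in the spirit of an LSTM gate, not a component of any LLM): the sigmoid acts as a soft switch in (0, 1) that scales how much of each candidate value passes through.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

candidate = np.array([0.8, -1.2, 0.3])   # hypothetical values to (partially) pass through
gate_score = np.array([4.0, 0.0, -4.0])  # hypothetical gate pre-activations

gate = sigmoid(gate_score)               # ~[0.982, 0.5, 0.018]: a soft per-element switch
print(gate * candidate)                  # ~[0.786, -0.6, 0.005]: mostly open, half open, mostly closed
```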
In summary, sigmoid’s 1980s-era elegance doesn’t carry over to the depth and scale of modern LLMs. ReLU, GELU, and Swish provide faster, more reliable training with fewer pitfalls. If you’re tweaking a model, start with GELU for transformer compatibility.