GELU Activation Function Overview
What is GELU?
The Gaussian Error Linear Unit (GELU) is a smooth activation function used in neural networks, particularly in transformer-based models like BERT and GPT series. Introduced in 2016 by researchers Dan Hendrycks and Kevin Gimpel, it combines the benefits of linear transformations with probabilistic weighting inspired by the Gaussian (normal) distribution. Unlike simpler activations like ReLU (which outputs zero for negative inputs), GELU allows a small gradient flow for negative values, leading to better performance in deep networks by reducing issues like vanishing gradients and improving convergence.
Mathematical Definition
The core formula for GELU is:
\[ \text{GELU}(x) = x \cdot \Phi(x) \]
where \(\Phi(x)\) is the cumulative distribution function (CDF) of the standard normal distribution:
\[ \Phi(x) = \frac{1}{2} \left[ 1 + \operatorname{erf}\left( \frac{x}{\sqrt{2}} \right) \right] \]
Here, \(\operatorname{erf}\) is the Gauss error function, and \(\Phi(x)\) is the probability that a standard normal variable is less than \(x\). This gives GELU its probabilistic interpretation: each input is weighted by that probability, so large positive inputs pass through nearly unchanged while negative inputs are attenuated rather than hard-zeroed as in ReLU.
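As a concrete illustration, here is a minimal pure-Python sketch of the exact formula (the function name gelu_exact is chosen here for illustration, not taken from any library):

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Negative inputs are attenuated, not zeroed; positives pass almost unchanged.
print(gelu_exact(-1.0))  # ~ -0.1587
print(gelu_exact(2.0))   # ~  1.9545
```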
For computational efficiency, approximations are often used, such as the sigmoid-based version:
\[ \text{GELU}(x) \approx x \cdot \sigma(1.702x) \]
where \(\sigma\) is the sigmoid function.
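A quick sketch comparing the sigmoid-based approximation against the exact form (function names are illustrative; the 1.702 constant comes from the approximation above):

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))

# The approximation tracks the exact value closely over typical input ranges.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  approx={gelu_sigmoid(x):+.4f}")
```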
Why Use GELU?
- Smoothness: It’s differentiable everywhere, aiding optimization.
- Non-monotonicity: The curve dips slightly below zero for moderately negative inputs before flattening toward zero, so negative values are attenuated rather than zeroed out, preserving some information and gradient signal.
- Empirical Benefits: In the original paper it matched or outperformed ReLU and ELU across vision, NLP, and speech benchmarks, and it has since become the default activation in most transformer architectures.
- Drawbacks: Slightly more computationally expensive than ReLU due to the erf or sigmoid computation.
In practice, libraries like PyTorch and TensorFlow have built-in GELU implementations (e.g., torch.nn.GELU()).
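A hedged usage sketch with PyTorch (assuming a recent install; the approximate argument was added in newer versions and may be absent in older ones):

```python
import torch

gelu = torch.nn.GELU()  # exact (erf-based) variant by default
# gelu = torch.nn.GELU(approximate="tanh")  # tanh approximation, in recent PyTorch versions

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)
y = gelu(x)
y.sum().backward()

print(y)       # negative inputs are attenuated, not zeroed
print(x.grad)  # gradients are nonzero even for negative inputs, unlike ReLU
```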