GELU Activation Function Overview

What is GELU?

The Gaussian Error Linear Unit (GELU) is a smooth activation function used in neural networks, particularly in transformer-based models like BERT and the GPT series. Introduced in 2016 by Dan Hendrycks and Kevin Gimpel, it combines a linear transformation with a probabilistic weighting derived from the Gaussian (normal) distribution. Unlike simpler activations such as ReLU (which outputs zero for all negative inputs), GELU preserves a small, non-zero gradient for negative values, which in practice helps mitigate vanishing gradients and improves convergence in deep networks.

Mathematical Definition

The core formula for GELU is:

\[ \text{GELU}(x) = x \cdot \Phi(x) \]

where \(\Phi(x)\) is the cumulative distribution function (CDF) of the standard normal distribution:

\[ \Phi(x) = \frac{1}{2} \left[ 1 + \operatorname{erf}\!\left( \frac{x}{\sqrt{2}} \right) \right] \]

Here, \(\operatorname{erf}\) is the error function, \(\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt\). This gives GELU a probabilistic interpretation: each input is weighted by the probability that a standard normal variable falls below it, so negative inputs are attenuated rather than zeroed out as in ReLU, while positive inputs pass through nearly unchanged.
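
To make the exact form concrete, here is a minimal sketch in plain Python using the standard library's math.erf; the function name gelu_exact is purely illustrative, not a library API:

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x multiplied by the standard normal CDF Phi(x)."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

# Negative inputs are attenuated rather than zeroed; positive inputs pass nearly unchanged.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"GELU({x:+.1f}) = {gelu_exact(x):+.4f}")
```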

For computational efficiency, approximations are often used, such as the sigmoid-based version:

\[ \text{GELU}(x) \approx x \cdot \sigma(1.702x) \]

where \(\sigma(x) = \frac{1}{1 + e^{-x}}\) is the sigmoid function.
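
To see how closely the sigmoid approximation tracks the exact formula, here is a small comparison sketch in plain Python (the helper names are illustrative only):

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi written via the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid(x: float) -> float:
    # Sigmoid approximation: x * sigma(1.702 * x).
    return x / (1.0 + math.exp(-1.702 * x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    exact, approx = gelu_exact(x), gelu_sigmoid(x)
    print(f"x={x:+.1f}  exact={exact:+.5f}  sigmoid={approx:+.5f}  |diff|={abs(exact - approx):.5f}")
```

Over this range the approximation stays within a couple of hundredths of the exact value, which is why it is sometimes preferred when evaluating the error function is comparatively expensive.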

Why Use GELU?

GELU's smooth, non-zero response for negative inputs tends to train more stably than ReLU in deep networks, which is one reason it became the default activation in transformer models such as BERT and GPT. There is also little implementation cost: libraries like PyTorch and TensorFlow have built-in GELU implementations (e.g., torch.nn.GELU()).
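
A brief usage sketch with PyTorch's built-in module (assuming a reasonably recent PyTorch; the approximate argument was added in later releases):

```python
import torch
import torch.nn as nn

x = torch.linspace(-3.0, 3.0, steps=7)

gelu_exact = nn.GELU()                    # erf-based formula (the default)
gelu_tanh = nn.GELU(approximate='tanh')   # faster tanh-based approximation

print(gelu_exact(x))
print(gelu_tanh(x))

# The functional form is equivalent and convenient inside forward() methods.
print(torch.nn.functional.gelu(x))
```

Inside a model, nn.GELU() typically sits between the linear layers of a transformer's feed-forward block, in the position where ReLU would otherwise be used.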
