Hyperbolic Tangent Activation Explained


Home 2026.06

Question: What is tanh (hyperbolic tangent)?

Answer:

Tanh is the hyperbolic tangent function. In neural networks it serves as an activation function: a smooth, S-shaped nonlinearity that squashes any real input into $(-1, 1)$.

The Math

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{\sinh(x)}{\cosh(x)}\]

Or equivalently (and more stably): $\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 = 2 \cdot \sigma(2x) - 1$

where $\sigma$ is the sigmoid function. This identity is useful — tanh is just a scaled and shifted sigmoid.
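A quick numerical check of this identity, as a minimal sketch in NumPy. The helper names `sigmoid` and `tanh_naive` are illustrative and not part of any library; the last lines show why the textbook exponential form is fragile for large inputs.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid; adequate for the moderate inputs used below
    return 1.0 / (1.0 + np.exp(-x))

def tanh_naive(x):
    # Textbook definition; exp overflows to inf for large |x|
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# The identity tanh(x) = 2*sigmoid(2x) - 1 holds to floating-point precision
identity_ok = np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# For a large input the naive form evaluates inf/inf, which is nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = tanh_naive(np.array([1000.0]))[0]

library_large = np.tanh(1000.0)  # NumPy's tanh saturates cleanly
print(identity_ok, naive_large, library_large)
```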

Key Properties

Property	Value
Output range	$(-1, 1)$ (zero-centered)
Derivative	$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$
Gradient at x=0	$1$ (stronger than sigmoid’s $0.25$)
Symmetry	Odd function: $\tanh(-x) = -\tanh(x)$
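
The derivative row follows from the quotient rule and the identity $\cosh^2(x) - \sinh^2(x) = 1$. A short derivation, using only the definitions above:

\[\frac{d}{dx}\tanh(x) = \frac{d}{dx}\,\frac{\sinh(x)}{\cosh(x)} = \frac{\cosh^2(x) - \sinh^2(x)}{\cosh^2(x)} = 1 - \tanh^2(x)\]

Since $\tanh(0) = 0$, the derivative at $x = 0$ is $1$. For the sigmoid, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, which at $x = 0$ gives $0.5 \cdot 0.5 = 0.25$.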

Implementation

import numpy as np
import matplotlib.pyplot as plt

# Forward pass
def tanh(x):
    return np.tanh(x)

# Derivative (useful for backprop)
def tanh_derivative(x):
    return 1 - np.tanh(x)**2

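# Hand-written stable version, for illustration only (np.tanh is already stable).
# It only ever exponentiates non-positive values, so exp cannot overflow.
def tanh_stable(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(-2.0 * np.abs(x))  # always in (0, 1]
    return np.sign(x) * (1.0 - e) / (1.0 + e)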

# Visualize
x = np.linspace(-5, 5, 1000)
y = np.tanh(x)
dy = 1 - np.tanh(x)**2

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(x, y, label='tanh(x)', linewidth=2)
plt.axhline(0, color='k', linestyle='-', linewidth=0.5)
plt.axvline(0, color='k', linestyle='-', linewidth=0.5)
plt.grid(True, alpha=0.3)
plt.legend()
plt.title('tanh activation')
plt.xlabel('x')
plt.ylabel('tanh(x)')

plt.subplot(1, 2, 2)
plt.plot(x, dy, label="1 - tanh²(x)", linewidth=2, color='orange')
plt.grid(True, alpha=0.3)
plt.legend()
plt.title('tanh derivative')
plt.xlabel('x')
plt.ylabel('gradient')
plt.tight_layout()
plt.show()

Why It Matters in Deep Learning

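Zero-centered output: Unlike sigmoid (which outputs $(0, 1)$), tanh outputs $(-1, 1)$. Activations centered around zero keep the inputs to the next layer from being all positive, which would otherwise force the gradients of that layer’s incoming weights to share a sign and slow down optimization. A neuron can also send a strongly negative signal rather than only a value near zero.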

Stronger gradients: The derivative peaks at $\left.\frac{d}{dx}\tanh(x)\right|_{x=0} = 1$, vs sigmoid’s $0.25$. This helps mitigate vanishing gradients in deeper networks.

RNN/LSTM standard: Tanh is the default in LSTM and GRU cells because of these properties. The gate activations typically use sigmoid, but the candidate hidden states use tanh.
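Symmetric: Because tanh is odd and $\tanh(0) = 0$, pre-activations distributed symmetrically around zero (as produced by zero-mean weight initialization) give activations that are also symmetric around zero, so the network starts without a bias toward positive or negative values. Weights should not be set to exactly zero, since every unit in a layer would then compute the same output.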
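
To make the split between sigmoid gates and tanh candidates concrete, here is a minimal sketch of a single LSTM step in NumPy. It follows the standard LSTM equations; the function name `lstm_step`, the stacked weight layout `[i, f, o, g]`, and the sizes are assumptions for illustration and do not mirror any framework's internals.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,), stacked as [i, f, o, g]."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate, in (0, 1)
    f = sigmoid(z[H:2 * H])        # forget gate, in (0, 1)
    o = sigmoid(z[2 * H:3 * H])    # output gate, in (0, 1)
    g = np.tanh(z[3 * H:4 * H])    # candidate values, in (-1, 1)
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state, squashed by tanh again
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                        # hypothetical input and hidden sizes
W = rng.normal(scale=0.5, size=(4 * H, D))
U = rng.normal(scale=0.5, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)
```

The gates act as soft switches, so a $(0, 1)$ range suits them; the candidate and the hidden state carry signed information, which is where tanh's $(-1, 1)$ range is used.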

Comparison to ReLU

Tanh:

Smooth everywhere, differentiable
Bounded output (numerically stable)
Vanishing gradient problem if $|x|$ is large
Used in RNNs, LSTMs, older feedforward nets

ReLU:

Dead neuron problem (zero gradient for $x < 0$)
Unbounded output
No vanishing gradient for positive inputs
Modern default for feedforward/CNN, much faster
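
The gradient behaviour in these two lists can be checked directly. A minimal sketch in NumPy; `tanh_grad` and `relu_grad` are illustrative helper names, and the ReLU gradient at exactly $x = 0$ is taken as 0 by convention here.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.5, 10.0])
tg = tanh_grad(x)
rg = relu_grad(x)

print("tanh grad:", tg)  # tiny at both extremes (saturation), largest near 0
print("relu grad:", rg)  # exactly 0 for negative inputs, 1 for positive inputs
```

Tanh loses gradient on both tails, while ReLU loses it only on the negative side but loses it completely there.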

When to Use

LSTMs/GRUs: Tanh is the standard choice for the candidate and cell state updates
RNNs: Often better than sigmoid, though LSTMs and GRUs have mostly replaced vanilla RNNs
Older networks: In the pre-ReLU era (before roughly 2012), tanh was standard
Output layer for regression: If your target is in $(-1, 1)$ (after normalization), tanh output + MSE loss is natural
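
As a sketch of the regression case, the helpers below map targets linearly into $[-1, 1]$ and back, so a tanh output can be compared against them with MSE. The names `scale_to_unit` and `unscale` and all sample values are hypothetical. Note that tanh only approaches $\pm 1$, so targets at the extremes of the range are never matched exactly.

```python
import numpy as np

def scale_to_unit(y, y_min, y_max):
    # Linear map from [y_min, y_max] to [-1, 1]
    return 2.0 * (y - y_min) / (y_max - y_min) - 1.0

def unscale(t, y_min, y_max):
    # Inverse map from [-1, 1] back to the original units
    return (t + 1.0) / 2.0 * (y_max - y_min) + y_min

y = np.array([10.0, 25.0, 40.0])             # hypothetical raw targets
y_min, y_max = y.min(), y.max()
t = scale_to_unit(y, y_min, y_max)           # scaled targets

pre_activation = np.array([-2.0, 0.1, 2.0])  # hypothetical outputs before tanh
pred = np.tanh(pre_activation)
mse = np.mean((pred - t) ** 2)

# Gradient of the MSE with respect to the pre-activation, via the chain rule
grad = 2.0 * (pred - t) / t.size * (1.0 - pred ** 2)

print(t, mse, unscale(pred, y_min, y_max))
```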

PyTorch/TensorFlow

import torch
import torch.nn.functional as F

x = torch.randn(32, 64)
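y = torch.tanh(x)  # Preferred functional form
y = F.tanh(x)      # Same result; older PyTorch releases warn that it is deprecated in favor of torch.tanh

# As a module, e.g. inside nn.Sequential
act = torch.nn.Tanh()
y = act(x)

# nn.LSTM applies tanh internally to the candidate cell state and the cell output
lstm = torch.nn.LSTM(input_size=128, hidden_size=256)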

Bottom line: Tanh is a rescaled, zero-centered sigmoid. It is mathematically elegant and remains essential for recurrent architectures. For feedforward nets, ReLU won out because it is cheaper to compute and does not saturate for positive inputs. Tanh keeps the advantages of smoothness and bounded, zero-centered output, but its saturation for large $|x|$ makes very deep networks harder to train.
