LSTM Implementation Bugs and Fixes | Generated by AI


Got it—here’s a clear, end-to-end walkthrough of what this LSTM-from-scratch script is doing, plus the subtle bugs that will break training and how to fix them. I’ll keep it linear and easy to scan (no wide tables).

What the code tries to do

You’re implementing a minimal character/word-level LSTM language model in NumPy:

  1. Setup & sizes
  2. Parameter initialization
  3. Forward pass (per sequence): compute the gates and new states at each time step (see the sketch just after this list)
  4. Loss
  5. Backward pass (BPTT through LSTM)
  6. Training loop
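
For reference, one forward time step of such a model typically looks like the sketch below. This is a minimal sketch, not your exact code: it assumes NumPy as np, small sigmoid/softmax helpers, and weight/bias names (W_f, b_f, …, W_v, b_v) and a stacked input z matching the variables used in the backprop later in this post.

def lstm_step(x_t, h_prev, C_prev):
    z_t = np.vstack((h_prev, x_t))     # stacked input [h_{t-1}; x_t]
    f_t = sigmoid(W_f @ z_t + b_f)     # forget gate
    i_t = sigmoid(W_i @ z_t + b_i)     # input gate
    g_t = np.tanh(W_g @ z_t + b_g)     # candidate cell state
    o_t = sigmoid(W_o @ z_t + b_o)     # output gate
    C_t = f_t * C_prev + i_t * g_t     # new cell state
    h_t = o_t * np.tanh(C_t)           # new hidden state
    y_t = softmax(W_v @ h_t + b_v)     # output distribution over the vocabulary
    return z_t, f_t, i_t, g_t, o_t, C_t, h_t, y_t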

Critical issues (these will silently break learning)

1) Off-by-one indexing for h and C in backprop

In forward, you push the initial states first:
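A representative sketch of that pattern (using the hypothetical lstm_step from the overview above; your forward's exact names may differ):

h = [np.zeros((hidden_size, 1))]   # h[0] is the initial hidden state
C = [np.zeros((hidden_size, 1))]   # C[0] is the initial cell state
for t in range(T):
    z_t, f_t, i_t, g_t, o_t, C_t, h_t, y_t = lstm_step(inputs[t], h[-1], C[-1])
    h.append(h_t)                  # so h[t+1] is the state produced at step t
    C.append(C_t)                  # and h[t] / C[t] are the states entering step t
    # (similarly store z_t, f_t, i_t, g_t, o_t, y_t per step for the backward pass)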

But in backward(...) you use h[t] and C[t] as if they were h_t and C_t. They aren’t; they’re shifted by one.

Fix (simple rule of thumb): because the initial states are pushed onto the lists first, index t holds the state entering step t. So h[t] / C[t] are the previous states, and h[t+1] / C[t+1] are the current ones.

So inside the for t in reversed(range(T)): loop:
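h_t    = h[t+1]    # current hidden state
C_t    = C[t+1]    # current cell state
h_prev = h[t]      # previous hidden state
C_prev = C[t]      # previous cell state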

Your current line:

C_prev = C[t - 1]

is wrong for t==0 (wraps to the last element) and off by one in general. It must be:

C_prev = C[t]       # previous cell state
# and use C_t = C[t+1] as "current"

And anywhere you use h[t] intending the current hidden state, change to h[t+1].

2) Wrong derivatives for several gates

You sometimes apply the nonlinearity again instead of its derivative, or forget the derivative flag.
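Concretely, since the forward stores the gate outputs (post-nonlinearity), the derivatives can be read straight off those stored values. A small sketch of the wrong-vs-right pattern, using the output gate and the candidate as examples (names follow the corrected loop below; sigmoid is your helper):

do = dh * np.tanh(C_t) * sigmoid(o[t])       # WRONG: squashes an already-squashed value
do = dh * np.tanh(C_t) * o[t] * (1 - o[t])   # RIGHT: sigmoid'(pre) = s * (1 - s), and s is what o[t] holds

dg = dC * i[t] * np.tanh(g[t])               # WRONG
dg = dC * i[t] * (1 - g[t]**2)               # RIGHT: tanh'(pre) = 1 - u**2, and u is what g[t] holds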

3) Using h[0] / C[0] to size dh_next and dC_next

You want shapes of the current h/C (end-of-sequence), not the initial zeros. Use:

dh_next = np.zeros_like(h[-1])
dC_next = np.zeros_like(C[-1])

4) Cross-entropy numerical stability

loss += -np.mean(np.log(outputs[t]) * targets[t]) has two problems: np.mean averages over the whole vocabulary instead of summing over classes (which just rescales the loss), and np.log blows up if any probability underflows to zero. Unless your softmax clamps/epsilons internally, sum over classes and add a small epsilon:

eps = 1e-12
loss += -np.sum(targets[t] * np.log(outputs[t] + eps))

5) Training stability tweaks

  • Initialize the forget-gate bias b_f to ~1.0 so the network starts out remembering by default.
  • Drop the learning rate to 1e-2, or switch to Adam.
  • Clip gradients (for example to ±5) before each update so BPTT doesn't explode.

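A minimal sketch of those tweaks (the params/grads lists here are hypothetical; adapt to however you store parameters and gradients):

b_f = np.ones((hidden_size, 1))          # forget gate biased open at initialization
learning_rate = 1e-2

# inside the training loop, after backward(...):
for param, grad in zip(params, grads):   # hypothetical lists of weights and their gradients
    np.clip(grad, -5, 5, out=grad)       # clip exploding BPTT gradients in place
    param -= learning_rate * grad        # plain SGD step; Adam is a drop-in alternative
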
A corrected core for the backprop loop

This sketch shows the indexing and derivative fixes. It assumes f[t], i[t], g[t], o[t] are the outputs of their nonlinearities (as in your forward), z[t] is the stacked [h_{t-1}; x_t] input, and the gradient accumulators (W_f_d, b_f_d, …) are zeroed before the loop:

T = len(outputs)
dh_next = np.zeros_like(h[-1])
dC_next = np.zeros_like(C[-1])

for t in reversed(range(T)):
    # Short names with correct indexing
    y_hat = outputs[t]                 # softmax output
    y_true = targets[t]
    h_t   = h[t+1]
    C_t   = C[t+1]
    C_tm1 = C[t]

    # Output layer
    dv = y_hat.copy()
    dv[np.argmax(y_true)] -= 1         # softmax + cross-entropy grad: y_hat - y_true for a one-hot target
    W_v_d += dv @ h_t.T
    b_v_d += dv

    # Backprop into h
    dh = W_v.T @ dv
    dh += dh_next

    # Output gate
    do = dh * np.tanh(C_t)
    do *= o[t] * (1 - o[t])
    W_o_d += do @ z[t].T
    b_o_d += do

    # Cell state
    dC = dC_next + dh * o[t] * (1 - np.tanh(C_t)**2)

    # Candidate (g), Input (i), Forget (f)
    dg = dC * i[t]
    dg *= (1 - g[t]**2)               # since g[t] = tanh(preact)

    di = dC * g[t]
    di *= i[t] * (1 - i[t])

    df = dC * C_tm1
    df *= f[t] * (1 - f[t])

    # Accumulate gate weight/bias grads
    W_g_d += dg @ z[t].T;  b_g_d += dg
    W_i_d += di @ z[t].T;  b_i_d += di
    W_f_d += df @ z[t].T;  b_f_d += df

    # Propagate to z and split to h_{t-1} part
    dz = (W_f.T @ df) + (W_i.T @ di) + (W_g.T @ dg) + (W_o.T @ do)
    dh_next = dz[:hidden_size, :]
    dC_next = dC * f[t]

Smaller but important details & recommendations


TL;DR of the fixes you should make right now

  1. Indexing: in backprop, use h[t+1] and C[t+1] wherever you mean the current states, and C_prev = C[t] for the previous cell state.
  2. Derivatives:

    • Use (1 - tanh(C_t)**2) for the cell state path.
    • Use gate * (1 - gate) for the σ-gates since you stored gate outputs.
    • Use (1 - g[t]**2) for the candidate gate since g[t]=tanh(preact).
  3. Init: Set b_f to ~1.0.
  4. LR: Drop to 1e-2 (or switch to Adam).
  5. Stability: Add eps in log.

Make those changes and your loss should start decreasing in a much more sensible way. If you want, paste your updated backward(...) and I’ll sanity-check it.

