LSTMs Were Just Rotated ResNets | Generated by AI


In his NeurIPS 2024 Test of Time award talk on the 2014 “Sequence to Sequence Learning with Neural Networks” paper, Ilya Sutskever reflected on key insights and missteps from that era. One major point he addressed under “what we got wrong” was the overcomplication and eventual limitations of LSTMs (Long Short-Term Memory networks), which powered early sequence modeling breakthroughs like machine translation.

The Core Misconception About LSTMs

We treated LSTMs as a fundamentally novel, intricate architecture tailored specifically for sequential data, something “special” that deep learning researchers had to painstakingly engineer to handle time dependencies, vanishing gradients, and recurrence. In reality, Sutskever explained, LSTMs were far simpler than that: essentially a ResNet (Residual Network) rotated 90 degrees, with the residual stream running along the time axis rather than the depth axis.

Sutskever quipped: “To those unfamiliar, an LSTM is something that poor deep learning researchers did before Transformers. It’s basically a ResNet but rotated by 90 degrees… And it came before; it’s like a slightly more complex ResNet, with an integrator and some multiplications.” This analogy underscores that LSTMs weren’t a radical departure; they were an early, elegant application of residual ideas to recurrence.
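To make the analogy concrete, here is a minimal NumPy sketch (the function names, shapes, and parameter packing are illustrative assumptions, not taken from the talk or the paper). A ResNet block adds a learned update to its input along the depth axis, x_{l+1} = x_l + F(x_l); the LSTM cell state adds a gated update along the time axis, c_t = f_t * c_{t-1} + i_t * g_t. The additive cell-state path is the “integrator” Sutskever mentions, and the gates are the “multiplications.”

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# --- ResNet: residual update along the *depth* axis ---
# x_{l+1} = x_l + F(x_l)
def resnet_block(x, W1, W2):
    """One simplified residual block (no normalization): identity skip plus a learned update."""
    update = W2 @ np.tanh(W1 @ x)   # F(x): a small two-layer transform
    return x + update               # the skip connection keeps gradients flowing through depth

# --- LSTM: gated residual ("integrator") update along the *time* axis ---
# c_t = f_t * c_{t-1} + i_t * g_t
def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b pack the parameters of the four gates (input, forget, output, candidate)."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # the "multiplications": gates in [0, 1]
    g = np.tanh(g)                                  # candidate update, analogous to F(x)
    c_t = f * c_prev + i * g                        # the "integrator": additive carry over time
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Tiny shape check with hypothetical sizes
d, hdim = 8, 8
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
print(resnet_block(x, rng.standard_normal((d, d)), rng.standard_normal((d, d))).shape)
h, c = lstm_cell(x, np.zeros(hdim), np.zeros(hdim),
                 rng.standard_normal((4 * hdim, d)),
                 rng.standard_normal((4 * hdim, hdim)),
                 np.zeros(4 * hdim))
print(h.shape, c.shape)
```

Reading c_t = f_t * c_{t-1} + i_t * g_t next to x_{l+1} = x_l + F(x_l) is the whole point of the “rotation”: the same additive skip path, just laid out over time steps instead of layers.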

Why This Mattered (and What Went Wrong)

In hindsight, the “wrong” part wasn’t inventing LSTMs (they were a heroic interim solution) but clinging to them as the endgame for sequence modeling instead of recognizing residual connections as a general tool and moving toward more parallelizable architectures sooner.

This perspective reframes LSTMs not as a relic, but as a foundational step in understanding how skip connections make very deep computation trainable, whether that depth runs across layers, as in ResNets, or across time steps, as in LSTMs.

References

- Sequence to Sequence Learning with Neural Networks: What a Decade – Transcript & Video
- Ilya Sutskever NeurIPS 2024 full talk
- Reflections from Ilya’s Full Talk at NeurIPS 2024

