Recurrent Transformer with Adaptive Depth

Generated by AI, 2026.06

Question: What is Hippocrene/recurrent-transformer-0.1b-e0.4b on HuggingFace?

Answer:

This is a clever research model that combines three classic ideas into one tight package. Here’s the breakdown:

Core Idea: Weight-Tied Recurrent Depth

Instead of stacking N different transformer layers (like GPT-2 does), this model has one shared transformer block that gets applied recurrently up to 8 times per token. Think of it as:

# Normal transformer (GPT-2 style)
x = layer_1(x)
x = layer_2(x)
...
x = layer_N(x)  # N distinct weight sets

# This model (Universal Transformer style), shown for a single token
for step in range(max_steps):   # max_steps = 8
    x = shared_block(x)         # same weights every step
    if halting_head(x) > threshold:
        break                   # ACT: this token stops early
# In a batch, halted tokens are typically masked out rather than breaking the loop.

Architecture Details

Property	Value
Stored params	110M
Compute-active block	~46M
Max recurrence steps	8
Effective compute	~0.4B (~46M × 8 steps ≈ 368M; hence `e0.4b` in the name)
Hidden dim	1280
Attention	20 heads × 64 head-dim
FFN	SwiGLU, 10240
Context	Transformer-XL memory, up to 4096 tokens
Tokenizer	GPT-2 (vocab 50257)
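The numbers in the table can be cross-checked with a little arithmetic. The sketch below assumes a block with four d×d attention projections and a three-matrix SwiGLU FFN whose inner width is 10240, no biases, and a token embedding tied to the output head. None of these details is confirmed here, and norm and halting-head parameters are ignored as negligible.

```python
# Back-of-the-envelope parameter count from the table above.
# Assumptions (not confirmed by the source): no biases, tied input/output
# embeddings, SwiGLU with three weight matrices, 10240 = FFN inner width.
d_model = 1280
n_heads, head_dim = 20, 64
d_ff = 10240
vocab = 50257
max_steps = 8

assert n_heads * head_dim == d_model      # 20 x 64 = 1280

attn_params = 4 * d_model * d_model       # Q, K, V, and output projections
ffn_params = 3 * d_model * d_ff           # SwiGLU: gate, up, down
block_params = attn_params + ffn_params   # the single shared block
embed_params = vocab * d_model            # token embedding (tied with the head)

stored = block_params + embed_params
effective = block_params * max_steps      # block applied up to 8 times

print(f"shared block : {block_params / 1e6:.1f}M")
print(f"stored total : {stored / 1e6:.1f}M")
print(f"effective    : {effective / 1e9:.2f}B")
```

Under these assumptions the shared block comes to about 45.9M parameters and the stored total to about 110.2M, in line with the table. Eight applications of the block give roughly 0.37B, which rounds to the ~0.4B in the name.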

Three Papers Fused

Universal Transformer (Dehghani et al. 2018) — shared weights across depth
Adaptive Computation Time / ACT (Graves 2016) — per-token halting head; harder tokens get more recurrence steps
Transformer-XL (Dai et al. 2019) — segment-level memory for long context
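To make the Transformer-XL piece concrete, here is a minimal sketch of segment-level memory: each segment attends over its own tokens plus a bounded cache of states from earlier segments. The names `process_segments`, `seg_len`, and `mem_len` are hypothetical, plain tokens stand in for hidden states, and relative positional encoding is left out. It illustrates the general technique, not this model's code.

```python
from collections import deque

def process_segments(tokens, seg_len, mem_len):
    """Yield (memory, segment) pairs as a Transformer-XL style model sees them."""
    memory = deque(maxlen=mem_len)  # oldest cached states fall off the left
    for start in range(0, len(tokens), seg_len):
        segment = tokens[start:start + seg_len]
        # Keys/values would come from memory + segment;
        # queries come only from the current segment.
        yield list(memory), segment
        # Cache this segment for later ones. In a real model these are
        # hidden states, detached so no gradient flows into the past.
        memory.extend(segment)

for mem, seg in process_segments(list(range(10)), seg_len=4, mem_len=6):
    print("memory:", mem, "segment:", seg)
```

Because the cache is bounded, the cost per segment stays fixed while the effective context reaches back `mem_len` positions beyond the current segment.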

The ACT piece is particularly interesting: on code the model averages ~6.3 recurrence steps per token, while hard prose uses ~8, close to the maximum. That spread suggests the halting head is allocating compute by difficulty rather than always running to the cap.
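For reference, the halting rule from the ACT paper can be sketched for a single token as follows. Whether this model uses exactly this rule is not stated here; the function name `act_halt` and the `eps` default are illustrative assumptions.

```python
def act_halt(halt_probs, eps=0.01):
    """Return (steps_used, weights) for one token under Graves-style ACT.

    halt_probs: per-step halting probabilities from the halting head.
    The token stops at the first step n where h_1 + ... + h_n >= 1 - eps,
    or at the last available step. The final step receives the remainder,
    so the weights always sum to 1.
    """
    weights, total = [], 0.0
    for n, h in enumerate(halt_probs, start=1):
        last_allowed = n == len(halt_probs)
        if total + h >= 1.0 - eps or last_allowed:
            weights.append(1.0 - total)  # remainder
            return n, weights
        weights.append(h)
        total += h
    raise ValueError("halt_probs must be non-empty")

easy_steps, easy_w = act_halt([0.6, 0.5, 0.5, 0.5])   # confident early
hard_steps, hard_w = act_halt([0.05] * 8)             # never confident
print(easy_steps, hard_steps)
```

In the ACT formulation the token's output is the weighted mean of its intermediate states under these weights, and averaging `steps_used` over a corpus gives per-domain step counts of the kind quoted above.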

Training Stack

Pretrained on 8.7B tokens, 13-corpus mix (FineWeb, code, books, arXiv, wiki, dialogue, Pile)
Trained in native FP8 on a single RTX PRO 6000 Blackwell
Long-context curriculum: 512 → 1024 → 2048 → 4096 tokens
SFT on 707M tokens (Dolly, Alpaca, OpenOrca, OpenHermes)
DPO on UltraFeedback

Performance

FineWeb PPL: 25.4 (vs 35.8 for a 6-layer baseline)
Zero-shot macro-averaged accuracy: 0.41 across LAMBADA, HellaSwag, ARC-Easy, and Winogrande
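Perplexity figures are easier to compare as per-token loss. The snippet below assumes the usual definition, PPL = exp(mean per-token cross-entropy in nats), and that both models were scored with the same tokenizer; neither is confirmed here.

```python
import math

def ppl_to_nats(ppl):
    """Per-token cross-entropy in nats for a given perplexity."""
    return math.log(ppl)

recurrent = ppl_to_nats(25.4)   # recurrent model, FineWeb
baseline = ppl_to_nats(35.8)    # 6-layer baseline, FineWeb
gap = baseline - recurrent

print(f"recurrent: {recurrent:.3f} nats/token")
print(f"baseline : {baseline:.3f} nats/token")
print(f"gap      : {gap:.3f} nats/token")
```

Under that definition the gap is about 0.34 nats per token in favor of the recurrent model.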

Why It’s Interesting to You

Given your nanoGPT/nanochat work and interest in MoE and transformer internals, this is a different axis of model efficiency: depth-adaptive compute rather than width-sparse compute (MoE). Reusing the same weights up to 8× buys parameter efficiency through recurrence, not through sparsity.

The halting head is just a small linear probe on the hidden state — trivially implementable. If you wanted to experiment with this pattern in nanochat, it’s maybe 50 extra lines on top of a standard GPT block.
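As a rough illustration of how little code the pattern needs, here is a dependency-free toy version: a linear-probe halting head plus a loop that freezes each token once it halts. Everything here (`HaltingHead`, `recur`, the 0.5 threshold, the `tanh` stand-in block) is hypothetical and follows the simple threshold rule from the pseudocode earlier, not the model's actual implementation. A real block would also mix information across tokens through attention.

```python
import math
import random

class HaltingHead:
    """Linear probe: sigmoid(w . h + b) gives a halting score in (0, 1)."""
    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        self.b = 0.0

    def __call__(self, h):
        z = sum(wi * hi for wi, hi in zip(self.w, h)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

def recur(states, shared_block, head, max_steps=8, threshold=0.5):
    """Apply one shared block repeatedly; freeze each token once it halts."""
    states = [list(h) for h in states]
    steps = [0] * len(states)
    active = [True] * len(states)
    for _ in range(max_steps):
        if not any(active):
            break                      # every token has halted
        for i, h in enumerate(states):
            if not active[i]:
                continue               # halted tokens keep their state
            states[i] = shared_block(h)
            steps[i] += 1
            if head(states[i]) > threshold:
                active[i] = False
    return states, steps

def toy_block(h):
    # Stand-in for the shared transformer block (per-token, bounded output).
    return [math.tanh(v + 0.1) for v in h]

tokens = [[0.1, 0.2], [-0.3, 0.4], [0.9, -0.7]]
_, steps = recur(tokens, toy_block, HaltingHead(dim=2))
print("steps per token:", steps)
```

In a tensor framework the inner loop would become a boolean mask over the batch, so halted positions are skipped or overwritten with their previous state instead of being visited one by one.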

It’s a research/demo model, not production-grade — but the architecture ideas are solid and underexplored compared to the MoE wave.
