Keep nanochat, 37% MFU is fine | Generated by AI

2026.06

Question: Should I switch back to nanoGPT instead of nanochat since MFU is only ~37% on MI300X?

Answer:

No, switching back to nanoGPT would almost certainly make things worse for your current setup.

What you’re seeing (≈37% MFU on MI300X with FlashAttention-2 (FA2) + FP8) is normal to good for large-scale transformer training, especially with long sequences (2048–4096) and modern kernels. You are already in the “highly optimized but hardware-limited” regime rather than the “framework-limited” one.

Why nanochat is the better choice here

nanochat (your current stack) already gives you:

nanoGPT, by contrast:

So even if nanoGPT feels “simpler,” it is not the performance path for your hardware.


About the 37% MFU number

On MI300X, MFU is often misunderstood. A few important realities:

So 37% MFU is not “bad” here. It just means the run does not fully saturate the GPU’s theoretical peak FLOPs, which is almost never achievable in real LLM training.
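To make the number concrete, here is a minimal sketch of how MFU is commonly estimated: achieved model FLOPs per second divided by the hardware peak. It uses the widely used approximation of 6 FLOPs per parameter per token plus an attention term. The model shape, the throughput, and the peak value below are illustrative assumptions, not measurements from your run; in particular, the peak depends on which precision you measure against (the dense BF16 figure used here is an assumption to verify on AMD's spec sheet, and measuring against the FP8 peak gives a lower MFU for the same throughput).

```python
def flops_per_token(n_params: int, n_layers: int, d_model: int, seq_len: int) -> float:
    """Approximate training FLOPs per token (forward + backward).

    6 * n_params covers the weight matmuls; 12 * n_layers * d_model * seq_len
    is the usual estimate for the attention score/value matmuls.
    """
    return 6 * n_params + 12 * n_layers * d_model * seq_len


def mfu(tokens_per_sec: float, flops_per_tok: float, peak_flops_per_sec: float) -> float:
    """Model FLOPs utilization: achieved FLOPs/s over theoretical peak FLOPs/s."""
    return tokens_per_sec * flops_per_tok / peak_flops_per_sec


# Hypothetical 1B-parameter model; every value here is an assumption.
example_flops = flops_per_token(n_params=1_000_000_000, n_layers=24, d_model=2048, seq_len=2048)
example_mfu = mfu(
    tokens_per_sec=66_000,          # assumed measured throughput on one GPU
    flops_per_tok=example_flops,
    peak_flops_per_sec=1307.4e12,   # assumed dense BF16 peak for MI300X
)
print(f"MFU ≈ {example_mfu:.1%}")
```

Because the denominator is a datasheet peak, the same throughput can look like very different MFU values depending on which peak you divide by, so compare runs against a fixed reference.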


Why your config is already close to optimal

Your best run:

This is already a sweet spot between throughput and stability. Increasing MFU further usually requires trade-offs.


If you want to push beyond 37% MFU (real options)

Instead of switching frameworks, you’d get more by tuning:

1. Increase compute density per step
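As a rough illustration, assuming “compute density” here means doing more work per kernel launch: for a fixed token budget per optimizer step, you can trade gradient-accumulation steps for a larger micro-batch, which gives larger matmuls as long as memory allows. The helper below is hypothetical (it is not part of nanochat) and only enumerates the splits that keep the tokens per step unchanged.

```python
from typing import List, Tuple


def accumulation_splits(tokens_per_step: int, seq_len: int, max_micro_batch: int) -> List[Tuple[int, int]]:
    """List (micro_batch, grad_accum_steps) pairs with the same tokens per step.

    Assumes a single device; with data parallelism, divide tokens_per_step
    by the number of devices first.
    """
    if tokens_per_step % seq_len != 0:
        raise ValueError("tokens_per_step must be a multiple of seq_len")
    sequences = tokens_per_step // seq_len
    return [
        (mb, sequences // mb)
        for mb in range(1, max_micro_batch + 1)
        if sequences % mb == 0
    ]


# Illustrative values only: 262,144 tokens per step at sequence length 2048.
splits = accumulation_splits(tokens_per_step=262_144, seq_len=2048, max_micro_batch=32)
for micro_batch, accum in splits:
    print(f"micro_batch={micro_batch:>2}  grad_accum={accum:>3}")
```

Moving down that list keeps the optimization behavior the same in principle while raising per-step arithmetic intensity, until you run out of memory.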

2. Reduce overhead gaps

3. Kernel + attention tuning

4. Sequence packing (big one)

If your data allows it:
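A minimal sketch of the idea, assuming documents are already tokenized into lists of integer ids: concatenate them with an end-of-sequence separator and cut the stream into fixed-length rows, so no compute is spent on padding. The function name and the `eos_id` value are illustrative, not nanochat’s API, and this simple version does not mask attention across document boundaries.

```python
from typing import Iterable, Iterator, List


def pack_sequences(docs: Iterable[List[int]], seq_len: int, eos_id: int) -> Iterator[List[int]]:
    """Yield rows of exactly seq_len tokens built from concatenated documents.

    Each document is followed by eos_id. A trailing partial row is dropped
    rather than padded.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


# Toy example with eos_id=0 and a very short sequence length.
packed = list(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=4, eos_id=0))
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Every row is full, so each token processed contributes to the loss, which is where the throughput gain over padded batches comes from.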

5. Compilation / graph capture


Bottom line

If you want, I can help you push this MI300X setup toward the ~45–50% MFU range by systematically tuning batch size, sequence length, and gradient accumulation, and by identifying the real bottleneck.

