MFU 37%: Don't Switch Back | Generated by AI

2026.06

Question: Should I switch back to nanoGPT from nanochat, given that the Model FLOPs Utilization (MFU) on the AMD MI300X is only 37%?

Answer: Switching back to the original nanoGPT will likely not solve your low MFU issue, and it might actually make performance worse.

An MFU of 37% on an AMD MI300X for a relatively small model (~760M parameters, based on your run name) is fairly typical for this setup. Moving back to vanilla nanoGPT will likely reduce your throughput, because it lacks the out-of-the-box optimizations needed to make AMD hardware perform well.
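To make the 37% figure concrete, here is a minimal sketch of how MFU is commonly estimated. It uses the 6 × parameters FLOPs-per-token approximation and ignores attention FLOPs, so it will not match nanochat's own counter exactly. The throughput value is hypothetical, and the peak figure is an assumed dense BF16 number for the MI300X that you should replace with the value from AMD's datasheet.

```python
def estimate_mfu(n_params: float, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
    """Approximate MFU with the common 6 * N FLOPs-per-token rule.

    6 * N covers forward plus backward matmul FLOPs per token and
    ignores attention, so treat the result as a rough estimate.
    """
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec


N_PARAMS = 760e6           # model size mentioned in the answer
TOKENS_PER_SEC = 105_000   # hypothetical throughput, read yours from the training log
PEAK_FLOPS = 1.3e15        # assumed dense BF16 peak for one MI300X; check the datasheet

mfu = estimate_mfu(N_PARAMS, TOKENS_PER_SEC, PEAK_FLOPS)
print(f"estimated MFU: {mfu:.1%}")
```

Plugging in your own tokens-per-second number is a quick sanity check that the MFU reported by the training script uses the peak figure you expect.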

Here is a breakdown of why your MFU is at 37% and how you can actually increase it without abandoning your current codebase.

Why Your MFU is 37% (And Why nanoGPT Won’t Fix It)


How to Actually Push MFU Higher in nanochat

Instead of switching codebases, you can maximize your MI300X hardware utilization by adjusting your training strategy:

1. Increase the Matrix Size (Kernel Saturation)

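To get higher MFU, you need to give the GPU larger matrices to crunch so it spends more time computing and less time moving data. In practice that usually means raising the per-device batch size or the sequence length as far as memory allows, so that each matrix multiplication covers more tokens.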
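A rough way to see why larger matrices help is to compare FLOPs against bytes moved for a single matrix multiplication. The sketch below assumes an idealized GEMM that reads each operand once and writes the output once in a 2-byte dtype such as BF16. The shapes are illustrative, not taken from nanochat.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n] under an idealized memory model."""
    flops = 2 * m * n * k                                  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved


# m is the number of tokens in the batch (batch size * sequence length);
# n and k stand in for a hypothetical 4096-wide layer.
for tokens in (1024, 4096, 16384):
    intensity = gemm_arithmetic_intensity(tokens, 4096, 4096)
    print(f"tokens={tokens:6d}  FLOPs per byte={intensity:8.1f}")
```

The ratio grows with the token dimension, which is why a larger per-device batch tends to move the workload away from being memory-bound.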

2. Use PyTorch Compilation (torch.compile)

If you aren’t already using it, torch.compile is usually the most effective tool against kernel launch overhead. It fuses many small operations into fewer kernels, which reduces the time the MI300X spends idling between launches. Check if your scripts.base_train script supports a --compile flag.
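The effect of fusion can be sketched with a deliberately pessimistic model in which every kernel launch costs a fixed overhead that is not hidden behind GPU work. All numbers below are hypothetical, and real launches often overlap with compute, so this only illustrates the direction of the effect. In PyTorch itself, the call that enables compilation is `torch.compile(model)`.

```python
def busy_fraction(n_kernels: int, launch_overhead_us: float, compute_us: float) -> float:
    """Fraction of a training step spent computing when launches are not overlapped."""
    step_us = n_kernels * launch_overhead_us + compute_us
    return compute_us / step_us


COMPUTE_US = 40_000.0      # hypothetical pure-compute time per step
LAUNCH_OVERHEAD_US = 5.0   # hypothetical cost per kernel launch

eager = busy_fraction(2_000, LAUNCH_OVERHEAD_US, COMPUTE_US)  # many small kernels
fused = busy_fraction(400, LAUNCH_OVERHEAD_US, COMPUTE_US)    # fewer, fused kernels

print(f"eager: {eager:.1%} busy, fused: {fused:.1%} busy")
```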

3. Maximize FP8 Matrix Core Usage

Ensure that your network dimensions (embedding dimension, intermediate hidden dimension, vocabulary size) are all multiples of 64 or 128. The FP8 matrix units on AMD hardware reach peak throughput only on well-aligned shapes. With unaligned dimensions, the GEMM library typically has to pad the operands or pick a less efficient kernel, which drags down your MFU.
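A quick way to audit a configuration is to round each dimension up to the next multiple and see which ones change. This is a minimal sketch: the config keys and values are hypothetical placeholders, not nanochat's actual argument names.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    return ((value + multiple - 1) // multiple) * multiple


# Hypothetical model dimensions; substitute the ones from your own config.
dims = {"n_embd": 1536, "ffn_hidden": 6144, "vocab_size": 50257}

for name, value in dims.items():
    aligned = round_up_to_multiple(value, 128)
    status = "ok" if aligned == value else f"pad to {aligned}"
    print(f"{name:>10}: {value:6d}  {status}")
```

Padding the vocabulary only adds a few unused embedding rows, so it is usually the cheapest dimension to align.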

