H200 Training Speed and Optimization Guide | Generated by AI

2026.04

The reason it “feels” slow is a matter of reporting cadence, not actual performance. On an H200, your training is running at very high speed.

Here is the breakdown of why you see 21 seconds per iteration and why that is actually a good number:

1. The Math of “One Iteration”

In your nanoGPT config, an “iteration” is not one forward/backward pass. It is the sum of all Gradient Accumulation Steps.

You are processing over 1 million tokens every time a single line prints to your screen.
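The arithmetic can be sketched as follows. The `batch_size` and `block_size` values here are illustrative assumptions (substitute the numbers from your own config); only the 64 accumulation steps come from the discussion above:

```python
# Hypothetical nanoGPT-style settings -- your config values may differ.
batch_size = 16           # micro-batch size per forward/backward pass (assumed)
block_size = 1024         # context length in tokens (assumed)
grad_accum_steps = 64     # gradient accumulation steps per logged "iteration"

tokens_per_micro_step = batch_size * block_size
tokens_per_iteration = tokens_per_micro_step * grad_accum_steps

print(tokens_per_micro_step)   # 16384
print(tokens_per_iteration)    # 1048576 -> just over 1M tokens per log line
```

With these assumed values, each printed iteration covers 16 × 1024 × 64 = 1,048,576 tokens, which is consistent with the “over 1 million tokens” figure.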

2. Micro-step Performance

If 1 iteration (64 steps) takes 21 seconds: \(21,000\text{ms} / 64\text{ steps} \approx 328\text{ms per step}\)

For a 1.5B parameter model, 328ms per step is world-class performance. Your nvidia-smi showing 689W power draw confirms the chip is working at its absolute limit. You aren’t wasting time; you are just processing a massive amount of data before each log entry.
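The per-micro-step figure quoted above is just the iteration time divided by the accumulation steps:

```python
iteration_time_ms = 21_000    # observed wall-clock time for one logged iteration
grad_accum_steps = 64         # micro-steps folded into that iteration

ms_per_micro_step = iteration_time_ms / grad_accum_steps
print(round(ms_per_micro_step, 1))  # 328.1
```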

3. Why MFU was 164%

The mfu (Model FLOPs Utilization) reading over 100% is a known quirk in nanoGPT right after the torch.compile phase finishes, and it usually stabilizes after iter 100. Given your power draw, you are likely hitting 45-55% real MFU, which is the “gold standard” for large-model training.
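A back-of-the-envelope check of that 45-55% range, using the common 6N FLOPs-per-token approximation for a forward+backward pass (this simplified sketch drops the attention term that nanoGPT's `estimate_mfu` also includes, and the ~989 TFLOPS dense BF16 peak for the H200 is an assumption):

```python
# Rough MFU estimate under the 6*N FLOPs-per-token approximation.
n_params = 1.5e9               # model size (1.5B parameters)
tokens_per_iteration = 1.05e6  # ~1M tokens per logged iteration
iteration_time_s = 21.0        # observed iteration time

flops_per_token = 6 * n_params                  # fwd + bwd FLOPs per token
achieved_flops = flops_per_token * tokens_per_iteration / iteration_time_s
peak_flops = 989e12                             # assumed H200 dense BF16 peak

mfu = achieved_flops / peak_flops
print(f"{mfu:.1%}")  # 45.5%
```

The estimate lands near the bottom of the 45-55% band, consistent with the power-draw observation.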


How to make it “feel” faster

If you want to see progress updates more often (e.g., every few seconds instead of every 20), we should reduce the accumulation steps and increase the batch size to fill that 141GB VRAM better.

Suggested Optimization for H200:
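No specific values were given above, so purely as an illustration, a config fragment along these lines trades accumulation steps for a larger micro-batch. All numbers here are hypothetical and should be tuned against actual VRAM usage:

```python
# Hypothetical nanoGPT config fragment -- illustrative values only.
# Fewer accumulation steps -> more frequent log lines;
# larger micro-batch -> better use of the H200's 141GB of memory.
batch_size = 32                    # larger micro-batch per forward/backward pass
gradient_accumulation_steps = 8    # log roughly every 8 micro-steps
block_size = 1024                  # context length, unchanged

# Effective tokens per iteration: 32 * 1024 * 8 = 262,144 (~262k)
```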

Updated Estimate

To process your 10B token dataset with the new settings (effective batch $\approx 262k$ tokens):
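A rough pass over that estimate, assuming the per-micro-step time stays near the ~328ms measured earlier (a larger micro-batch may actually improve per-token throughput, so treat this as an upper-bound sketch):

```python
dataset_tokens = 10e9
tokens_per_iteration = 262_144     # ~262k effective batch with the new settings
ms_per_micro_step = 328            # assumed to stay roughly constant
grad_accum_steps = 8

iterations = dataset_tokens / tokens_per_iteration
seconds_per_iteration = ms_per_micro_step * grad_accum_steps / 1000
total_hours = iterations * seconds_per_iteration / 3600

print(round(iterations))           # ~38147 iterations for one pass over the data
print(round(total_hours, 1))       # roughly 28 hours of wall-clock time
```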

Should I update your config to these more “responsive” settings and push it?

