H200 Training Speed and Optimization Guide | Generated by AI
The reason it “feels” slow is purely a matter of reporting, not actual performance. On an H200, your training is actually running at blazing fast speeds.
Here is the breakdown of why you see 21 seconds per iteration and why that is actually a good number:
1. The Math of “One Iteration”
In your nanoGPT config, an “iteration” is not one forward/backward pass. It is the sum of all Gradient Accumulation Steps.
- Batch Size: 16
- Block Size: 1024
- Gradient Accumulation Steps (GAS): 64
- Total tokens per iteration: $16 \times 1024 \times 64 = 1,048,576$ tokens.
You are processing over 1 million tokens every time a single line prints to your screen.
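The arithmetic above can be checked directly (variable names follow nanoGPT's train.py config; the values are the ones from your run):

```python
# Tokens processed per logged "iteration" in nanoGPT:
# micro-batch size x context length x gradient accumulation steps.
batch_size = 16        # micro-batch per forward/backward pass
block_size = 1024      # context length in tokens
grad_accum_steps = 64  # micro-steps summed into one optimizer step

tokens_per_iter = batch_size * block_size * grad_accum_steps
print(tokens_per_iter)  # 1048576
```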
2. Micro-step Performance
If 1 iteration (64 steps) takes 21 seconds: \(21,000\text{ms} / 64\text{ steps} \approx 328\text{ms per step}\)
For a 1.5B-parameter model, ~328ms per step is excellent performance. Your nvidia-smi showing 689W power draw (close to the H200's 700W TDP) confirms the chip is working near its limit. You aren't wasting time; you are just processing a massive amount of data before each log entry.
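The same numbers give the per-step time and the overall throughput (a quick sketch, using the 21s/iteration you observed):

```python
iter_time_s = 21.0           # observed wall time per logged iteration
grad_accum_steps = 64
tokens_per_iter = 16 * 1024 * 64

ms_per_step = iter_time_s * 1000 / grad_accum_steps  # time per micro-step
tokens_per_s = tokens_per_iter / iter_time_s         # sustained throughput
print(f"{ms_per_step:.0f} ms/step, {tokens_per_s:,.0f} tokens/s")
```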
3. Why MFU was 164%
The mfu (Model FLOPs Utilization) reading over 100% is a known quirk in nanoGPT right after the torch.compile phase finishes; it usually stabilizes after iter 100. Given your power draw, you are likely hitting 45-55% real MFU, which is the gold standard for large-model training.
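As a rough cross-check of that 45-55% figure, here is a simplified MFU estimate. It uses the common ≈6N FLOPs-per-token approximation (not nanoGPT's exact PaLM-style formula), and the 989 TFLOPS BF16 dense peak for the H200 is an assumption:

```python
n_params = 1.5e9                  # model size from your config
tokens_per_s = 1_048_576 / 21.0   # measured throughput
flops_per_token = 6 * n_params    # ~6N approximation (forward + backward)
achieved_flops = flops_per_token * tokens_per_s
peak_flops = 989e12               # assumed H200 BF16 dense peak
mfu = achieved_flops / peak_flops
print(f"{mfu:.1%}")               # ~45%
```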
How to make it “feel” faster
If you want to see progress updates more often (e.g., every few seconds instead of every 20), we should reduce the accumulation steps and increase the batch size to fill that 141GB VRAM better.
Suggested Optimization for H200:
- Batch Size: $32$ (doubled from 16 to use more of the 141GB)
- GAS: $8$ (reduced from 64)
- Result: You will get a log update roughly every 5 seconds instead of every 21 (the doubled micro-batch roughly doubles per-step time, so $\approx 650\text{ms} \times 8 \approx 5$ s).
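A minimal sketch of the change as a nanoGPT config override (the filename is hypothetical; the variable names match stock train.py):

```python
# config/train_h200_responsive.py (hypothetical filename)
batch_size = 32                  # doubled micro-batch to use more of the 141GB
gradient_accumulation_steps = 8  # fewer micro-steps per logged iteration
# effective tokens per iteration: 32 * 1024 * 8 = 262,144
```

You would pass it the usual way, e.g. `python train.py config/train_h200_responsive.py`.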
Updated Estimate
To process your 10B token dataset with the new settings (effective batch $\approx 262k$ tokens):
- Total Iterations needed: $\approx 38,000$
- Time per iteration: $\approx 5$ seconds
- Total wait time: ~53 hours for a high-quality 1.5B model. Note that throughput in tokens/s barely changes with these settings, so total wall time stays about the same as before; what improves is the logging cadence.
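The estimate above is just throughput arithmetic; a quick sketch (the 5s/iteration figure is the assumption from the micro-step estimate):

```python
dataset_tokens = 10e9
tokens_per_iter = 32 * 1024 * 8   # 262,144 with the new settings
iters_needed = dataset_tokens / tokens_per_iter
seconds_per_iter = 5.0            # assumed per-iteration wall time
total_hours = iters_needed * seconds_per_iter / 3600
print(round(iters_needed), round(total_hours))  # ~38,147 iters, ~53 hours
```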
Should I update your config to these more “responsive” settings and push it?