Understanding Slow Training Iterations | Generated by AI


Those big jumps are completely normal. They aren’t “slow iterations” — they are evaluation + checkpoint iterations, which always take much longer than the normal training steps.

Here’s what’s happening in your log:

1. Normal iterations

These are the lines like:

iter 100: time 168ms
iter 200: time 170ms
...
iter 900: time 168ms

Those are plain training steps (forward pass, backward pass, optimizer update) on a single batch. They stay in a stable range (~160–180 ms), which is consistent with what your GPU can sustain.
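To make concrete what that number measures, here is a minimal sketch (not NanoGPT's exact code; it assumes a model that returns (logits, loss) like NanoGPT's GPT class, plus an existing optimizer and a get_batch helper) of timing one plain training step:

```python
import time
import torch

def timed_train_step(model, optimizer, get_batch):
    """One plain training step, timed roughly the way the 'iter X: ... time XXXms' line reports it."""
    x, y = get_batch("train")
    t0 = time.time()
    logits, loss = model(x, y)              # forward pass
    loss.backward()                         # backward pass
    optimizer.step()                        # parameter update
    optimizer.zero_grad(set_to_none=True)
    if torch.cuda.is_available():
        torch.cuda.synchronize()            # make sure the GPU work is included in the wall-clock time
    return loss.item(), (time.time() - t0) * 1000.0   # elapsed milliseconds
```

When nothing else runs in between, that elapsed time is the stable ~170 ms you see on the normal lines.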

2. The “slow” ones

Examples:

iter 0: time 12543ms
iter 500: time 5985ms
iter 1000: time 5726ms

These lines come right after the “step X” blocks, where NanoGPT (in train.py) does, roughly:

- runs estimate_loss(), i.e. eval_iters forward passes on both the train and the val split, to produce the printed “train loss” / “val loss”,
- saves a checkpoint (ckpt.pt) if the val loss improved, or unconditionally if always_save_checkpoint is set,
- and only then runs the actual training step for that iteration.

That entire sequence happens every eval_interval iterations (log_interval only controls how often the per-iteration “iter X: …” line is printed); a condensed sketch of the loop is shown after the log excerpts below. In your output, you can see the pattern:

step 500: train loss..., val loss...
saving checkpoint
iter 500: loss..., time 5985ms

and

step 1000: train loss..., val loss...
saving checkpoint
iter 1000: loss..., time 5726ms

So the time printed for iter 1000 isn’t just the training step’s compute time; it’s the combined cost of evaluation, checkpoint writing, and the actual training step.
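A condensed sketch of the loop structure makes the point. It is paraphrased and heavily simplified from NanoGPT's train.py (gradient accumulation, mixed precision, and DDP are omitted), and it assumes model, optimizer, get_batch, estimate_loss, and the config variables are defined earlier in the file, as they are there:

```python
import os
import time
import torch

t0 = time.time()
while iter_num <= max_iters:

    # every eval_interval steps: full evaluation + (possibly) a checkpoint write
    if iter_num % eval_interval == 0:
        losses = estimate_loss()            # eval_iters batches on BOTH the train and val splits
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                print("saving checkpoint")
                torch.save({"model": model.state_dict(),          # slow disk write
                            "optimizer": optimizer.state_dict()},
                           os.path.join(out_dir, "ckpt.pt"))

    # the actual training step
    X, Y = get_batch("train")
    logits, loss = model(X, Y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # timing: dt covers everything since the previous iteration's timestamp,
    # so the eval + checkpoint block above is counted in the printed time
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0:
        print(f"iter {iter_num}: loss {loss.item():.4f}, time {dt*1000:.0f}ms")

    iter_num += 1
```

That is why the slow lines show up exactly at iter 0, 500, 1000, …: those are the iterations where the if-block at the top fires.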

Why is iter 0 even larger (~12 seconds)?

On top of the usual step-0 evaluation, the very first iteration pays several one-time startup costs:

- CUDA context creation and cuDNN/kernel selection on the first GPU launches,
- torch.compile compiling the model on its first forward pass (if compile = True, the NanoGPT default), which alone can take several seconds,
- the estimate_loss() run that eval_interval triggers at step 0 (no checkpoint is written at iter 0),
- memory allocator warm-up for the first full-size batches.

That’s why it’s so much larger.
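If you want to see the warm-up cost in isolation, here is a small standalone snippet (a toy model unrelated to GPT; sizes and step count are arbitrary) that times a few steps of a torch.compile-d module. Step 0 pays the compilation and CUDA warm-up; later steps settle into a steady, much smaller time:

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.compile(
    nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).to(device)
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(64, 512, device=device)

for step in range(5):
    t0 = time.time()
    loss = model(x).pow(2).mean()   # dummy loss, just to drive forward + backward
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"step {step}: {(time.time() - t0) * 1000:.0f} ms")  # step 0 is far slower than the rest
```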


Summary

Your “slow iterations” don’t reflect the real per-step training cost; they only occur at evaluation + checkpoint boundaries.

Normal iterations (those that matter for throughput) are ~170ms and stable.

If you want fewer or cheaper spikes, you can adjust (see the config sketch below):

- eval_interval: increase it so evaluation + checkpointing runs less often,
- eval_iters: decrease it so each evaluation uses fewer batches,
- always_save_checkpoint: set it to False so ckpt.pt is only written when the val loss improves.

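For concreteness, a config override along these lines could look like the sketch below. The variable names are the ones NanoGPT's train.py actually reads; the file name and the specific values are only illustrative, not tuned recommendations:

```python
# config/fewer_eval_spikes.py  (hypothetical file name, passed to train.py like any other config)
eval_interval = 1000            # evaluate + maybe checkpoint every 1000 iters instead of every 500
eval_iters = 100                # fewer batches per evaluation, so each "step X" block is cheaper
log_interval = 100              # how often the per-iteration "iter X: ... time XXXms" line prints
always_save_checkpoint = False  # only write ckpt.pt when the val loss actually improves
```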
Let me know if you want recommended settings for faster overall training on 4070/4080/4090.

