Accelerating NanoChat MI300X Training

Generated by AI

2026.06

Question: How can I speed up this NanoChat MI300X training run?

Answer:

Based on the command you posted, you’re already using some of the biggest optimizations:

FP8 enabled (--fp8)
FlashAttention 2 (from your earlier tests)
Large total batch size (524288 tokens)
MI300X GPU
Sequence length 2048

The remaining gains usually come from the following areas:

1. Verify FlashAttention 2 is Actually Being Used

This is the biggest potential win.

Check the training log:

grep -i "flash" run_mi300x_d24_fa2_fp8.log

grep -i "attention" run_mi300x_d24_fa2_fp8.log

If NanoChat silently falls back to SDPA, you can lose 20–40% performance.

Your earlier benchmark showed:

Config	Step Time
bf16 + SDPA	7.7s
FP8 + SDPA	6.4s
FP8 + FA2	~4–5s

So FA2 is often the single largest speedup.
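To put the benchmark table in relative terms, the short sketch below turns the step times into speedup factors. It uses only the numbers listed above; the variable and function names are illustrative, and the "~4–5s" entry is treated as a range.

```python
# Step times in seconds, copied from the benchmark table above.
BF16_SDPA = 7.7
FP8_SDPA = 6.4
FP8_FA2_RANGE = (4.0, 5.0)  # the "~4-5s" entry, as (fast end, slow end)


def speedup(old_s, new_s):
    """How many times faster the new configuration is."""
    return old_s / new_s


def time_reduction(old_s, new_s):
    """Fraction of step time saved by the new configuration."""
    return (old_s - new_s) / old_s


fp8_gain = speedup(BF16_SDPA, FP8_SDPA)
fa2_gain = [speedup(FP8_SDPA, t) for t in FP8_FA2_RANGE]
fa2_saved = [time_reduction(FP8_SDPA, t) for t in FP8_FA2_RANGE]

print(f"FP8 over bf16 (both SDPA): {fp8_gain:.2f}x")
print(f"FA2 over SDPA (both FP8): {fa2_gain[1]:.2f}x to {fa2_gain[0]:.2f}x")
print(f"Step time saved by FA2: {fa2_saved[1]:.0%} to {fa2_saved[0]:.0%}")
```

On these figures, moving from SDPA to FA2 saves roughly 22% to 38% of the step time, which is a larger gain than the switch from bf16 to FP8.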

2. Increase Device Batch Size

You currently have:

--device-batch-size=32

On an MI300X (192 GB HBM), this is often conservative.

Try:

--device-batch-size=48

If stable:

--device-batch-size=64

Watch memory:

watch -n 1 rocm-smi --showmemuse

If memory usage stays under 160 GB, there is headroom for a larger device batch and you’re leaving performance on the table.
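A larger device batch matters because, at a fixed total batch size, it reduces the number of accumulation micro-steps per optimizer step. The sketch below works this out under two assumptions that are not confirmed above: the total batch size of 524288 is counted in tokens, and the run uses a single GPU. The function name is hypothetical.

```python
def accumulation_plan(total_batch_tokens, device_batch_size, seq_len, num_gpus=1):
    """Return (micro_steps, leftover_tokens) for one optimizer step.

    Assumes the total batch is measured in tokens and that every
    micro-step processes device_batch_size * seq_len tokens per GPU.
    """
    tokens_per_micro_step = device_batch_size * seq_len * num_gpus
    return divmod(total_batch_tokens, tokens_per_micro_step)


TOTAL_BATCH_TOKENS = 524288  # assumed to be tokens
SEQ_LEN = 2048

for device_batch_size in (32, 48, 64):
    steps, leftover = accumulation_plan(TOTAL_BATCH_TOKENS, device_batch_size, SEQ_LEN)
    if leftover == 0:
        print(f"device batch {device_batch_size}: {steps} micro-steps per optimizer step")
    else:
        print(f"device batch {device_batch_size}: does not divide evenly "
              f"({leftover} tokens left over)")
```

Under these assumptions, 32 gives 8 micro-steps and 64 gives 4, while 48 does not divide 524288 tokens evenly. If the trainer requires an exact multiple (worth checking in NanoChat), 48 would only work with a different total batch size.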

3. Increase Sequence Length (Sometimes Faster)

Counterintuitively, larger sequences can improve MFU.

Current:

--max-seq-len=2048

If NanoChat supports it, try:

--max-seq-len=4096

Why?

Longer sequences increase arithmetic intensity and keep matrix units busier.

Typical MI300X utilization:

Seq Len	MFU
2048	25–35%
4096	35–45%
8192	45–55%

4. Increase Model Size

You’re training a 24-layer ~760M model:

--depth=24

Small models often become CPU/data-loader limited.

On MI300X, 760M is relatively small.

Common observation:

Model	MFU
760M	25–35%
3B	40–55%
7B	55–70%

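The GPU likes bigger GEMMs. Note that a larger model raises MFU, not tokens/sec, so this only helps if you actually want to train a bigger model.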

5. Reduce Evaluation Frequency

Current:

--eval-every=1000
--eval-tokens=1048576

Each evaluation pauses training.

For pure throughput testing, evaluate less often:

--eval-every=5000

or disable evaluation temporarily:

--eval-every=-1

Also disable sampling and the core metric during benchmarking:

--sample-every=-1

--core-metric-every=-1
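A rough way to size the evaluation pause is to compare its compute with the training compute between evaluations. The sketch below assumes that --eval-tokens counts forward-only tokens, that a forward pass costs about one third of a full training step per token, and that each training step processes 524288 tokens. None of these is confirmed above, and the function name is hypothetical.

```python
def eval_overhead(eval_tokens, eval_every, train_tokens_per_step, fwd_cost_ratio=1 / 3):
    """Estimate evaluation compute as a fraction of training compute.

    fwd_cost_ratio is the assumed cost of a forward-only token relative
    to a training token (forward plus backward).
    """
    eval_cost = eval_tokens * fwd_cost_ratio
    train_cost = eval_every * train_tokens_per_step
    return eval_cost / train_cost


EVAL_TOKENS = 1048576
TRAIN_TOKENS_PER_STEP = 524288  # assumed tokens per optimizer step

for eval_every in (1000, 5000):
    share = eval_overhead(EVAL_TOKENS, eval_every, TRAIN_TOKENS_PER_STEP)
    print(f"--eval-every={eval_every}: about {share:.3%} of training compute")
```

This estimate covers only the periodic evaluation pass. It does not cover sampling or the core metric, whose cost should be measured separately.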

6. Check GPU Utilization

Run:

watch -n 1 rocm-smi

Look at:

GPU Utilization
HBM Utilization
Power Draw

Healthy training should show:

GPU Util: 90-100%
Power: 650-750W

If GPU utilization swings between low and high readings, for example:

20%
100%
30%
100%

the run is likely bottlenecked on the data loader or CPU rather than the GPU.
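The fluctuation check can be made less subjective by recording a series of utilization readings and counting how many fall well below full load. The sketch below is a minimal heuristic. The function name and both thresholds are hypothetical choices, and the readings are assumed to have been collected by hand or by a separate script.

```python
def looks_input_bound(util_samples, low_threshold=50.0, min_low_fraction=0.2):
    """Return True if a notable share of utilization readings are low.

    util_samples: GPU utilization readings in percent, taken during training.
    low_threshold and min_low_fraction are illustrative defaults, not
    calibrated values.
    """
    if not util_samples:
        raise ValueError("need at least one utilization sample")
    low_count = sum(1 for u in util_samples if u < low_threshold)
    return low_count / len(util_samples) >= min_low_fraction


fluctuating = [20, 100, 30, 100]   # the pattern shown above
steady = [95, 98, 100, 97]         # a healthy-looking run

print(looks_input_bound(fluctuating))
print(looks_input_bound(steady))
```

Readings taken during evaluation or checkpointing should be excluded, since utilization also drops there for unrelated reasons.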

7. Use More Data Loader Workers

Check whether NanoChat exposes:

--num-workers

or similar.

If it does, try:

--num-workers=16

or

--num-workers=32

When the data loader is the bottleneck, this can significantly improve throughput.

8. Check Actual Tokens/sec

The most important metrics are:

tokens/sec
MFU

As a rough guide for a 760M model on MI300X (the figures depend on how many GPUs the run uses):

300K tok/s: there is room for improvement
500K-700K tok/s: already in excellent territory
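The two metrics are linked: MFU can be estimated from tokens/sec with the common approximation of 6 FLOPs per parameter per training token, which ignores attention FLOPs. The sketch below rests on assumptions: the peak figures are AMD's published dense peaks for one MI300X and should be checked against your hardware, the model has 760M parameters, and each step processes 524288 tokens. Which peak to divide by for an FP8 run is a convention, so both are shown.

```python
# Assumed peak dense throughput for one MI300X, in FLOP/s.
PEAK_BF16 = 1307.4e12
PEAK_FP8 = 2614.9e12


def tokens_per_sec(tokens_per_step, step_time_s):
    """Throughput implied by a measured step time."""
    return tokens_per_step / step_time_s


def mfu(n_params, tok_per_sec, peak_flops, num_gpus=1):
    """Approximate model FLOPs utilization using 6 * N FLOPs per token."""
    achieved_flops = 6 * n_params * tok_per_sec
    return achieved_flops / (peak_flops * num_gpus)


N_PARAMS = 760e6
tps = tokens_per_sec(524288, 4.5)  # 4.5 s sits in the "~4-5s" FA2 range

print(f"throughput: {tps:,.0f} tok/s")
print(f"MFU vs bf16 peak: {mfu(N_PARAMS, tps, PEAK_BF16):.1%}")
print(f"MFU vs FP8 peak:  {mfu(N_PARAMS, tps, PEAK_FP8):.1%}")
```

The same function can sanity-check throughput targets. Under these assumptions, 700K tok/s for a 760M model on a single GPU would exceed even the FP8 peak, so figures in that range only make sense for a multi-GPU run (pass num_gpus accordingly).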

9. Profile the First 100 Steps

Run:

python -m scripts.base_train \
  ... \
  --num-iterations=100

Then inspect:

grep -i "tok/s" run.log
grep -i "mfu" run.log

The bottleneck usually becomes obvious:

low MFU → GPU not saturated
high MFU but low tok/s → GPU is busy; per-token compute, not the input pipeline, is the limit
fluctuating utilization → dataloader bottleneck
memory near capacity → device batch size is at its limit
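When reading numbers from a short run, note that the first few steps typically include compilation and warm-up and can be much slower than the rest. The sketch below computes a steady-state throughput from a list of per-step times. The function name and the warm-up count are hypothetical, and the step times are assumed to have been extracted from the log by hand or with your own parser.

```python
from statistics import median


def steady_state_tok_per_sec(step_times_s, tokens_per_step, warmup_steps=10):
    """Median throughput after discarding the first warmup_steps steps."""
    steady = step_times_s[warmup_steps:]
    if not steady:
        raise ValueError("no steps left after discarding warm-up")
    return tokens_per_step / median(steady)


# Illustrative step times: two slow warm-up steps, then a steady 4.5 s.
example_times = [30.0, 12.0] + [4.5] * 8
rate = steady_state_tok_per_sec(example_times, 524288, warmup_steps=2)
print(f"steady-state throughput: {rate:,.0f} tok/s")
```

The median is used so that an occasional slow step, such as one that overlaps an evaluation, does not skew the result.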

Recommended First Changes

I would try these in order:

--device-batch-size=64

then

--eval-every=5000
--sample-every=-1
--core-metric-every=-1

then verify FlashAttention 2 is active.

Those three changes typically yield the largest practical speedup without modifying NanoChat source code.
