Accelerating NanoChat MI300X Training | Generated by AI

2026.06

Question: How can I speed up this NanoChat MI300X training run?

Answer:

Based on the command you posted, you’re already using some of the biggest optimizations (the log name suggests FP8 and FlashAttention 2).

The remaining gains usually come from the areas below:

1. Verify FlashAttention 2 Is Actually Being Used

This is the biggest potential win.

Check the training log:

grep -i "flash" run_mi300x_d24_fa2_fp8.log

or

grep -i "attention" run_mi300x_d24_fa2_fp8.log

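If NanoChat silently falls back to SDPA (PyTorch’s scaled_dot_product_attention), each step takes noticeably longer. Going by the benchmark below, FA2 cuts step time by roughly 20–40%.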

Your earlier benchmark showed:

Config         Step time
bf16 + SDPA    7.7 s
FP8 + SDPA     6.4 s
FP8 + FA2      ~4–5 s

So FA2 is often the single largest speedup.
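To see where the 20–40% figure comes from, here is a small sketch that turns the step times from the table into speedups. The only inputs are the numbers quoted above; the 4.0 s and 5.0 s values are simply the two ends of the "~4–5s" range.

```python
# Step times in seconds, copied from the benchmark table above.
# The FP8 + FA2 entry is a range because the table gives "~4-5s".
BF16_SDPA = 7.7
FP8_SDPA = 6.4
FP8_FA2_RANGE = (4.0, 5.0)


def step_time_reduction(before: float, after: float) -> float:
    """Fraction of step time saved when going from `before` to `after`."""
    return (before - after) / before


def speedup(before: float, after: float) -> float:
    """Throughput ratio: how many times faster `after` is than `before`."""
    return before / after


if __name__ == "__main__":
    print(f"bf16 -> FP8 (both SDPA): {speedup(BF16_SDPA, FP8_SDPA):.2f}x")
    for fa2 in FP8_FA2_RANGE:
        print(
            f"FP8 SDPA -> FP8 FA2 at {fa2:.1f}s: "
            f"{speedup(FP8_SDPA, fa2):.2f}x, "
            f"{step_time_reduction(FP8_SDPA, fa2):.0%} less time per step"
        )
```

If your own log shows step times close to the FP8 + SDPA row, that is a hint the FA2 path is not active.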


2. Increase Device Batch Size

You currently have:

--device-batch-size=32

On an MI300X (192 GB HBM), this is often conservative.

Try:

--device-batch-size=48

If that runs without out-of-memory errors:

--device-batch-size=64

Watch memory:

watch -n 1 rocm-smi --showmemuse

If memory use stays below roughly 160 GB of the 192 GB (about 83%, if your rocm-smi reports a percentage), you’re likely leaving performance on the table.
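One thing to check before changing the batch size: if the training script fixes a total batch size in tokens and derives gradient accumulation from it, the per-step token count has to divide that total evenly. The sketch below is only that arithmetic. The 524,288-token total and the single-GPU world size are assumptions, not values read from your command, so substitute whatever your NanoChat version actually uses.

```python
from typing import Optional


def grad_accum_steps(
    total_batch_tokens: int,
    device_batch_size: int,
    max_seq_len: int,
    world_size: int = 1,
) -> Optional[int]:
    """Return the number of micro-steps per optimizer step, or None if the
    total token batch is not a multiple of the tokens processed per micro-step."""
    tokens_per_micro_step = device_batch_size * max_seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        return None
    return total_batch_tokens // tokens_per_micro_step


if __name__ == "__main__":
    TOTAL_BATCH_TOKENS = 524_288  # ASSUMPTION: check your own configuration
    for batch_size in (32, 48, 64):
        steps = grad_accum_steps(TOTAL_BATCH_TOKENS, batch_size, max_seq_len=2048)
        if steps is None:
            print(f"device batch {batch_size}: does not divide the total batch")
        else:
            print(f"device batch {batch_size}: {steps} accumulation steps")
```

Under these assumptions 48 does not divide evenly while 32 and 64 do, so if the script rejects --device-batch-size=48, going straight to 64 may be the simpler test.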


3. Increase Sequence Length (Sometimes Faster)

Counterintuitively, larger sequences can improve MFU.

Current:

--max-seq-len=2048

Try:

--max-seq-len=4096

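if NanoChat supports it. Note that this doubles the tokens per micro-batch, so you may need to lower --device-batch-size to fit in memory, and it changes the context length the model is trained with.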

Why?

Longer sequences increase arithmetic intensity and keep matrix units busier.

Rough MI300X utilization ranges (illustrative, not measured on your run):

Seq len    MFU
2048       25–35%
4096       35–45%
8192       45–55%
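A rough way to see the effect is the common 6N + 12·L·d·T estimate of training FLOPs per token, where N is the parameter count, L the number of layers, d the model width, and T the sequence length. The sketch below uses the 24 layers and ~760M parameters mentioned in this answer. The width of 1536 is a hypothetical placeholder, so read the output as a trend rather than a measurement.

```python
def train_flops_per_token(
    n_params: int, n_layers: int, model_dim: int, seq_len: int
) -> int:
    """Approximate forward+backward FLOPs per token.

    6 * n_params covers the dense matmuls; 12 * L * d * T is the usual
    estimate for the attention score and value matmuls.
    """
    dense = 6 * n_params
    attention = 12 * n_layers * model_dim * seq_len
    return dense + attention


def attention_share(
    n_params: int, n_layers: int, model_dim: int, seq_len: int
) -> float:
    """Fraction of the per-token FLOPs spent in the attention matmuls."""
    attention = 12 * n_layers * model_dim * seq_len
    return attention / train_flops_per_token(n_params, n_layers, model_dim, seq_len)


if __name__ == "__main__":
    N_PARAMS = 760_000_000  # from the "~760M" figure in this answer
    N_LAYERS = 24           # from --depth=24
    MODEL_DIM = 1536        # ASSUMPTION: hypothetical width, check your model
    for seq_len in (2048, 4096, 8192):
        share = attention_share(N_PARAMS, N_LAYERS, MODEL_DIM, seq_len)
        print(f"seq_len={seq_len}: attention is {share:.0%} of FLOPs per token")
```

Because each token costs more FLOPs at longer sequence lengths, MFU can rise while tokens/sec falls, so compare both numbers before and after the change.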

4. Increase Model Size

You’re training a 24-layer ~760M model:

--depth=24

Small models often become CPU/data-loader limited.

On MI300X, 760M is relatively small.

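Commonly reported ranges (illustrative, not measured on your run):

Model    MFU
760M     25–35%
3B       40–55%
7B       55–70%

Larger models produce bigger GEMMs, which keep the GPU’s matrix units better fed. This raises utilization rather than the speed of this particular run, so it only applies if a larger model suits your goal.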


5. Reduce Evaluation Frequency

Current:

--eval-every=1000
--eval-tokens=1048576

Each evaluation pauses training.

For pure throughput testing:

--eval-every=5000

or

--eval-every=-1

temporarily.

Also:

--sample-every=-1
--core-metric-every=-1

during benchmarking.
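Before changing these flags, it can help to estimate how much wall-clock time evaluation actually costs. The sketch below assumes you have measured two numbers from your own log: the average training step time and the duration of one evaluation pass. The 5 s and 30 s values in the example are made up for illustration.

```python
def eval_overhead_fraction(
    step_seconds: float, eval_seconds: float, eval_every: int
) -> float:
    """Fraction of wall-clock time spent evaluating.

    One cycle is `eval_every` training steps followed by one evaluation.
    A non-positive `eval_every` means evaluation is disabled.
    """
    if eval_every <= 0:
        return 0.0
    train_seconds = step_seconds * eval_every
    return eval_seconds / (train_seconds + eval_seconds)


if __name__ == "__main__":
    STEP_SECONDS = 5.0   # hypothetical: replace with your measured step time
    EVAL_SECONDS = 30.0  # hypothetical: replace with your measured eval time
    for eval_every in (1000, 5000, -1):
        frac = eval_overhead_fraction(STEP_SECONDS, EVAL_SECONDS, eval_every)
        print(f"--eval-every={eval_every}: {frac:.2%} of time in evaluation")
```

If the fraction is already well under one percent, raising --eval-every will not change throughput much, and sampling or the core metric may be the more expensive pauses to check.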


6. Check GPU Utilization

Run:

watch -n 1 rocm-smi

Look at:

GPU Utilization
HBM Utilization
Power Draw

Healthy training should show:

GPU Util: 90–100%
Power: 650–750 W

If GPU utilization swings between low and high values from one sample to the next, for example:

20% → 100% → 30% → 100%

you are likely data-loader or CPU bottlenecked.
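If you note down a series of utilization readings, a quick summary makes the pattern easier to judge than eyeballing the terminal. This is a minimal sketch: the 90% "busy" threshold and the 80% cutoff are arbitrary choices for illustration, and the samples are typed in by hand rather than read from rocm-smi.

```python
from statistics import mean


def summarize_utilization(samples: list[float], busy_threshold: float = 90.0) -> dict:
    """Summarize GPU utilization samples (percent values, one per poll)."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    busy = sum(1 for s in samples if s >= busy_threshold)
    return {
        "mean": mean(samples),
        "min": min(samples),
        "busy_fraction": busy / len(samples),
    }


def looks_input_bound(
    samples: list[float],
    busy_threshold: float = 90.0,
    min_busy_fraction: float = 0.8,
) -> bool:
    """Heuristic: flag the run if the GPU is busy in too few of the samples."""
    summary = summarize_utilization(samples, busy_threshold)
    return summary["busy_fraction"] < min_busy_fraction


if __name__ == "__main__":
    fluctuating = [20, 100, 30, 100]  # the pattern described above
    steady = [95, 98, 100, 97]
    print("fluctuating:", summarize_utilization(fluctuating), looks_input_bound(fluctuating))
    print("steady:", summarize_utilization(steady), looks_input_bound(steady))
```

Dips can also come from evaluation pauses, checkpoint writes, or inter-GPU communication, so treat a flagged run as a prompt to look closer rather than a diagnosis.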


7. Use More Data Loader Workers

Check whether NanoChat exposes:

--num-workers

or similar.

If yes:

--num-workers=16

or

--num-workers=32

can significantly improve throughput.


8. Check Actual Tokens/sec

The most important metrics are:

tokens/sec
MFU

If you’re seeing:

300K tok/s

there is room for improvement.

If you’re seeing:

500K-700K tok/s

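for a 760M model on MI300X, you’re already in excellent territory. Check whether the logged figure is per GPU or summed across all GPUs before comparing against these numbers.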
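The two metrics are linked, and you can sanity-check one against the other. The sketch below uses the simple 6N FLOPs-per-token approximation (it ignores attention) and AMD’s published dense bf16 peak for MI300X. Which peak your training script uses for its own MFU line is something to verify, and the GPU count here is an assumption.

```python
# AMD's published dense bf16 peak for one MI300X, in FLOP/s.
MI300X_BF16_DENSE_PEAK = 1307.4e12


def mfu(
    tokens_per_sec: float,
    flops_per_token: float,
    peak_flops_per_gpu: float,
    num_gpus: int = 1,
) -> float:
    """Model FLOPs utilization: achieved FLOP/s divided by aggregate peak."""
    achieved = tokens_per_sec * flops_per_token
    return achieved / (peak_flops_per_gpu * num_gpus)


if __name__ == "__main__":
    N_PARAMS = 760_000_000            # from the "~760M" figure in this answer
    FLOPS_PER_TOKEN = 6 * N_PARAMS    # forward + backward, attention ignored
    for tok_s in (300_000, 500_000, 700_000):
        for gpus in (1, 8):           # ASSUMPTION: try your real GPU count
            value = mfu(tok_s, FLOPS_PER_TOKEN, MI300X_BF16_DENSE_PEAK, gpus)
            print(f"{tok_s:>7} tok/s on {gpus} GPU(s): MFU = {value:.1%}")
```

A result above 100% means the throughput figure cannot belong to that GPU count at that precision, which is a quick way to tell per-GPU numbers from node totals.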


9. Profile the First 100 Steps

Run:

python -m scripts.base_train \
  ... \
  --num-iterations=100 2>&1 | tee run.log

Then inspect:

grep -i "tok/s" run.log
grep -i "mfu" run.log

After the first few warm-up steps, these two numbers usually make the bottleneck obvious.
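If you would rather have one summary number than a wall of grep output, a short script can pull the values out of run.log. The line format below ("tok/sec: 116,508 | mfu: 20.20") is an assumption about what the training script prints, so adjust the two regular expressions to match your actual log.

```python
import re
from statistics import median
from typing import Optional

# ASSUMPTION: each step line contains "tok/sec: <number>" and "mfu: <number>".
TOK_RE = re.compile(r"tok/s(?:ec)?:\s*([\d,]+(?:\.\d+)?)", re.IGNORECASE)
MFU_RE = re.compile(r"mfu:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_throughput(
    log_text: str, skip_first: int = 10
) -> Optional[tuple[float, float]]:
    """Return (median tok/s, median MFU) after dropping warm-up steps.

    Returns None if no usable lines remain.
    """
    toks, mfus = [], []
    for line in log_text.splitlines():
        tok_match = TOK_RE.search(line)
        mfu_match = MFU_RE.search(line)
        if tok_match and mfu_match:
            toks.append(float(tok_match.group(1).replace(",", "")))
            mfus.append(float(mfu_match.group(1)))
    toks, mfus = toks[skip_first:], mfus[skip_first:]
    if not toks:
        return None
    return median(toks), median(mfus)


if __name__ == "__main__":
    # Made-up sample lines in the assumed format, for illustration only.
    sample = "\n".join([
        "step 00001 | dt: 9000.00ms | tok/sec: 58,254 | mfu: 10.10",
        "step 00002 | dt: 4500.00ms | tok/sec: 116,508 | mfu: 20.20",
        "step 00003 | dt: 4369.00ms | tok/sec: 120,000 | mfu: 21.00",
    ])
    print(parse_throughput(sample, skip_first=1))
```

Skipping the first steps matters because they often include compilation and cache warm-up, which would otherwise drag the median down.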

I would try these in order:

--device-batch-size=64

then

--eval-every=5000
--sample-every=-1
--core-metric-every=-1

then verify FlashAttention 2 is active.

Those three changes typically yield the largest practical speedup without modifying NanoChat source code.

