1.5B Model Training Time Estimate

Generated by AI

2026.06

Question: For training a 1.5B model on nanochat, how many tokens should we prepare and how much time will it take?

Answer:

Training a 1.5B parameter model requires planning both total token budget and wall-clock time, depending on your goals (research-quality vs production-quality). Here’s a structured estimate:

1. Token budget

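A rough rule of thumb for transformer LLMs (for comparison, the Chinchilla compute-optimal ratio is roughly 20 tokens per parameter, so every tier below trains well past that point):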

Small research runs: 50–100× model parameters in tokens
Medium/final-quality runs: 200–300× model parameters in tokens
High-quality/foundation-scale: 500×+ model parameters in tokens

Calculation for 1.5B

Model params: 1.5B ≈ 1.5×10⁹
Medium-quality target (200×): 1.5 × 10⁹ × 200 = 3 × 10¹¹ tokens = 300B tokens
Lower research target (100×): 150B tokens
Aggressive large run (300×): 450B tokens
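
The multipliers above reduce to a single multiplication. The following is a minimal sketch of that arithmetic; the function name `token_budget` is hypothetical and not part of nanochat.

```python
def token_budget(n_params: float, tokens_per_param: float) -> float:
    """Return the training-token budget for a given tokens-per-parameter ratio."""
    return n_params * tokens_per_param


N_PARAMS = 1.5e9  # 1.5B parameters

# Multipliers taken from the rule of thumb in this section.
budgets = {mult: token_budget(N_PARAMS, mult) for mult in (100, 200, 300)}

for mult, tokens in budgets.items():
    print(f"{mult}x -> {tokens / 1e9:.0f}B tokens")
```

This reproduces the 150B, 300B, and 450B figures listed above.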

Recommendation for a single MI300X: start with 200–250B tokens, which is enough for meaningful training without overcommitting compute.

2. Training time estimation

Use your current MI300X throughput as reference:

Your 760M model: ~101.5k tok/sec
Compute per token scales roughly linearly with parameter count, so going from 760M to 1.5B roughly doubles the compute and halves token throughput:

Model	Estimated tok/sec
760M	101.5k
1.5B	~50–60k
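
As a rough check on the table, the scaling can be sketched in a few lines. This assumes throughput is inversely proportional to parameter count and ignores attention cost, memory effects, and kernel efficiency; `scale_throughput` is a hypothetical helper name.

```python
def scale_throughput(ref_tok_per_sec: float, ref_params: float, new_params: float) -> float:
    """Estimate tok/sec for a new model size.

    Assumption: compute per token is proportional to parameter count,
    so throughput scales by ref_params / new_params.
    """
    return ref_tok_per_sec * ref_params / new_params


# Reference point from the measured 760M run.
est = scale_throughput(101_500, 760e6, 1.5e9)
print(f"Estimated 1.5B throughput: {est / 1e3:.1f}k tok/sec")
```

Pure linear scaling lands near the low end of the 50–60k range; the 55k figure used below is a somewhat more optimistic midpoint.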

Wall-clock time

Target: 200B tokens
Throughput: 55k tok/sec → 198M tok/hour ≈ 0.198B tok/hour
Time ≈ 200 / 0.198 ≈ 1010 hours ≈ 42 days (single MI300X, running continuously)

Note: gradient accumulation, eval steps, logging, and pauses can add 10–20% overhead. Realistic total: ~45–50 days.

Smaller exploratory run: 50B tokens → ~253 hours ≈ 10.5 days
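
The wall-clock arithmetic can be wrapped in a small helper so other token targets or overhead assumptions are easy to try. This is a sketch only: `training_time` is a hypothetical name, and the 15% overhead is an assumed midpoint of the 10–20% range mentioned above.

```python
def training_time(total_tokens: float, tok_per_sec: float, overhead: float = 0.0):
    """Return (hours, days) of wall-clock time.

    overhead is a fractional slowdown, e.g. 0.15 for 15% extra time
    spent on evals, logging, and pauses.
    """
    seconds = total_tokens / tok_per_sec * (1.0 + overhead)
    hours = seconds / 3600
    return hours, hours / 24


hours, days = training_time(200e9, 55_000)
print(f"ideal: {hours:.0f} h ({days:.1f} d)")

# Assumed 15% overhead, the midpoint of the 10-20% range.
hours_oh, days_oh = training_time(200e9, 55_000, overhead=0.15)
print(f"with 15% overhead: {hours_oh:.0f} h ({days_oh:.1f} d)")
```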

3. Practical adjustments

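Sequence length: 2048 is fine; moving to 4096 doubles activation memory per sequence, so the per-device batch (in sequences) must be halved, and the extra attention compute lowers tok/sec.
Batch size: use the largest batch that fits in memory; larger batches generally improve MFU (model FLOPs utilization).
FP8 + FA2 (FlashAttention-2): keep both enabled for best throughput.
Checkpointing: save frequently (every 5–10B tokens) so that a failure late in a long run costs little progress.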
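
To see what the checkpoint interval means in practice, it helps to convert it from tokens to hours. The sketch below assumes the 55k tok/sec estimate and a 200B-token run; `checkpoint_schedule` is a hypothetical helper, not a nanochat function.

```python
def checkpoint_schedule(total_tokens: int, tokens_per_ckpt: int, tok_per_sec: float):
    """Return (number of checkpoints, hours between checkpoints)."""
    n_ckpts = -(-total_tokens // tokens_per_ckpt)  # ceiling division
    hours_between = tokens_per_ckpt / tok_per_sec / 3600
    return n_ckpts, hours_between


TOTAL_TOKENS = 200 * 10**9
TOK_PER_SEC = 55_000  # assumed 1.5B throughput from the estimate in section 2

for interval in (5 * 10**9, 10 * 10**9):
    n, hours = checkpoint_schedule(TOTAL_TOKENS, interval, TOK_PER_SEC)
    print(f"every {interval // 10**9}B tokens: {n} checkpoints, one every {hours:.1f} h")
```

At this throughput a 5B-token interval works out to roughly one checkpoint per day, so the worst-case loss from a crash is about a day of compute.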

4. Summary Table

Model	Target Tokens	Estimated Tok/sec	Time (hours)	Notes
1.5B	50B	55k	253 (~10.5d)	exploratory
1.5B	200B	55k	1010 (~42d)	research-quality
1.5B	300B	55k	1515 (~63d)	higher quality

Real throughput will fluctuate depending on exact batch/seq/config.
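
Because the estimate is sensitive to throughput, a quick sweep over the 50–60k tok/sec range shows how far the 200B-token figure can move. This is an illustrative sketch; `days_to_train` is a hypothetical helper and no overhead is included.

```python
def days_to_train(total_tokens: float, tok_per_sec: float) -> float:
    """Wall-clock days at a constant throughput, with no overhead."""
    return total_tokens / tok_per_sec / 86_400  # 86,400 seconds per day


TARGET_TOKENS = 200e9

# Low, middle, and high ends of the estimated 1.5B throughput range.
sweep = {tps: days_to_train(TARGET_TOKENS, tps) for tps in (50_000, 55_000, 60_000)}

for tps, days in sweep.items():
    print(f"{tps / 1e3:.0f}k tok/sec -> {days:.1f} days")
```

Across that range the ideal run time spans roughly 39 to 46 days, before adding any overhead.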

If you want, I can draft a ready-to-run nanochat config for 1.5B optimized for MI300X, including batch, grad accumulation, sequence length, and checkpoint schedule to finish 200B tokens efficiently.
