Distributed Training Options Overview | Generated by AI

2026.06

Now I have the full picture. Here’s what we’re working with:

MACHINE COMPARISON
==================

192.168.1.47 (xiaoxin@xx)
  CPU:  Intel i5-12450H, 8 cores / 12 threads
  RAM:  16 GB
  GPU:  Intel UHD Graphics (integrated, NO CUDA)
  Torch: 2.12.0+cpu

192.168.1.36 (lzw@)
  CPU:  Intel i9-13900KF, 24 cores / 32 threads
  RAM:  62 GB
  GPU:  RTX 4070 12GB (CUDA capable)
  Torch: 2.9.1+cu128

nanochat already has DDP support built in (torchrun, NCCL backend).


Key constraint: nanochat's compute_init() uses the NCCL backend for DDP, and NCCL requires CUDA. The 1.47 machine has no CUDA device at all, so true multi-node DDP between these two machines is not possible with the existing code unless the backend is changed.

Here are your options:



OPTION 1: Single-node multi-process CPU DDP on 1.36 (easiest, most educational)

The i9-13900KF has 24 cores, so you can launch several CPU worker processes on it and have them communicate through the Gloo backend:

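bash
# On 192.168.1.36, first patch compute_init() to use Gloo when device=cpu.
# Then run. 4 ranks x 1 sequence x 512 tokens = 2048 tokens per step, so the
# total batch size must be at least 2048 (a 512-token total cannot be split
# across 4 ranks).
torchrun --nproc_per_node=4 --master_addr=127.0.0.1 --master_port=29500 \
    -m scripts.base_train --device-type=cpu --depth=4 --max-seq-len=512 \
    --device-batch-size=1 --total-batch-size=2048 --num-iterations=20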


However, the code currently skips DDP initialization whenever device_type != "cuda" (the check in compute_init is: if is_ddp_requested and device_type == "cuda"), so the command above will not actually run as DDP until that check gets a small patch.
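The small patch could look roughly like the sketch below. It is not nanochat's actual code: the helper names select_backend, ddp_requested and init_ddp are hypothetical, and the real compute_init() does more than this. The only idea shown is choosing the backend from the device type instead of refusing DDP on CPU.

```python
import os


def select_backend(device_type: str) -> str:
    """Map a device type to a torch.distributed backend name."""
    if device_type == "cuda":
        return "nccl"  # NCCL only communicates CUDA tensors
    if device_type == "cpu":
        return "gloo"  # Gloo supports collectives on CPU tensors
    raise ValueError(f"no DDP backend chosen for device type {device_type!r}")


def ddp_requested(env=os.environ) -> bool:
    """torchrun exports these variables to every worker it launches."""
    return all(key in env for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE"))


def init_ddp(device_type: str) -> bool:
    """Initialize the process group if launched by torchrun.

    Hypothetical stand-in for the DDP branch of compute_init().
    Returns True when a process group was created.
    """
    if not ddp_requested():
        return False
    import torch.distributed as dist  # imported lazily, only needed under torchrun

    # The default env:// init reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
    dist.init_process_group(backend=select_backend(device_type))
    return True
```

With a change along these lines, the same torchrun command works for both device types: CUDA runs keep NCCL, and CPU runs fall back to Gloo.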



OPTION 2: Multi-node CPU DDP across both machines (most "distributed" learning)

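This requires the same kind of patch, so that nanochat uses the Gloo backend for CPU DDP. After that, you launch torchrun on both machines with the same rendezvous settings (master address and port), giving each machine its own node rank.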
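To make the two-machine launch concrete, here is a minimal sketch that prints the torchrun command for each node. Several things are assumptions rather than facts from the setup above: that 192.168.1.36 acts as the rendezvous host, that port 29500 is reachable from the other machine, and that each node runs 2 worker processes. Only --device-type=cpu is carried over from the single-node command; the remaining training flags are left out.

```python
import shlex

MASTER_ADDR = "192.168.1.36"  # assumption: the i9 machine hosts the rendezvous
MASTER_PORT = 29500           # assumption: same port as the single-node example
NODES = ["192.168.1.36", "192.168.1.47"]  # node_rank follows this order


def torchrun_command(node_rank: int, nproc_per_node: int = 2) -> str:
    """Build the torchrun command line for one node of the two-node job."""
    args = [
        "torchrun",
        f"--nnodes={len(NODES)}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={MASTER_ADDR}",
        f"--master_port={MASTER_PORT}",
        "-m", "scripts.base_train",
        "--device-type=cpu",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    for rank, host in enumerate(NODES):
        print(f"# run on {host}")
        print(torchrun_command(rank))
```

Both commands share the master address and port and differ only in --node_rank, which is what lets the workers on the two machines find each other.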



OPTION 3: Just use the RTX 4070 on 1.36 with single-GPU DDP (practical)

bash
ssh lzw@192.168.1.36
cd /mnt/data/nanochat
source .venv/bin/activate
torchrun --nproc_per_node=1 -m scripts.base_train --depth=8 ...


This is what the existing run_rtx4070.sh already does.
