Distributed Training Options Overview | Generated by AI
Now I have the full picture. Here’s what we’re working with:
MACHINE COMPARISON
==================

192.168.1.47 (xiaoxin@xx)
  CPU:   Intel i5-12450H, 8 cores / 12 threads
  RAM:   16 GB
  GPU:   Intel UHD Graphics (integrated, NO CUDA)
  Torch: 2.12.0+cpu

192.168.1.36 (lzw@)
  CPU:   Intel i9-13900KF, 24 cores / 32 threads
  RAM:   62 GB
  GPU:   RTX 4070 12GB (CUDA capable)
  Torch: 2.9.1+cu128

nanochat already has DDP support built in (torchrun, NCCL backend).
Key constraint: nanochat's compute_init() uses the NCCL backend for DDP, and NCCL requires CUDA. The 192.168.1.47 machine has no CUDA device at all, so true multi-node DDP between these two machines isn't straightforward with the existing code.
Here are your options:
OPTION 1: Single-node multi-process CPU DDP on 192.168.1.36 (easiest, most educational)
The i9-13900KF has 24 cores, so you can launch multiple CPU workers with the Gloo backend:
bash
# On 192.168.1.36, you'd need to patch compute_init() to use Gloo when device=cpu
# Then run:
torchrun --nproc_per_node=4 --master_addr=127.0.0.1 --master_port=29500 \
    -m scripts.base_train --device-type=cpu --depth=4 --max-seq-len=512 \
    --device-batch-size=1 --total-batch-size=512 --num-iterations=20
However, the code currently skips DDP initialization when device_type != "cuda" (the check in compute_init is: if is_ddp_requested and device_type == "cuda"), so this needs a small patch.
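The patch mentioned here could look roughly like the sketch below. It is a minimal illustration and not nanochat's actual code: choose_backend, ddp_requested and init_ddp_if_requested are hypothetical names, and checking for torchrun's environment variables is only an assumption about what is_ddp_requested does. The idea is simply to pick Gloo for CPU runs and keep NCCL for CUDA runs, instead of skipping DDP whenever the device is not CUDA.

```python
import os


def choose_backend(device_type: str) -> str:
    """Pick a torch.distributed backend: NCCL needs CUDA, Gloo works on CPU."""
    if device_type == "cuda":
        return "nccl"
    if device_type == "cpu":
        return "gloo"
    raise ValueError(f"no DDP backend chosen for device type {device_type!r}")


def ddp_requested() -> bool:
    """Assumption: DDP counts as requested when torchrun's env vars are set."""
    return all(k in os.environ for k in ("RANK", "LOCAL_RANK", "WORLD_SIZE"))


def init_ddp_if_requested(device_type: str) -> bool:
    """Initialise the process group for CUDA or CPU runs; return True if DDP is on."""
    if not ddp_requested():
        return False
    # Imported lazily so the helpers above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    if device_type == "cuda":
        # One GPU per local process, as torchrun assigns LOCAL_RANK.
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # torchrun exports MASTER_ADDR and MASTER_PORT, so the default env:// init works.
    dist.init_process_group(backend=choose_backend(device_type))
    return True
```

With something along these lines in place, the torchrun command above would start four Gloo-backed CPU workers rather than skipping DDP initialization.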
OPTION 2: Multi-node CPU DDP across both machines (most "distributed" learning)
Requires patching nanochat to support the Gloo backend for CPU DDP, then launching torchrun on both machines with a shared rendezvous endpoint (the same master address and port on each node).
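To make the two-node launch concrete, here is a small sketch that builds the torchrun command for each machine. Several things are assumptions rather than facts from the setup: that 192.168.1.36 hosts the rendezvous, that port 29500 is reachable from the other machine, that one process per node is used, and that the training script wants the total batch size (in tokens) to be a multiple of the tokens processed per step across all ranks. The helper name torchrun_command is hypothetical, and the Gloo patch is assumed to be applied on both machines.

```python
import shlex
from typing import List

MASTER_ADDR = "192.168.1.36"  # assumption: the i9 machine hosts the rendezvous
MASTER_PORT = 29500           # assumption: this port is open between the machines


def torchrun_command(node_rank: int, nnodes: int = 2, nproc_per_node: int = 1,
                     device_batch_size: int = 1, max_seq_len: int = 512) -> List[str]:
    """Build the torchrun argv for one node of a multi-node CPU run."""
    world_size = nnodes * nproc_per_node
    # Assumption: total batch size must be a multiple of the tokens per step
    # summed over all ranks, so use exactly one step's worth here.
    total_batch_size = device_batch_size * max_seq_len * world_size
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={MASTER_ADDR}",
        f"--master_port={MASTER_PORT}",
        "-m", "scripts.base_train",
        "--device-type=cpu",
        "--depth=4",
        f"--max-seq-len={max_seq_len}",
        f"--device-batch-size={device_batch_size}",
        f"--total-batch-size={total_batch_size}",
        "--num-iterations=20",
    ]


if __name__ == "__main__":
    # Node rank 0 runs on the rendezvous host, node rank 1 on the other machine.
    for rank, host in enumerate(["192.168.1.36", "192.168.1.47"]):
        print(f"# run on {host}")
        print(shlex.join(torchrun_command(rank)))
```

Both commands must be started at roughly the same time, since each node waits at the rendezvous until the other has joined.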
OPTION 3: Just use the RTX 4070 on 1.36 with single-GPU DDP (practical)
bash
ssh lzw@192.168.1.36
cd /mnt/data/nanochat
source .venv/bin/activate
torchrun --nproc_per_node=1 -m scripts.base_train --depth=8 ...
This is what the existing run_rtx4070.sh already does.