Master-Worker torchrun Command Differences | Generated by AI
Question: What are the differences between the master-node and worker-node torchrun launch commands in this 2-node CPU distributed nanochat training setup?
Answer:
Diff view (master → worker):
- ssh lzw@<MASTER_IP> tmux session: dist-master
+ ssh xiaoxin@<WORKER_IP> tmux session: dist-worker
- cd /mnt/data/nanochat
+ cd ~/projects/nanochat
- export NANOCHAT_DATA_DIR=~/.cache/nanochat/base_data_small
+ (not set on worker)
- export OMP_NUM_THREADS=8
+ export OMP_NUM_THREADS=4
- export GLOO_SOCKET_IFNAME=enp4s0 # wired ethernet
+ export GLOO_SOCKET_IFNAME=wlp3s0 # WiFi
- --node_rank=0
+ --node_rank=1
- tee /tmp/dist_master.log
+ tee /tmp/dist_worker.log
Everything else — --nnodes=2 --nproc_per_node=1 --master_addr --master_port=29500, and all training args (--depth 4 --aspect-ratio 64 --head-dim 64 --max-seq-len 512 --device-batch-size 2 --total-batch-size 2048 --num-iterations 10 etc.) — is identical, which is correct: torchrun requires matching world topology and the training script requires matching hyperparameters across ranks.
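One way to check that claim mechanically is to parse both launch lines and list the flags whose values differ. The sketch below is a hypothetical helper (parse_flags and differing_flags are illustrative names, not part of torchrun or nanochat). The two command strings are abbreviated stand-ins that carry only flags quoted above, with MASTER_IP as a placeholder value.

```python
import shlex


def parse_flags(command: str) -> dict[str, str]:
    """Collect '--key=value' and '--key value' pairs; bare flags map to ''."""
    tokens = shlex.split(command)
    flags: dict[str, str] = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            if "=" in token:
                key, value = token[2:].split("=", 1)
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                key, value = token[2:], tokens[i + 1]
                i += 1  # the next token was consumed as this flag's value
            else:
                key, value = token[2:], ""
            flags[key] = value
        i += 1
    return flags


def differing_flags(master_cmd: str, worker_cmd: str) -> dict[str, tuple]:
    """Map each flag whose value differs to (master_value, worker_value)."""
    master, worker = parse_flags(master_cmd), parse_flags(worker_cmd)
    return {
        key: (master.get(key), worker.get(key))
        for key in sorted(master.keys() | worker.keys())
        if master.get(key) != worker.get(key)
    }


# Abbreviated stand-ins for the two launch lines (flags taken from the text above).
COMMON = (
    "--nnodes=2 --nproc_per_node=1 --master_addr=MASTER_IP --master_port=29500 "
    "--depth 4 --max-seq-len 512 --device-batch-size 2 --total-batch-size 2048"
)
master_cmd = f"{COMMON} --node_rank=0"
worker_cmd = f"{COMMON} --node_rank=1"

print(differing_flags(master_cmd, worker_cmd))  # {'node_rank': ('0', '1')}
```

Any key other than node_rank in the result would point to a launch mismatch between the two hosts.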
What each difference means:
- --node_rank=0 vs 1: the only torchrun arg that must differ. Rank 0 hosts the rendezvous (TCPStore) at master_addr:29500; rank 1 connects to it. Global ranks become 0 and 1 since nproc_per_node=1.
- NANOCHAT_DATA_DIR only on master: potential bug. In DDP, every rank loads its own data shard; nanochat’s dataloader uses ddp_rank to stride through shards. If the worker falls back to a default data dir that’s empty or different, rank 1 will either crash on missing files or train on different data. Check the worker has the same base_data_small shards at whatever its default path resolves to ($NANOCHAT_BASE_DIR/base_data typically). Safer: set it explicitly on both.
- GLOO_SOCKET_IFNAME / NCCL_SOCKET_IFNAME: these pin the collective backend to a specific NIC. Master uses wired (enp4s0), worker uses WiFi (wlp3s0). Since this is --device-type cpu, only Gloo matters; NCCL_SOCKET_IFNAME is inert here. The WiFi link will be your gradient all-reduce bottleneck: even a tiny depth-4 model syncs every step, and WiFi latency (~2-10 ms RTT, plus jitter) will dominate. Expect the wired→WiFi hop to cap throughput well below what either CPU can compute.
- OMP_NUM_THREADS=8 vs 4: per-machine tuning for core counts. Fine, but note DDP waits for the slowest rank every step, so the 4-thread worker sets the pace. If the worker box has spare cores, raising this directly speeds up the whole job.
- Paths/users/session names: cosmetic, per-host layout.
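The rank and data-sharding points in the list can be made concrete with a small sketch. It assumes a static launch with the same nproc_per_node on every node, where a process's global rank is node_rank * nproc_per_node + local_rank. It also uses a generic rank-strided split (shards[rank::world_size]) as a stand-in for nanochat's dataloader, whose actual striding logic is not reproduced here. The shard file names are hypothetical.

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank under a static launch with equal nproc_per_node on all nodes."""
    return node_rank * nproc_per_node + local_rank


def shards_for_rank(shards: list[str], rank: int, world_size: int) -> list[str]:
    """Generic rank-strided split: rank r takes shards r, r + world_size, ..."""
    return sorted(shards)[rank::world_size]


# Hypothetical shard names; the real files live under the data dir on each host.
master_dir = [
    "shard_00000.parquet",
    "shard_00001.parquet",
    "shard_00002.parquet",
    "shard_00003.parquet",
]
worker_dir = ["shard_00000.parquet", "shard_00001.parquet"]  # incomplete copy

world_size = 2  # nnodes=2 times nproc_per_node=1
worker_rank = global_rank(node_rank=1, nproc_per_node=1, local_rank=0)

# What rank 1 would read from a complete copy vs. from the incomplete one.
expected = shards_for_rank(master_dir, worker_rank, world_size)
actual = shards_for_rank(worker_dir, worker_rank, world_size)

print("rank", worker_rank, "should read:", expected)
print("rank", worker_rank, "will read:", actual)
print("missing on worker:", sorted(set(expected) - set(actual)))
```

With these made-up directories, rank 1 silently loses shard_00003.parquet, which is the "train on different data" case rather than a crash.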
Quick sanity check before the next run:
# verify worker actually has data where it expects
ssh xiaoxin@<WORKER_IP> 'ls ~/.cache/nanochat/base_data* 2>/dev/null | head'
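Listing the directory only shows that something is there. A slightly stronger check, sketched below under the assumption that shards are plain files in a single directory, is to build a name-and-size manifest on each host and compare the two fingerprints. shard_manifest and manifest_digest are hypothetical helpers, and the fallback path is simply the master's value from the diff above.

```python
import hashlib
import json
import os


def shard_manifest(data_dir: str) -> list[tuple[str, int]]:
    """Sorted (file name, size in bytes) pairs for regular files in data_dir."""
    root = os.path.expanduser(data_dir)
    if not os.path.isdir(root):
        return []  # missing directory: nothing for this rank to read
    return sorted(
        (entry.name, entry.stat().st_size)
        for entry in os.scandir(root)
        if entry.is_file()
    )


def manifest_digest(manifest: list[tuple[str, int]]) -> str:
    """Short fingerprint of a manifest, easy to compare across hosts by eye."""
    payload = json.dumps(manifest).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# Use the exported variable if present, else the master's path from the diff.
data_dir = os.environ.get("NANOCHAT_DATA_DIR", "~/.cache/nanochat/base_data_small")
manifest = shard_manifest(data_dir)
print(len(manifest), "files, digest", manifest_digest(manifest))
```

Running the same script on both machines and comparing the printed digests shows whether the two ranks see the same set of files. It does not verify file contents, only names and sizes.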
If gradient sync over WiFi is killing step time, you can measure it: compare tok/s from a single-node run (--nnodes=1) on the master vs the 2-node run — the gap is almost entirely Gloo all-reduce over the wireless hop.
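To turn that comparison into per-step numbers, throughput can be converted to step time. The sketch assumes --total-batch-size 2048 counts tokens per optimizer step, so a step takes 2048 / tok_per_sec seconds. The tok/s figures are made-up placeholders, not measurements from this setup, and step_seconds and step_gap_seconds are hypothetical helper names.

```python
TOTAL_BATCH_TOKENS = 2048  # from --total-batch-size; assumed to be tokens per step


def step_seconds(tok_per_sec: float, tokens_per_step: int = TOTAL_BATCH_TOKENS) -> float:
    """Wall-clock seconds per optimizer step at a given throughput."""
    if tok_per_sec <= 0:
        raise ValueError("tok_per_sec must be positive")
    return tokens_per_step / tok_per_sec


def step_gap_seconds(single_node_tok_s: float, two_node_tok_s: float) -> float:
    """Extra seconds per step in the 2-node run (negative if it is faster)."""
    return step_seconds(two_node_tok_s) - step_seconds(single_node_tok_s)


# Placeholder readings; substitute the tok/s values printed by each run.
single_node_tok_s = 1024.0
two_node_tok_s = 512.0

gap = step_gap_seconds(single_node_tok_s, two_node_tok_s)
print(f"single-node step: {step_seconds(single_node_tok_s):.2f} s")
print(f"two-node step:    {step_seconds(two_node_tok_s):.2f} s")
print(f"gap per step:     {gap:.2f} s")
```

Attributing the whole gap to all-reduce is a simplification: since DDP waits for the slowest rank, a slower worker CPU also shows up in the same number.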