Distributed Training Launch Command Breakdown | Generated by AI
Question: Explain each argument in this SSH + tmux + torchrun command that launches the master node of a 2-node distributed CPU training run for nanochat.
Answer:
This is a layered command: ssh → tmux → shell → torchrun → base_train. Breaking it down layer by layer.
Layer 1 — SSH wrapper
ssh lzw@<WORKSTATION_IP> "..."
Runs the quoted string as a remote shell command on your Ubuntu workstation. The trailing `&& echo OK` prints `OK` only if tmux session creation succeeded, which gives a cheap success signal back to your Mac. The `0.6` at the very end is not part of the intended command; it looks like a stray paste artifact (possibly a Warp timing annotation or shell history fragment). Because it sits after the closing quote, ssh receives it as a separate argument, and ssh joins all of its command arguments with spaces before handing them to the remote shell. The remote therefore runs `... && echo OK 0.6` and prints `OK 0.6`. Harmless, but worth deleting.
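To make the stray-argument behaviour concrete, here is a minimal sketch of how the remote command line is assembled: the arguments after the host are joined with single spaces. The helper name `remote_command` and the shortened command string are illustrative only, not part of ssh or of the original command.

```python
def remote_command(args_after_host):
    """Mimic how ssh builds the remote command: join the arguments with spaces."""
    return " ".join(args_after_host)


# The local shell strips the double quotes, so ssh sees two separate arguments:
# the quoted command string and the stray token that followed it.
quoted = "tmux new-session -d -s dist-master '...' && echo OK"
print(remote_command([quoted, "0.6"]))
```

The printed string ends in `&& echo OK 0.6`, which is what the remote shell actually executes.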
Layer 2 — tmux session management
tmux kill-session -t dist-master 2>/dev/null
Idempotency: kill any prior session named dist-master so re-running the command doesn’t fail with “duplicate session”. 2>/dev/null swallows the “session not found” error on first run.
tmux new-session -d -s dist-master '...'
- `-d`: detached. Create the session but don’t attach a terminal to it. Critical because SSH will disconnect; tmux keeps the training process alive after the SSH session ends.
- `-s dist-master`: session name, so you can later run `tmux attach -t dist-master` to watch live.
- The quoted string is the command the session runs. When it exits, the session dies, hence the `sleep 600` at the end: it keeps the pane alive for 10 minutes after training finishes so you can attach and read the final output or error before tmux reaps it.
2>&1 | tee /tmp/dist_master.log
Merge stderr into stdout (torchrun and PyTorch log heavily to stderr), then tee duplicates the stream to both the tmux pane and a log file you can tail -f from another SSH session.
Layer 3 — Environment variables
| Var | Purpose |
|---|---|
| `NANOCHAT_BASE_DIR=~/.cache/nanochat` | nanochat’s root for tokenizer, checkpoints, eval bundles |
| `NANOCHAT_DATA_DIR=.../base_data_small` | Points pretraining at your small FineWeb shard subset instead of the full dataset |
| `NANOCHAT_FORCE_SDPA=1` | Forces PyTorch `scaled_dot_product_attention` instead of FlashAttention; necessary on CPU, where FA kernels don’t exist |
| `OMP_NUM_THREADS=8` | Caps OpenMP threads per process for CPU matmuls. Without this, each rank tries to grab all cores; with 2 nodes × 1 proc each, 8 threads avoids oversubscription |
| `PYTHONUNBUFFERED=1` | Disables Python stdout buffering so `tee` and the log file get lines in real time, not in 8KB chunks |
| `GLOO_SOCKET_IFNAME=enp4s0` | Tells the Gloo backend (the CPU collective backend) which network interface to bind. Without it, Gloo may pick `lo` or a Docker bridge and rank 1 can’t connect |
| `NCCL_SOCKET_IFNAME=enp4s0` | Same for NCCL. Harmless here since CPU training uses Gloo, but cheap insurance if you flip to GPU |
The IFNAME pair is the most failure-prone part of multi-node setups. Wrong interface → rendezvous hangs forever at “waiting for workers”.
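A quick pre-flight check can catch a mistyped interface name before the rendezvous hangs. The sketch below uses the standard-library `socket.if_nameindex()` to list local interfaces and compares them with whatever the two IFNAME variables are set to. The helper name `missing_interfaces` is hypothetical, and the check only confirms that the name exists on this machine, not that the interface is the one facing the LAN.

```python
import os
import socket


def missing_interfaces(wanted, available):
    """Return the interface names in `wanted` that are not in `available`."""
    known = set(available)
    return [name for name in wanted if name not in known]


if __name__ == "__main__":
    # socket.if_nameindex() yields (index, name) pairs for local interfaces.
    local = [name for _, name in socket.if_nameindex()]
    wanted = {
        os.environ.get(var)
        for var in ("GLOO_SOCKET_IFNAME", "NCCL_SOCKET_IFNAME")
    }
    wanted.discard(None)  # variable not set in this shell
    print("missing:", missing_interfaces(sorted(wanted), local))
```

Running it on each node with the same environment as the launch command should print an empty list; anything else points at a typo or a differently named NIC on that machine.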
Layer 4 — torchrun (the launcher)
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
--master_addr=<WORKSTATION_IP> --master_port=29500 \
-m scripts.base_train -- ...
- `--nnodes=2`: the world has 2 machines. torchrun blocks until both nodes rendezvous.
- `--nproc_per_node=1`: 1 worker process per machine (1 CPU rank each). Total world size = 2 × 1 = 2.
- `--node_rank=0`: this machine is the master (rank-0 host). Your second machine runs the identical command with `--node_rank=1`.
- `--master_addr` / `--master_port=29500`: the TCP rendezvous point. Rank 0 listens here; rank 1 connects to it. 29500 is the PyTorch default port. Both nodes must pass the same address and port.
- `-m scripts.base_train`: run the training script as a module, so `scripts/` package imports resolve from the nanochat repo root (which is why the `cd` earlier matters).
- `--`: separator. Everything after it goes to `base_train`’s argparse, not torchrun’s.
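One way to keep the two nodes' launch lines in sync is to generate both from a single template. The sketch below does that with the flags listed above; the function name `torchrun_argv` is hypothetical, `<WORKSTATION_IP>` is left as the placeholder used in this answer, and the training arguments are truncated to two for brevity.

```python
def torchrun_argv(node_rank, master_addr, train_args):
    """Build the torchrun argument list for one node of the 2-node run."""
    return [
        "torchrun",
        "--nnodes=2",
        "--nproc_per_node=1",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        "--master_port=29500",
        "-m", "scripts.base_train",
        "--",                      # everything after this goes to base_train
        *train_args,
    ]


train_args = ["--device-type", "cpu", "--depth", "4"]
master = torchrun_argv(0, "<WORKSTATION_IP>", train_args)
worker = torchrun_argv(1, "<WORKSTATION_IP>", train_args)

# The only position where the two commands differ is the node rank.
diff = [(a, b) for a, b in zip(master, worker) if a != b]
print(diff)
```

Printing `diff` makes the intended invariant visible: the two commands are identical except for `--node_rank`.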
torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT env vars in each spawned process; base_train reads these in its dist.init_process_group() call.
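As a rough illustration of what a worker sees, the sketch below parses those five variables from a mapping. The `DistEnv` class and `read_dist_env` function are hypothetical names; this is not nanochat's actual code, and the example values simply mirror the rank-0 process of this run.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_dist_env(env=os.environ):
    """Parse the variables torchrun exports into a typed record."""
    return DistEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )


# What the single worker on the master node would see in this 2-node run.
example = {
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "2",
    "MASTER_ADDR": "<WORKSTATION_IP>",
    "MASTER_PORT": "29500",
}
info = read_dist_env(example)
print(info)
```

In a real script there is usually no need to parse these by hand: `torch.distributed.init_process_group(backend="gloo")` uses the `env://` init method by default and reads `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` itself.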
Layer 5 — base_train args (the model & run config)
- `--device-type cpu`: use the Gloo backend, no CUDA. This is a plumbing test of distributed training, not a real run.
- `--depth 4`: 4 transformer layers. nanochat derives the rest of the architecture from depth:
  - `--aspect-ratio 64`: `model_dim = depth × aspect_ratio = 4 × 64 = 256`. So a 256-dim model.
  - `--head-dim 64`: with `model_dim=256`, that’s `256/64 = 4` attention heads.
- `--max-seq-len 512`: context length of 512 tokens (keeps the attention cost tiny on CPU).
- `--device-batch-size 2`: micro-batch per rank per forward pass: 2 sequences × 512 tokens = 1,024 tokens per step per rank.
- `--total-batch-size 2048`: target tokens per optimizer step. Gradient accumulation is derived: `2048 / (2 ranks × 1024 tokens) = 1` accumulation step. So each optimizer step is exactly one forward/backward per rank, then a Gloo all-reduce.
- `--num-iterations 10`: only 10 optimizer steps. Pure smoke test.
- `--eval-every 5`: run val loss at steps 5 and 10.
- `--eval-tokens 2048`: evaluate on just 2,048 tokens (one tiny val batch).
- `--core-metric-every -1`: disable the CORE benchmark eval entirely (`-1` = never). It’s far too expensive for a 10-step CPU run.
- `--sample-every 10`: generate sample text at step 10 so you can eyeball the (gibberish, at this scale) output.
- `--tracker none`: no wandb/logging backend.
- `--run dist-cpu-2node`: run name, used for checkpoint/log directory naming.
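The arithmetic in the list above can be checked mechanically. The following sketch reproduces it under the assumption that the formulas are exactly as described here; the function name `derive_run_shape` is hypothetical and this is not code taken from nanochat.

```python
def derive_run_shape(depth, aspect_ratio, head_dim, max_seq_len,
                     device_batch_size, total_batch_size, world_size):
    """Recompute the derived quantities described in the argument list."""
    model_dim = depth * aspect_ratio
    num_heads, rem = divmod(model_dim, head_dim)
    if rem:
        raise ValueError("model_dim must be divisible by head_dim")

    tokens_per_rank = device_batch_size * max_seq_len   # one micro-batch
    tokens_per_pass = tokens_per_rank * world_size      # all ranks together
    grad_accum_steps, rem = divmod(total_batch_size, tokens_per_pass)
    if rem or grad_accum_steps == 0:
        raise ValueError("total_batch_size must be a positive multiple "
                         "of device_batch_size * max_seq_len * world_size")

    return {
        "model_dim": model_dim,
        "num_heads": num_heads,
        "tokens_per_rank": tokens_per_rank,
        "grad_accum_steps": grad_accum_steps,
    }


shape = derive_run_shape(depth=4, aspect_ratio=64, head_dim=64,
                         max_seq_len=512, device_batch_size=2,
                         total_batch_size=2048, world_size=2)
print(shape)
```

Changing `world_size` to 1 in the call doubles `grad_accum_steps` to 2, which is a handy way to predict how a single-node dry run of the same config would behave.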
The mental model
The whole thing exists to answer one question: does my 2-node Gloo rendezvous and gradient all-reduce work over the LAN? Every model hyperparameter is shrunk to near-zero (depth 4, dim 256, 10 steps) so the run completes in minutes and any failure is a networking failure, not a compute one. Once `OK` prints and the log shows both ranks stepping in lockstep, you swap `--device-type`, the collective backend (Gloo to NCCL, which is where `NCCL_SOCKET_IFNAME` starts to matter), and the model size for the real run.
One thing to verify: rank 1’s command must match rank 0’s on `--nnodes`, `--master_addr`, `--master_port`, `--total-batch-size`, and all model args. torchrun does not compare hyperparameters across nodes, so divergent configs surface only as shape-mismatch errors or hangs at the first collective.
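A small script can do that comparison before either node is launched. The sketch below parses two command lines into flag/value pairs and reports every flag that differs, ignoring `--node_rank`, which is expected to differ. The function names and the shortened example commands are illustrative assumptions, and the parser only understands `--flag=value` and `--flag value` forms.

```python
import shlex


def parse_flags(argv):
    """Collect --flag=value and --flag value pairs from an argument list."""
    flags = {}
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok.startswith("--") and tok != "--":
            if "=" in tok:
                name, value = tok.split("=", 1)
            elif i + 1 < len(argv) and not argv[i + 1].startswith("--"):
                name, value = tok, argv[i + 1]
                i += 1  # the value token has been consumed
            else:
                name, value = tok, ""
            flags[name] = value
        i += 1
    return flags


def mismatched_flags(argv_a, argv_b, ignore=("--node_rank",)):
    """Return the sorted flag names whose values differ between two commands."""
    a, b = parse_flags(argv_a), parse_flags(argv_b)
    return sorted(
        name for name in a.keys() | b.keys()
        if name not in ignore and a.get(name) != b.get(name)
    )


rank0 = shlex.split("torchrun --nnodes=2 --node_rank=0 --master_port=29500 "
                    "-m scripts.base_train -- --depth 4 --total-batch-size 2048")
rank1 = shlex.split("torchrun --nnodes=2 --node_rank=1 --master_port=29500 "
                    "-m scripts.base_train -- --depth 6 --total-batch-size 2048")
print(mismatched_flags(rank0, rank1))
```

Here the deliberately wrong `--depth 6` on rank 1 is the only flag reported; an empty list means the two commands agree on everything except the node rank.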