Understanding PyTorch Compile Process and Timing | Generated by AI


Here’s what’s happening under the hood:

torch.compile() — What It Does

torch.compile() transforms your model into optimized GPU kernels via TorchInductor. It is just-in-time compilation, but not in the lightweight sense: the first iterations trigger several distinct phases:

Phase 1: Graph Tracing (what you saw at iter 0). TorchDynamo captures your Python code into an FX graph.

Phase 2: Kernel Compilation (what's happening now). TorchInductor lowers the graph and compiles the generated kernels.

Phase 3: Code Generation & Optimization. Kernels are fused, tuned, and cached for reuse.
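As an illustrative sketch (the function `step` below is a stand-in, not your training step): the default backend runs all three phases, while `backend="eager"` stops after graph capture, which can help separate tracing cost from kernel-compilation cost:

```python
import torch

def step(x):
    # Stand-in for a real forward pass.
    return torch.sin(x) + torch.cos(x)

# Default backend ("inductor") runs all three phases on first call:
# graph capture, lowering, and kernel code generation.
fast_step = torch.compile(step)

# backend="eager" performs only Phase 1 (graph capture) and then runs the
# captured graph eagerly -- useful for isolating tracing issues.
traced_step = torch.compile(step, backend="eager")

x = torch.randn(8)
assert torch.allclose(traced_step(x), step(x))  # results match eager mode
```

Defining the compiled function is cheap; the expensive work happens lazily on the first call with a given input shape.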

Why It’s Slow Right Now

The compile workers are still building kernels in the background. Your GPU is at 100% utilization because it is running the unoptimized model while simultaneously compiling new kernels. That's why step times are still high even though every other metric looks healthy.

When It’ll Get Fast

Once all workers finish (typically 2-10 minutes after training starts for a 1.5B model), you'll see step times drop sharply as the compiled kernels take over.

Your Current State

GPU: 100% Util ✓ — working, not idle
Power: 691W ✓ — under load
Memory: 77GB ✓ — model loaded correctly
Workers: 24 compile workers running — waiting for completion

Everything looks correct. The 21s/step is temporary compilation overhead. By iter 100-200 you’ll likely see the real speed.

One More Thing

If compile = True but you'd rather skip the wait, you could set compile = False: training starts at its full (uncompiled) speed immediately, but every step stays slower. Since you have 4-5 hours of training ahead, the compile wait (5-10 min) is worth it; it speeds up every subsequent step by an estimated 5-10x. 🦞
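That trade-off can be sanity-checked with rough arithmetic. All numbers below are the estimates quoted in this post (21 s/step, 5-10x speedup, 5-10 min compile wait, 4-5 h run), not measurements:

```python
# Break-even sketch using the conservative ends of each estimate above.
uncompiled_step_s = 21.0
speedup = 5.0               # low end of the claimed 5-10x
compile_wait_s = 10 * 60    # high end of the 5-10 min wait
train_hours = 4.0           # low end of the 4-5 h run

total_steps = train_hours * 3600 / uncompiled_step_s
uncompiled_total_s = total_steps * uncompiled_step_s
compiled_total_s = compile_wait_s + total_steps * uncompiled_step_s / speedup

print(f"without compile: {uncompiled_total_s / 3600:.1f} h")  # → 4.0 h
print(f"with compile:    {compiled_total_s / 3600:.1f} h")    # → 1.0 h
```

Even with the worst-case wait and the most conservative speedup, compiling pays for itself many times over on a run this long.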
