GPU SM Limit for Max Autotune | Generated by AI


The warning “Not enough SMs to use max_autotune_gemm” comes from PyTorch’s TorchInductor backend (used in torch.compile with mode="max-autotune" or similar aggressive autotuning modes). It indicates that your GPU has fewer Streaming Multiprocessors (SMs) than the threshold required for the most exhaustive GEMM (matrix multiplication) autotuning options, which search over a huge space of Triton/CUTLASS kernel configurations for peak performance.

PyTorch considers a GPU “big enough” only if its SM count meets a hard-coded threshold in the is_big_gpu check of torch/_inductor/utils.py (68 SMs in recent releases, though the exact value and check have shifted between versions; the RTX 3080/3090 sit at or just above it, while the A100/H100/RTX 4090 are comfortably above). On smaller GPUs (e.g., RTX 3060, 3070, T4), Inductor disables the full max_autotune_gemm path to avoid excessive compilation time or suboptimal kernel choices.
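
To see where your card falls relative to that cutoff, you can query the SM count directly. The is_big_gpu call below uses a private Inductor helper, so treat that part as an assumption that may not hold on every PyTorch version:

    import torch

    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.multi_processor_count} SMs")

    # Optional: ask Inductor directly. is_big_gpu lives in torch/_inductor/utils.py,
    # is a private helper, and its signature/location may change between releases.
    from torch._inductor.utils import is_big_gpu
    print("Inductor considers this GPU 'big':", is_big_gpu(0))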

Why it happens and impact

The warning is informational: compilation still succeeds, but Inductor skips the Triton/CUTLASS GEMM template search and falls back to its regular kernel choice (ATen/cuBLAS) for the matrix multiplications. Correctness is unaffected, and the speed penalty is usually modest (see the closing note below).

How to improve or work around it

  1. Use a GPU with more SMs (best fix for true max performance):
    • Recommended minimum for reliable full max_autotune_gemm: RTX 4090 (128 SMs), A100 (108 SMs), H100 (114–132 SMs depending on variant), or newer datacenter cards.
    • Consumer cards below the cutoff (e.g., RTX 3060 = 28 SMs, RTX 3070 = 46 SMs, T4 = 40 SMs) will trigger this; the RTX 3080 (68 SMs) sits right at it.

    GPU              | Example SM count | Full max_autotune_gemm?
    RTX 3060 / 3070  | 28–48            | No
    RTX 3080 / 3090  | 68–82            | Borderline (at or just above the cutoff)
    RTX 4090         | 128              | Yes
    A100             | 108              | Yes
    H100             | 114–132          | Yes
  2. Change torch.compile mode (no hardware change needed):
    • Use mode="max-autotune-no-cudagraphs" — keeps Inductor's autotuning of non-GEMM kernels but disables CUDA graphs. The SM-gated GEMM template search is still skipped on a small GPU (that is exactly what the warning reports), yet in practice this mode is often nearly as fast as full max-autotune with much shorter compile times.
    • Or mode="reduce-overhead" — lighter: it skips GEMM autotuning entirely (so the warning goes away) and uses CUDA graphs for low launch latency; good for inference.
    • Example:
      compiled_model = torch.compile(model, mode="max-autotune-no-cudagraphs", fullgraph=True)
      
  3. Allow TF32 matmuls (helps in any mode, on Ampere or newer GPUs):
    torch.set_float32_matmul_precision("high")  # "high" enables TF32; default "highest" keeps full FP32

    This lets cuBLAS (and Inductor) use TensorFloat-32 Tensor Core kernels for float32 matmuls, trading a small amount of precision for speed.

  4. Force more aggressive tuning anyway (hacky, not officially supported):
    • Monkey-patch the check in PyTorch (edit torch/_inductor/utils.py or patch it at runtime) so the SM threshold no longer applies — see the hedged sketch after this list. Risky: it may hurt performance or blow up compile time.
  5. Other general tips for better alignment/efficiency (unrelated to this exact warning but addresses your padding concern):
    • Pad batch/sequence dimensions to multiples of 8/16/32 if possible (common for Tensor Cores); a minimal padding helper is sketched after this list.
    • Increase batch size if your workload allows — larger GEMMs utilize the GPU better.
    • For inference-focused workloads, consider alternatives like vLLM, TensorRT-LLM, or Hugging Face Optimum if torch.compile limitations persist.
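
For step 4, here is a minimal sketch of the runtime-patch approach, assuming the gate is still the private is_big_gpu helper in torch/_inductor/utils.py (an internal detail that can move, change signature, or be imported elsewhere in ways this patch does not reach):

    import torch
    import torch.nn as nn
    import torch._inductor.utils as inductor_utils

    # Force Inductor's "big GPU" check to pass. Private API: this may break or
    # be silently ineffective on other PyTorch versions.
    inductor_utils.is_big_gpu = lambda *args, **kwargs: True

    model = nn.Linear(1024, 1024).cuda()
    compiled = torch.compile(model, mode="max-autotune")
    out = compiled(torch.randn(8, 1024, device="cuda"))  # expect a long first call (compile + tuning)

Benchmark before and after: on a small GPU the tuned Triton/CUTLASS kernels may end up no faster than the cuBLAS fallback, which is exactly why the gate exists.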
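
For the padding tip in step 5, a generic sketch (pad_to_multiple is a hypothetical helper name, not a PyTorch API) that zero-pads one dimension up to the next multiple of 8:

    import torch
    import torch.nn.functional as F

    def pad_to_multiple(x: torch.Tensor, dim: int, multiple: int = 8) -> torch.Tensor:
        """Zero-pad dimension `dim` of `x` on the right up to the next multiple."""
        pad_amount = (-x.size(dim)) % multiple
        if pad_amount == 0:
            return x
        # F.pad takes (left, right) pairs starting from the last dimension.
        pad = [0, 0] * (x.dim() - 1 - dim) + [0, pad_amount]
        return F.pad(x, pad)

    tokens = torch.randn(4, 37, 768)            # sequence length 37
    padded = pad_to_multiple(tokens, dim=1)     # shape becomes (4, 40, 768)

If you pad activations this way, remember to mask the padded positions downstream (e.g., in attention or in the loss).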

In most real-world cases, the performance difference from missing full max_autotune_gemm is small (a few percent), and max-autotune-no-cudagraphs gets you 95%+ of the way there on consumer GPUs.


