Fitting Qwen 122B on One H100 | Generated by AI


Honestly? It’s gonna be tight.

The math:

An H100 has 80 GB of VRAM, and the Q4-class weights for a 122B model come to roughly 81 GB — so you’re about 1 GB short for the model weights alone, and that’s before overhead (KV cache, context buffer, CUDA kernels). Even with a short context window, you’d need another few GB for the runtime.
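As a back-of-the-envelope sketch (the 81 GB weight figure and the 4 GB runtime overhead are rough assumptions, not measured numbers):

```python
# Rough VRAM budget check for a Q4-class 122B model on one H100.
H100_VRAM_GB = 80          # 80 GB H100 variant
WEIGHTS_GB = 81            # assumed size of the quantized weights
RUNTIME_OVERHEAD_GB = 4    # assumed KV cache + context buffer + CUDA kernels

required = WEIGHTS_GB + RUNTIME_OVERHEAD_GB
shortfall = required - H100_VRAM_GB
print(f"Need ~{required} GB, have {H100_VRAM_GB} GB -> short by ~{shortfall} GB")
```

Even a generous rounding doesn't close the gap: the weights alone already exceed the VRAM before any runtime overhead.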

Your options:

  1. Offload some layers to CPU/RAM — you have 233 GB of RAM to spare. Ollama can split the model across GPU and CPU, but the CPU-offloaded layers will be much slower. Expect decent token rates for generation with maybe 10-15 layers on CPU, but don’t expect blazing speed.
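If you want to control the split yourself rather than rely on Ollama's automatic placement, you can pin the GPU layer count via a Modelfile. A minimal sketch — the `qwen3.5:122b` tag and the layer count of 75 are assumptions you'd tune for your setup:

```
# Modelfile — assumes the qwen3.5:122b tag is available locally
FROM qwen3.5:122b
# num_gpu = number of layers to place on the GPU; the rest go to system RAM.
# 75 is a guess — raise it until you run out of VRAM, then back off.
PARAMETER num_gpu 75
```

Then `ollama create qwen-split -f Modelfile` followed by `ollama run qwen-split` to use the pinned split.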

  2. Try a smaller quant — if someone publishes a Q2_K or IQ3 variant, it’d fit in VRAM. But quality drops noticeably below Q4.
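To see why a smaller quant would fit, here's a rough size estimate using approximate bits-per-weight figures for llama.cpp-style quants (these are ballpark community numbers, not exact):

```python
# Approximate on-disk/VRAM size for a 122B-parameter model at smaller quants.
PARAMS_B = 122  # billions of parameters
bits_per_weight = {"Q2_K": 2.6, "IQ3_XS": 3.3}  # rough averages

for name, bpw in bits_per_weight.items():
    gb = PARAMS_B * bpw / 8  # billions of params * bits / 8 = GB
    print(f"{name}: ~{gb:.0f} GB")
```

Both land comfortably under 80 GB, leaving room for KV cache — the tradeoff is the quality loss you'd expect below Q4.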

  3. Use a smaller model — something like qwen3.5:35b at Q4_K_M (~20 GB) would run entirely in VRAM and fly on an H100.

My recommendation: Pull qwen3.5:122b and let Ollama handle the offloading. It’ll work, just not at full GPU speed. Or go with qwen3.5:35b if you want maximum speed and can accept a smaller model. What’s your priority — raw capability or speed?

