Fitting Qwen 122B on One H100 | Generated by AI
Honestly? It’s gonna be tight.
The math:
- Qwen 3.5:122b Q4_K_M → ~81 GB
- H100 VRAM → 80 GB
You’re about 1 GB short for the model weights alone, and that’s before overhead (KV cache, context buffer, CUDA kernels). Even with a short context window, you’d need another few GB for the runtime.
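That deficit can be sanity-checked with quick arithmetic. A minimal sketch; the KV-cache architecture values (layer count, KV heads, head dim) are illustrative assumptions, not Qwen's published config:

```python
# Back-of-envelope memory check for the numbers above.
# NOTE: layers/kv_heads/head_dim are assumed values for illustration only.

GB = 1e9

weights = 81 * GB        # Q4_K_M weight file, from the estimate above
vram = 80 * GB           # H100 (80 GB variant)

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
layers, kv_heads, head_dim, fp16 = 80, 8, 128, 2
ctx = 4096               # a modest context window
kv_cache = 2 * layers * kv_heads * head_dim * fp16 * ctx

deficit = weights + kv_cache - vram
print(f"kv cache at {ctx} tokens: {kv_cache / GB:.2f} GB")
print(f"over budget by: {deficit / GB:.2f} GB (before CUDA/runtime overhead)")
```

Even before counting CUDA kernels and runtime buffers, the total lands a couple of GB past the 80 GB line, which is why pure-GPU loading fails.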
Your options:
- Offload some layers to CPU/RAM. You have 233 GB of RAM to spare, and Ollama can split the model across GPU and CPU, but the CPU-offloaded layers will be much slower. Expect decent token rates for generation with maybe 10-15 layers on CPU, but don't expect blazing speed.
- Try a smaller quant. If someone publishes a Q2_K or IQ3 variant, it would fit in VRAM, but quality drops noticeably below Q4.
- Use a smaller model. Something like qwen3.5:35b at Q4_K_M (~20 GB) would run entirely in VRAM and fly on an H100.
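If you go the offloading route, Ollama normally picks the GPU/CPU split automatically, but you can pin it yourself with the `num_gpu` parameter (the number of layers placed on the GPU). A sketch of a Modelfile; the layer counts here are starting guesses to tune, not measured values:

```
# Modelfile: pin most layers to the GPU, spill the rest to system RAM
FROM qwen3.5:122b
PARAMETER num_gpu 75      # layers on GPU -- lower this if you hit OOM
PARAMETER num_ctx 4096    # keep context modest to limit KV-cache growth
```

Build and run it with `ollama create qwen122-split -f Modelfile` followed by `ollama run qwen122-split`, then nudge `num_gpu` up or down until it loads without out-of-memory errors.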
My recommendation: Pull qwen3.5:122b and let Ollama handle the offloading. It’ll work, just not at full GPU speed. Or go with qwen3.5:35b if you want maximum speed with zero compromise. What’s your priority — raw capability or speed?