Best Models for 12GB VRAM | Generated by AI

2026.07

Short answer: none of them. Even the most aggressive quant (TQ1_0 at ~158 GB) is about 13× your VRAM and roughly 2× your combined system RAM + VRAM.

The previous reply has it right: your 12 GB RTX 4070 cannot run a 743B-parameter model at any quantization. Period.
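As a rough sanity check, the ratios above can be reproduced with a few lines of shell. This is a minimal sketch: `mem_ratio` is a hypothetical helper name, and the 64 GB of system RAM is an assumption chosen for illustration (substitute your own figure). It only compares file size against memory and ignores KV cache and runtime overhead.

```shell
# Compare a GGUF file size against VRAM alone and against VRAM + system RAM.
# Arguments are in GB: model size, VRAM, system RAM.
mem_ratio() {
  awk -v m="$1" -v v="$2" -v r="$3" 'BEGIN {
    printf "%.1fx VRAM, %.1fx VRAM+RAM\n", m / v, m / (v + r)
  }'
}

# TQ1_0 quant (~158 GB) on a 12 GB GPU with an ASSUMED 64 GB of system RAM
mem_ratio 158 12 64
```

Anything well above 1.0x in the second figure cannot be loaded even with full CPU offload.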

What you can run on that GPU:

The MoE sweet spot for 12 GB VRAM + system RAM offload is models in the 80–110B total / 10–15B active range; smaller MoE models trade some capability for much higher speed. Your best options right now:

Model             Quant       Total params  Active  File size  Est. speed on your hardware
GLM-4.5-Air       UD-Q3_K_XL  106B          12B     ~50 GB     8–15 tok/s with offload
Qwen3-30B-A3B     Q4_K_M      30B           3B      ~18 GB     30–50 tok/s, fits mostly in VRAM
DeepSeek-V3-Lite  Q4_K_M      ~68B          ~8B     ~35 GB     15–25 tok/s with offload

GLM-4.5-Air with --n-cpu-moe (attention and shared-expert weights on the GPU, routed experts in system RAM) is the closest GLM-family model that’ll actually run on your hardware. Qwen3-30B-A3B is smaller but much faster, since most of it stays on the GPU.
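A launch command for that setup might look like the sketch below. It assumes llama.cpp's `llama-server` is installed; `build_llama_cmd` is a hypothetical helper that only prints the command so you can inspect it before running. The model filename and the starting value of 40 for `--n-cpu-moe` are placeholders, not tested settings.

```shell
# Print a llama-server command line for a MoE model with expert offload.
#   $1: path to the GGUF file (first shard, if the model is split)
#   $2: number of layers whose MoE expert weights stay in system RAM
#   $3: context size (optional, defaults to 8192)
build_llama_cmd() {
  model=$1
  n_cpu_moe=$2
  ctx=${3:-8192}
  # -ngl 99 offloads all layers to the GPU; --n-cpu-moe then keeps the
  # expert weights of the first N layers on the CPU.
  printf 'llama-server -m %s -ngl 99 --n-cpu-moe %s -c %s\n' "$model" "$n_cpu_moe" "$ctx"
}

# Placeholder filename: check the real name in ~/models after downloading.
build_llama_cmd "$HOME/models/GLM-4.5-Air-UD-Q3_K_XL.gguf" 40
```

A common approach is to start with a high `--n-cpu-moe` value and lower it step by step until VRAM is nearly full, since every layer of experts moved back to the GPU should help speed.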

So what should you actually download?

# Pick one:
huggingface-cli download unsloth/GLM-4.5-Air-GGUF --include "*UD-Q3_K_XL*" --local-dir ~/models
# or
huggingface-cli download unsloth/Qwen3-30B-A3B-GGUF --include "*Q4_K_M*" --local-dir ~/models

Don’t waste bandwidth on GLM-5.2: even at 1–2 bits it needs datacenter-class GPUs (a 192 GB MI300X, or several 80 GB H100s).

