RTX 4070 vLLM GGUF Limitation | Generated by AI
Question: Can gemma-4-26B-A4B-it-UD-IQ3_S.gguf be served by vLLM on a local RTX 4070 server?
Answer:
Short answer: No.
You cannot serve gemma-4-26B-A4B-it-UD-IQ3_S.gguf with vLLM on an RTX 4070 (12GB).
There are two separate blockers:
1. vLLM has no practical GGUF support
Your model:
gemma-4-26B-A4B-it-UD-IQ3_S.gguf
is:
- GGUF format
- llama.cpp quantization (IQ3_S)
- Unsloth / llama.cpp ecosystem
But vLLM's supported formats are:
- HuggingFace FP16/BF16 (safetensors)
- AWQ
- GPTQ
- tensor-parallel sharded safetensors weights
vLLM does ship an experimental GGUF loader, but it is single-file only, largely unoptimized, and covers only a few architectures; it is not a dependable way to serve this model.
So even before VRAM becomes an issue, this model is not a realistic vLLM target.
You must use one of:
- llama.cpp
- llama-cpp-python
- Ollama
- OpenWebUI + llama.cpp backend
- OpenClaw (with a llama.cpp backend)
2. RTX 4070 VRAM is too small anyway
RTX 4070:
12GB VRAM
Your model:
Gemma 4 26B IQ3_S
Approx VRAM usage:
| Quant | VRAM needed |
|---|---|
| IQ3_S | ~13–15GB |
| IQ2 | ~10–11GB |
| Q4_K_M | ~18–20GB |
So:
- IQ3_S → too big for the 4070's 12 GB
- CPU offload might work, but it is slow
- and vLLM cannot do GGUF CPU offload anyway
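The IQ3_S row in the table can be sanity-checked with a rule of thumb: the on-disk GGUF size approximates the VRAM needed for weights, plus roughly 2 GB for KV cache and CUDA context. A minimal sketch (all numbers are rule-of-thumb assumptions, not measurements):

```shell
# Rule-of-thumb fit check: GGUF file size ≈ weight VRAM, plus ~2 GB
# for KV cache and CUDA context. All numbers are assumptions.
weights_gb=13        # a ~26B IQ3_S GGUF is roughly 13 GB on disk
overhead_gb=2        # KV cache + CUDA context, rough estimate
needed_gb=$((weights_gb + overhead_gb))
vram_gb=12           # RTX 4070
if [ "$needed_gb" -gt "$vram_gb" ]; then
  echo "does not fit: need ~${needed_gb} GB, have ${vram_gb} GB"
else
  echo "fits"
fi
```

With these assumptions the model needs ~15 GB, which is why it lands in the 13–15 GB row above and over the 4070's 12 GB.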
What WILL work on RTX 4070
Best choices:
Option 1 (recommended)
Use llama.cpp:
llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ3_S.gguf \
  --n-gpu-layers 30 \
  --ctx-size 4096
This:
- loads ~30 layers onto the GPU
- keeps the rest in CPU RAM
- gives usable, though not fast, throughput
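A rough way to pick the `--n-gpu-layers` value is to divide a VRAM budget by the approximate per-layer weight size. A sketch with rule-of-thumb numbers (the layer count and weight size are assumptions for a ~26B IQ3_S model, not measured values):

```shell
# Pick --n-gpu-layers as (VRAM budget) / (per-layer weight size),
# using integer MB arithmetic. All numbers are assumptions.
weights_mb=$((13 * 1024))   # ~13 GB of IQ3_S weights (assumption)
n_layers=48                 # typical depth for a ~26B model (assumption)
budget_mb=10000             # leave ~2 GB of the 12 GB for KV cache/CUDA
per_layer_mb=$((weights_mb / n_layers))
gpu_layers=$((budget_mb / per_layer_mb))
echo "try: --n-gpu-layers ${gpu_layers}"
```

Under these assumptions it lands around 36 layers, in the same ballpark as the 30 used above; tune downward if you hit out-of-memory errors.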
Option 2 (better fit)
Use a smaller quant:
- IQ2_XS
- IQ2_S
These fit fully on the 4070 (see the VRAM table above).
Option 3 (best for vLLM)
Use a non-GGUF model that fits in 12 GB:
Example:
- gemma-2-9b-it (an AWQ or GPTQ build)
- qwen3-14b-awq
(Note: gemma-2-27b-it in AWQ still needs roughly 15 GB and will not fit on a 4070.)
These can run in vLLM.
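The corresponding vLLM launch might look like the sketch below. The model ID is illustrative (any AWQ build that fits in 12 GB works) and the flag values assume a recent vLLM install; adjust to taste:

```shell
# Illustrative vLLM launch for an AWQ model on a 12 GB GPU.
# Model ID and flag values are assumptions, not a verified config.
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

Keeping `--max-model-len` modest limits the KV cache, which matters on a 12 GB card.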
Final Verdict
| Question | Answer |
|---|---|
| Does vLLM support GGUF? | ❌ Not practically (experimental only) |
| Does IQ3_S fit on an RTX 4070? | ❌ No |
| Does llama.cpp work? | ✅ Yes |
| Is CPU offload possible? | ✅ Yes (slow) |
| Best choice | a smaller IQ2 quant or a ~14B model |
If you tell me which stack you're targeting:
- vLLM only?
- OpenClaw?
- Ollama?
- llama.cpp?
I'll suggest the best model for an RTX 4070.