vLLM Model Guide for RTX 4070 | Generated by AI
Question: What open-source models can be served with vLLM on an RTX 4070?
Answer:
Your RTX 4070 (12GB VRAM) + vLLM has stricter limits than llama.cpp:
- vLLM must hold the weights + KV cache entirely in GPU memory
- no GGUF support
- prefers AWQ / GPTQ / FP8 / INT4 quantization
- practical size: ≤8B safe, ≤14B borderline, >14B usually won't fit
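The "≤8B safe" rule of thumb follows from simple arithmetic: weights plus KV cache must fit under 12GB. A rough sketch (the layer count, KV-head count, and head dimension below are typical Llama/Qwen-style 7B values, assumed for illustration; real usage adds activation and CUDA-graph overhead on top):

```python
# Back-of-envelope VRAM estimate: vLLM keeps weights + KV cache fully in GPU.

def weight_gib(n_params_b: float, bytes_per_param: float) -> float:
    """Weight footprint in GiB (2.0 bytes/param for fp16, 0.5 for 4-bit AWQ)."""
    return n_params_b * 1e9 * bytes_per_param / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 2**30

# 7B-class model, 4-bit quantized, 8k context (illustrative GQA config:
# 28 layers, 8 KV heads, head_dim 128, fp16 cache)
w = weight_gib(7, 0.5)               # ~3.3 GiB of weights
kv = kv_cache_gib(28, 8, 128, 8192)  # ~0.9 GiB per 8k-token sequence
print(f"weights ~{w:.1f} GiB, KV/seq ~{kv:.2f} GiB")
```

At fp16 the same 7B model would need ~13 GiB for weights alone, which is why quantization is mandatory on this card.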
vLLM supports many architectures including Qwen, Gemma, DeepSeek, Llama, Mistral, etc. (vLLM)
Best models for vLLM + RTX 4070 (12GB)
🟢 Best (fits comfortably)
These run smoothly:
1. Qwen2.5 7B AWQ (recommended)
Qwen/Qwen2.5-7B-Instruct-AWQ
Why:
- strong reasoning
- low VRAM
- fast in vLLM
- production stable
Run:
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
--quantization awq
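Once the server is up, vLLM exposes an OpenAI-compatible API (default `http://localhost:8000/v1`). A minimal client sketch using only the standard library; the prompt and `max_tokens` value are illustrative:

```python
import json
import urllib.request

# vLLM's OpenAI-compatible endpoint (default host/port for `vllm serve`)
URL = "http://localhost:8000/v1/chat/completions"

def build_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_payload("Qwen/Qwen2.5-7B-Instruct-AWQ",
                        "Explain KV caching in one sentence.")
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The `model` field must match the name the server was launched with.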
2. Gemma 2 9B (FP8 / AWQ)
google/gemma-2-9b-it
Note: at float16 the 9B weights alone are ~18GB, so on a 12GB card serve a quantized variant:
vllm serve google/gemma-2-9b-it \
  --quantization fp8
or an AWQ quant (e.g. a community gemma-2-9b-it-awq repo).
3. DeepSeek 7B
deepseek-ai/deepseek-llm-7b-chat
Lightweight and stable.
🟡 Borderline (may fit with tuning)
These might fit with:
- a small context window (4k)
- FP8 KV cache
- --gpu-memory-utilization 0.9
Qwen 14B AWQ
Qwen/Qwen2.5-14B-Instruct-AWQ
But many users report that it does not fully fit on a 12GB GPU in vLLM. (Reddit)
Mistral NeMo 12B
mistralai/Mistral-Nemo-Instruct-2407
May require:
--max-model-len 4096
--gpu-memory-utilization 0.9
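Putting the tuning knobs together, a borderline 14B launch might look like this (flag values are illustrative; `--kv-cache-dtype fp8` stores the cache in 8-bit to roughly halve its footprint):

```shell
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --kv-cache-dtype fp8
```

If the server still fails to allocate, reduce --max-model-len further before touching --gpu-memory-utilization.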
🔴 Not possible on RTX 4070 (vLLM)
These won't fit:
- Qwen 27B
- Gemma 27B
- DeepSeek R1 32B
- Llama 3 70B
- Gemma 26B (your model)
Even quantized to INT4, 27B-class models exceed 12GB before the KV cache. (Reddit)
My recommendation (best overall)
Use this:
Qwen2.5-7B-Instruct-AWQ
Best balance:
- strong reasoning
- tool calling
- fast
- fits 4070
- works with vLLM batching
If you tell me your goal
I'll pick the best model for it:
- coding
- agent / tool calling
- reasoning
- OpenClaw
- API server
Just tell me.