Fixing Gemma 4 Crashes on RTX 4070 | Generated by AI
Question: How can I fix the llama.cpp server crash after each inference when running Gemma 4 26B (IQ3_S GGUF) on an RTX 4070 with 12GB VRAM?
Answer:
This is a well-known, actively tracked issue. Your build b1-25eec6f32 is a version in which multiple Gemma 4 bugs are confirmed present. Here is a breakdown of the root causes and actionable fixes.
Root Cause Analysis
There are three confirmed bugs in older llama.cpp builds affecting Gemma 4 with your exact setup:
1. SWA (Sliding Window Attention) Type Mismatch Bug
The GGUF file stores gemma4.attention.sliding_window_pattern as a bool[] array, but llama.cpp reads it into std::array<uint32_t, LLAMA_MAX_LAYERS> via get_key_or_arr. Due to the type mismatch between bool (1 byte) and uint32_t (4 bytes), the memory layout is misinterpreted, causing incorrect is_swa() results for most layers. This leads to wrong dimension calculations for n_embd_head_k and n_embd_k_gqa, resulting in tensor shape mismatches. This is likely what causes your clean crash after each inference.
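The misinterpretation is easy to reproduce outside llama.cpp. The sketch below is plain Python, not llama.cpp code, and the layer pattern is made up for illustration: it packs a per-layer bool[] the way GGUF lays it out (one byte per entry) and then reads the same bytes back as little-endian uint32 values.

```python
import struct

# Hypothetical SWA pattern: 5 local (sliding-window) layers, then 1 global
# layer, repeated -- 12 layers total, one byte per bool as stored in GGUF.
pattern_bools = [True, True, True, True, True, False] * 2
raw = bytes(pattern_bools)  # b'\x01\x01\x01\x01\x01\x00...' (12 bytes)

# Reading the same buffer as uint32_t consumes 4 bytes per "layer", so only
# 12 // 4 = 3 values come out, and each one fuses 4 per-layer flags.
as_u32 = struct.unpack(f"<{len(raw) // 4}I", raw)
print(as_u32)  # → (16843009, 16842753, 65793) -- nothing like 0/1 per layer
```

Every value is nonzero, so a naive truthiness check would report every layer as SWA, which is exactly the kind of wrong is_swa() result described above.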
2. Context Shift Broken for Gemma 4
Context shift is not working with Gemma 4 — after filling the context, the server forces a full prompt re-processing instead of sliding, which causes slot failures and crashes, especially relevant with your constrained 512-token context.
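Until the upstream fix lands, the simplest mitigation is to keep prompts under the context limit client-side so the shift path is never triggered. Below is a minimal sketch of that idea; the helper is hypothetical and uses a crude character budget, whereas a real client would count tokens (e.g. via the server's /tokenize endpoint).

```python
def trim_history(messages: list[dict], max_chars: int) -> list[dict]:
    """Keep the system message plus the newest turns that fit a character
    budget, so the prompt never overflows the server's context window."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):          # walk newest-first
        if len(m["content"]) > budget:
            break                     # oldest turns get dropped first
        kept.append(m)
        budget -= len(m["content"])
    return system + kept[::-1]        # restore chronological order
```

With a hard 512-token context this kind of trimming is essentially mandatory, since any overflow hits the broken shift path.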
3. IQ3 Quant CUDA Kernel Issue
IQ3-family GGUFs on CUDA can produce gibberish or crash due to what appears to be a CUDA backend or kernel selection problem for this quant type, not a bad GGUF file itself. A workaround is to rebuild llama.cpp with GGML_CUDA_FORCE_CUBLAS=ON or use the cuBLAS flag during the CUDA build.
Fix Options (Ranked by Effectiveness)
✅ Fix #1 — Update llama.cpp to Latest Build (Most Important)
Your build b1-25eec6f32 is outdated. The SWA bool vs uint32_t fix was merged in recent builds.
cd ~/projects/llama.cpp
git pull origin master
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j$(nproc)
Your RTX 4070 is Ada Lovelace (sm_89), so use -DCMAKE_CUDA_ARCHITECTURES=89.
✅ Fix #2 — Switch to Q4_K_M Quantization
For a 12GB VRAM GPU like the RTX 4070, the recommended variant of Gemma 4 26B-A4B is Q4 quantization — IQ3_S saves disk space but introduces instability with CUDA kernels that Q4_K_M avoids.
huggingface-cli download ggml-org/gemma-4-26B-A4B-it-GGUF \
--include "*Q4_K_M*" \
--local-dir ~/projects/llama.cpp/models/
Q4_K_M is ~13GB on disk but far more stable than IQ3_S on CUDA.
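As a sanity check on those sizes, file size scales roughly as parameter count times effective bits per weight. The figures below are ballpark assumptions, not measurements: ~3.44 bits/weight for IQ3_S (per llama.cpp's quantization table) and ~4.0 for Q4_K_M, chosen so the result lines up with the ~13GB quoted above.

```python
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # bytes = params * bpw / 8; billions of params * bits -> decimal GB
    return params_billions * bits_per_weight / 8

print(f"IQ3_S : ~{quant_size_gb(26, 3.44):.1f} GB")  # ~11.2 GB
print(f"Q4_K_M: ~{quant_size_gb(26, 4.0):.1f} GB")   # ~13.0 GB
```

The ~2GB difference is the disk/VRAM cost you pay for moving off the unstable IQ3 kernel path.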
✅ Fix #3 — Rebuild with cuBLAS Forced (Workaround for IQ3 Crashes)
If you must keep IQ3_S, rebuild with forced cuBLAS to avoid the bad CUDA kernel path:
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_FORCE_CUBLAS=ON \
-DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j$(nproc)
Or at runtime:
GGML_CUDA_FORCE_CUBLAS=1 ./build/bin/llama-server \
-m models/gemma-4-26B-A4B-it-UD-IQ3_S.gguf \
...
✅ Fix #4 — Reduce GPU Layers to Free Compute Buffer Headroom
Your 10.7GB / 12GB usage leaves virtually no room for compute buffers during inference, which can cause OOM crashes. Try dropping 2–4 layers to CPU:
# -ngl 27: reduced from 31, gives ~0.5-1GB of breathing room
# --ctx-size 2048: increase from 512 if VRAM allows after reducing -ngl
# --flash-attn on: required for Gemma 4 SWA layers
./build/bin/llama-server \
-m models/gemma-4-26B-A4B-it-UD-IQ3_S.gguf \
-ngl 27 \
--ctx-size 2048 \
--flash-attn on \
--no-mmap \
--host 0.0.0.0 \
--port 8081 \
--jinja
The -ngl flag is the single most impactful parameter: it controls exactly how many transformer layers run on the GPU versus the CPU. Once reducing -ngl has freed some VRAM headroom, raising the context from 512 to a moderate value like 2048 yields measurable stability improvements.
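A back-of-envelope calculation shows the effect. The 10.7GB and 31-layer figures come from the setup in the question; real per-layer cost is not perfectly uniform (the output layer and embeddings differ), so treat this as a rough guide only.

```python
TOTAL_WEIGHTS_GB = 10.7   # observed VRAM usage at -ngl 31
N_LAYERS = 31             # layers currently offloaded
VRAM_GB = 12.0            # RTX 4070

per_layer_gb = TOTAL_WEIGHTS_GB / N_LAYERS  # ~0.35 GB per layer

def headroom(ngl: int) -> float:
    """Free VRAM (GB) left for KV cache and compute buffers at a given -ngl."""
    return VRAM_GB - ngl * per_layer_gb

for ngl in (31, 29, 27):
    print(f"-ngl {ngl}: ~{headroom(ngl):.1f} GB free")
```

Dropping from 31 to 27 layers roughly doubles the free headroom (from ~1.3GB to ~2.7GB in this estimate), which is what lets the compute buffers and a larger context fit.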
✅ Fix #5 — Add KV Cache Quantization to Save VRAM
Add KV cache quantization to reclaim VRAM for compute buffers:
./build/bin/llama-server \
-m models/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
-ngl 31 \
--ctx-size 4096 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--no-mmap \
--port 8081 \
--jinja
q8_0 KV cache roughly halves the KV buffer VRAM compared with the default f16, without significant quality loss.
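For reference, the KV cache footprint is roughly 2 (K and V) × layers × KV heads × head dim × context × bytes per element. The architecture numbers below are placeholders, not Gemma 4's actual config; substitute the values llama.cpp prints at model load. q8_0 is approximated as 1 byte/element (it carries a small scale overhead in practice), and the formula ignores SWA layers, which cache only their sliding window.

```python
def kv_cache_mib(ctx: int, n_layers: int = 62, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: float = 2) -> float:
    """Approximate KV cache size in MiB, assuming full attention everywhere."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**2

print(f"f16  @ 4096: {kv_cache_mib(4096):.0f} MiB")                     # 992 MiB
print(f"q8_0 @ 4096: {kv_cache_mib(4096, bytes_per_elem=1):.0f} MiB")   # 496 MiB
```

With these placeholder numbers, quantizing the cache reclaims roughly half a gigabyte at 4096 context, which is exactly the compute-buffer headroom the OOM crashes are missing.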
Recommended Action Plan (Priority Order)
| Step | Action | Expected Outcome |
|---|---|---|
| 1 | git pull + rebuild llama.cpp | Fixes SWA crash bug |
| 2 | Download Q4_K_M variant | Avoids IQ3 CUDA kernel instability |
| 3 | Use -ngl 27 + --cache-type-k q8_0 | Frees VRAM headroom for compute |
| 4 | Set --ctx-size 4096 | Usable context without OOM |
| 5 | Add --flash-attn on | Required for Gemma 4 hybrid SWA layers |
The core problem is a combination of an outdated build with unpatched SWA bugs + IQ3_S CUDA kernel instability on your hardware. Updating the build alone may resolve the crash; switching to Q4_K_M will make it reliable.
References:
- SWA bool/uint32 type mismatch bug — GitHub Issue #21434
- Context shift broken for Gemma 4 — GitHub Issue #21379
- IQ3_XXS CUDA gibberish/crash bug — GitHub Issue #21371
- Gemma 4 26B-A4B crash with litellm — GitHub Issue #21420
- How to Run Gemma 4 with llama.cpp guide