Fixing Gemma 4 Crashes on RTX 4070 | Generated by AI


Question: How can I fix the llama.cpp server crash after each inference when running Gemma 4 26B (IQ3_S GGUF) on an RTX 4070 with 12GB VRAM?

Answer:

This is a well-known, actively tracked issue. Your build b1-25eec6f32 predates the fixes for several confirmed Gemma 4 bugs. Here is a breakdown of the root causes and actionable fixes.


Root Cause Analysis

There are three confirmed bugs in older llama.cpp builds affecting Gemma 4 with your exact setup:

1. SWA (Sliding Window Attention) Type Mismatch Bug

The GGUF file stores gemma4.attention.sliding_window_pattern as a bool[] array, but llama.cpp reads it into std::array<uint32_t, LLAMA_MAX_LAYERS> via get_key_or_arr. Due to the type mismatch between bool (1 byte) and uint32_t (4 bytes), the memory layout is misinterpreted, causing incorrect is_swa() results for most layers. This leads to wrong dimension calculations for n_embd_head_k and n_embd_k_gqa, resulting in tensor shape mismatches. This is likely what causes your clean crash after each inference.
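The byte-level effect can be sketched in plain shell (an illustration of the misread, not llama.cpp's actual code): the same four bytes mean entirely different things depending on whether they are read as four 1-byte bools or as one 4-byte integer.

```shell
# Four 1-byte bool values [1,0,1,0], as a bool[] array would store them
printf '\x01\x00\x01\x00' > /tmp/swa_pattern.bin

# Correct view: four separate unsigned bytes
od -An -tu1 /tmp/swa_pattern.bin    # -> 1 0 1 0

# Buggy view: the same bytes read as one little-endian uint32
od -An -tu4 /tmp/swa_pattern.bin    # -> 65537 (0x00010001)
```

A per-layer on/off pattern silently becomes one large integer, so most layers' is_swa() answers are wrong.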

2. Context Shift Broken for Gemma 4

Context shift is not working with Gemma 4 — after filling the context, the server forces a full prompt re-processing instead of sliding, which causes slot failures and crashes, especially relevant with your constrained 512-token context.

3. IQ3 Quant CUDA Kernel Issue

IQ3-family GGUFs on CUDA can produce gibberish or crash due to what appears to be a CUDA backend or kernel selection problem for this quant type, not a bad GGUF file itself. A workaround is to rebuild llama.cpp with GGML_CUDA_FORCE_CUBLAS=ON or use the cuBLAS flag during the CUDA build.


Fix Options (Ranked by Effectiveness)

✅ Fix #1 — Update llama.cpp to Latest Build (Most Important)

Your build b1-25eec6f32 is outdated. The SWA bool vs uint32_t fix was merged in recent builds.

cd ~/projects/llama.cpp
git pull origin master
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j$(nproc)

Your RTX 4070 is Ada Lovelace = sm_89, hence -DCMAKE_CUDA_ARCHITECTURES=89. After rebuilding, run ./build/bin/llama-server --version to confirm the binary reports a build newer than b1-25eec6f32 before retesting.


✅ Fix #2 — Switch to Q4_K_M Quantization

For a 12GB VRAM GPU like the RTX 4070, the recommended variant of Gemma 4 26B-A4B is Q4 quantization — IQ3_S saves disk space but introduces instability with CUDA kernels that Q4_K_M avoids.

huggingface-cli download ggml-org/gemma-4-26B-A4B-it-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ~/projects/llama.cpp/models/

Q4_K_M is ~13GB on disk but far more stable than IQ3_S on CUDA.


✅ Fix #3 — Rebuild with cuBLAS Forced (Workaround for IQ3 Crashes)

If you must keep IQ3_S, rebuild with forced cuBLAS to avoid the bad CUDA kernel path:

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j$(nproc)

Or at runtime:

GGML_CUDA_FORCE_CUBLAS=1 ./build/bin/llama-server \
  -m models/gemma-4-26B-A4B-it-UD-IQ3_S.gguf \
  ...

✅ Fix #4 — Reduce GPU Layers to Free Compute Buffer Headroom

Your 10.7GB / 12GB usage leaves virtually no room for compute buffers during inference, which can cause OOM crashes. Try dropping 2–4 layers to CPU:

# -ngl reduced from 31 to 27: gives ~0.5-1GB of breathing room
# --ctx-size raised from 512 to 2048 once reducing ngl frees VRAM
# --flash-attn on is required for Gemma 4 SWA layers
./build/bin/llama-server \
  -m models/gemma-4-26B-A4B-it-UD-IQ3_S.gguf \
  -ngl 27 \
  --ctx-size 2048 \
  --flash-attn on \
  --no-mmap \
  --host 0.0.0.0 \
  --port 8081 \
  --jinja

The -ngl flag is the single most impactful parameter: it controls exactly how many transformer layers run on GPU vs CPU. Raising the context from 512 to a moderate value like 2048 after freeing VRAM headroom yields measurable stability improvements.
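As a rough way to size the offload (a back-of-envelope sketch with assumed figures, not measured values): divide the model's on-disk size by its layer count to approximate per-layer VRAM, then multiply by the number of layers moved to CPU.

```shell
# Back-of-envelope: how much VRAM does dropping 4 layers free?
# (11264 MiB model size and 31 layers are assumed figures for illustration)
MODEL_MIB=11264
N_LAYERS=31
OFFLOADED=4
PER_LAYER=$((MODEL_MIB / N_LAYERS))
FREED=$((PER_LAYER * OFFLOADED))
echo "per-layer ~${PER_LAYER} MiB; offloading ${OFFLOADED} layers frees ~${FREED} MiB"
```

With these assumed numbers, each layer costs roughly 363 MiB, so offloading 4 layers frees about 1.4 GiB, enough for compute buffers at a modest context size.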


✅ Fix #5 — Add KV Cache Quantization to Save VRAM

Add KV cache quantization to reclaim VRAM for compute buffers:

./build/bin/llama-server \
  -m models/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  -ngl 31 \
  --ctx-size 4096 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  --port 8081 \
  --jinja

q8_0 KV cache halves the KV buffer VRAM without significant quality loss.
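The savings are easy to estimate (a sketch with assumed dimensions; Gemma 4's real per-layer KV width may differ): KV bytes = 2 (K and V) * n_layers * n_ctx * kv_width * bytes per element, where f16 uses 2 bytes per element and q8_0 roughly 1 byte (1.0625 exactly).

```shell
# KV cache estimate at 4096 context (assumed dims for illustration)
N_LAYERS=31
N_CTX=4096
KV_WIDTH=1024          # assumed n_head_kv * head_dim per layer
F16_BYTES=$((2 * N_LAYERS * N_CTX * KV_WIDTH * 2))
Q8_BYTES=$((2 * N_LAYERS * N_CTX * KV_WIDTH * 1))   # q8_0 is ~1.06 B/elem; 1 B shown for simplicity
echo "f16 KV: $((F16_BYTES / 1048576)) MiB; q8_0 KV: ~$((Q8_BYTES / 1048576)) MiB"
```

Under these assumptions the f16 cache is ~496 MiB and the q8_0 cache ~248 MiB, which is where the reclaimed compute-buffer headroom comes from.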


Recommended order of operations (step, action, expected outcome):

1. git pull + rebuild llama.cpp: fixes the SWA crash bug
2. Download the Q4_K_M variant: avoids IQ3 CUDA kernel instability
3. Use -ngl 27 + --cache-type-k q8_0: frees VRAM headroom for compute buffers
4. Set --ctx-size 4096: usable context without OOM
5. Add --flash-attn on: required for Gemma 4's hybrid SWA layers

The core problem is a combination of an outdated build with unpatched SWA bugs + IQ3_S CUDA kernel instability on your hardware. Updating the build alone may resolve the crash; switching to Q4_K_M will make it reliable.
