Resolving VRAM Constraints for Gemma | Generated by AI
Here’s the full summary:
What we did:
- Downloaded `gemma-4-26B-A4B-it-UD-IQ3_S.gguf` (10.4 GB) from HuggingFace via proxychains
- Rebuilt llama.cpp from source (latest commit b8-94ca829b6)
- Started llama-server with the Gemma 4 26B model
Key findings:
- The old build had the SWA (Sliding Window Attention) bool/uint32 bug — fixed by rebuilding
- VRAM was the real constraint: with all 31 layers offloaded to the GPU, there was zero headroom left for compute buffers → OOM crashes
- Solution: `-ngl 25` (offload 25 of 31 layers to the GPU, leave 6 on the CPU) plus `-c 2048` context
- This freed ~2 GB of VRAM for compute buffers, making inference stable
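The ~2 GB figure is consistent with simple arithmetic: leaving 6 of 31 layers on the CPU frees roughly 6/31 of the model's weight memory. A minimal sketch of that estimate (the function name and the equal-layer-size assumption are illustrative, not from the session):

```python
# Back-of-envelope estimate of VRAM freed by keeping some layers on the CPU.
# Assumes all layers are roughly equal-sized, which is only approximately
# true in practice (embedding/output tensors differ from block layers).

MODEL_GB = 10.4   # on-disk size of the IQ3_S quant
N_LAYERS = 31     # total offloadable layers

def vram_freed_gb(layers_on_cpu, model_gb=MODEL_GB, n_layers=N_LAYERS):
    """GB of VRAM freed by leaving `layers_on_cpu` layers on the CPU."""
    return layers_on_cpu * model_gb / n_layers

# -ngl 25 keeps 6 layers on the CPU:
print(round(vram_freed_gb(6), 1))  # ~2.0 GB, matching the observed headroom
```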
Running config:
-m ./models/gemma-4-26B-A4B-it-UD-IQ3_S.gguf
-ngl 25
-c 2048
--host 0.0.0.0
--port 8081
--path ./tools/server/public
Performance:
- ~38 tokens/sec generation speed
- 2048 token context
- VRAM: ~9.5GB / 12GB used
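At a steady ~38 tokens/sec, generating a full context's worth of output takes under a minute; a trivial check of the arithmetic (assuming a constant rate and ignoring prompt-processing time):

```python
# Rough generation-time arithmetic at the measured throughput.
# Assumes a constant generation rate; prompt processing is not included.

TOKENS_PER_SEC = 38.0  # measured generation speed from this session

def generation_seconds(n_tokens, rate=TOKENS_PER_SEC):
    """Seconds to generate n_tokens at a steady rate."""
    return n_tokens / rate

# Generating 2048 tokens (one full context) at this rate:
print(round(generation_seconds(2048)))  # ~54 seconds
```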
Firewall issue:
- Port 8081 was blocked by UFW → fixed by adding a rule: `sudo ufw allow 8081/tcp`
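After opening the port, a quick TCP reachability check can confirm the rule took effect. A small standard-library sketch (the helper name is mine; the demo binds its own ephemeral port rather than assuming the server on 8081 is reachable from here):

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a listener we open ourselves; for the real check,
# run port_open("<server-ip>", 8081) from a client machine.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))           # OS picks an ephemeral port
srv.listen(1)
port = srv.getsockname()[1]
print(port_open("127.0.0.1", port))  # True
srv.close()
```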
Current status:
- ✅ Llama-server: running on port 8081 with built-in web UI
- ✅ OpenWebUI: also running on port 11435