Gemma 4B Q5 Performance Review | Generated by AI

2026.04

Journey & Findings: gemma-4-E4B-it-Q5_K_M.gguf on RTX 4070

What We Started With

The Switch

Killed the old process (PID 135703) and launched the new one. The server came up cleanly in ~5 seconds with 4 slots, each allocated a 32768-token context. No issues during model loading.
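The exact launch command wasn't captured; a llama.cpp server invocation along these lines would produce the 4-slot, 32768-tokens-per-slot layout described above (the model path, port, and working directory are assumptions):

```shell
# Sketch of a llama-server launch; path and port are placeholders.
llama-server \
  --model ./gemma-4-E4B-it-Q5_K_M.gguf \
  --ctx-size 131072 \
  --parallel 4 \
  --port 8080
# --ctx-size is the total KV cache; with --parallel 4 it is split
# into 4 slots of 131072 / 4 = 32768 tokens each.
```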

Test 1: Moderate Prompt (~7.3k tokens)

Test 2: Heavy Prompt (~26k tokens)

Test 3: Over-Limit Prompt (~76k tokens)

Key Findings

Metric                   Value
Model size on disk       5.5 GB
Quantization             Q5_K_M (high quality)
RSS under load           ~6.5 GB
Generation speed         ~10.8 tok/s
Prompt eval (cached)     96 ms for 26k cached tokens
Thinking model           Yes (uses reasoning_content field)
Stability                No crashes across all tests
Context limit handling   Graceful 400 error
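The "graceful 400 error" row is worth acting on in client code: an over-limit prompt fails the request but leaves the server up, so a client can treat 400 as a recoverable signal rather than a crash. A minimal sketch, assuming the OpenAI-compatible endpoint llama-server exposes and a hypothetical base URL and `max_tokens` value:

```python
import json
import urllib.error
import urllib.request


def classify_completion_error(status: int) -> str:
    """Map an HTTP status from the completion endpoint to an action.

    A 400 here typically means the prompt exceeded the slot's context
    window (32768 tokens in this setup) and is worth retrying after
    truncating the prompt; other statuses are treated as fatal.
    """
    if status == 400:
        return "truncate-and-retry"
    return "fatal"


def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    # Endpoint shape follows llama-server's OpenAI-compatible API;
    # base_url and max_tokens are assumptions, not captured values.
    payload = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]
    except urllib.error.HTTPError as e:
        action = classify_completion_error(e.code)
        raise RuntimeError(f"completion failed ({e.code}): {action}") from e
```

With this in place, a caller that hits the 76k-token case can catch the error, shrink the prompt, and resend, rather than restarting anything.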

Observations

  1. E4B vs 26B-A4B: The E4B model name suggests this may be a smaller or different MoE configuration than the 26B-A4B you were running before. The Q5_K_M quant is significantly higher quality than IQ2_M: fewer compression artifacts and better output quality, at the cost of a larger file size.

  2. Thinking model behavior: If you’re using this via an API client, you need to account for thinking tokens eating into max_tokens. Set max_tokens higher (e.g., 1024+) to leave room for actual output after reasoning.

  3. Performance is solid: ~10.8 tok/s is usable for interactive chat. The prompt caching is a nice bonus — repeat/similar prompts get near-instant prompt evaluation.

  4. No crash with 32k context: This was the main question, and the answer is yes: long prompts are handled fine, and the server stays stable even when requests exceed the limit.
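Observation 2 above can be made concrete: llama-server returns the chain of thought in a separate `reasoning_content` field on the message, and both it and `content` draw from the same `max_tokens` budget. A small sketch of pulling the two apart (the sample message values are illustrative, not captured output):

```python
def split_thinking(message: dict) -> tuple[str, str]:
    """Separate a thinking model's reasoning from its final answer.

    `reasoning_content` holds the model's internal reasoning and
    `content` the user-facing reply; because both are generated from
    the same max_tokens budget, a generous limit (1024+) is needed to
    leave room for the actual answer.
    """
    return message.get("reasoning_content", ""), message.get("content", "")


# Illustrative response shape for a thinking model:
msg = {
    "role": "assistant",
    "reasoning_content": "The user asked for 2+2, so I add the numbers.",
    "content": "4",
}
thinking, answer = split_thinking(msg)
```

A non-thinking model simply omits the field, in which case `split_thinking` returns an empty reasoning string, so the same client code handles both.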

