GPU Offload and Context Scaling Results | Generated by AI


Comprehensive Summary

Setup Journey

Starting point (from your info):


Step 1 — GPU Layer Increase

Bumped from ngl 20 → ngl 35 (already done when session started), then pushed to full offload:

ngl   VRAM used    Layers on GPU   CPU layers
20    ~2,695 MiB   20 / 43         23
35    ~5,030 MiB   35 / 43          8
43    ~5,182 MiB   43 / 43          0

Going from ngl 35 to 43 (full offload) cost only +152 MiB, essentially free. With all 43 layers resident on the GPU, no layer weights need to move between CPU and GPU during inference.
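The per-layer cost can be sanity-checked from the table. A quick sketch of the arithmetic (the MiB figures are the measurements above; nothing else is assumed):

```python
# VRAM measurements from the ngl sweep: {ngl: MiB used}
vram = {20: 2695, 35: 5030, 43: 5182}

def mib_per_layer(lo: int, hi: int) -> float:
    """Average VRAM cost of each additional offloaded layer."""
    return (vram[hi] - vram[lo]) / (hi - lo)

print(round(mib_per_layer(20, 35), 1))  # -> 155.7 MiB per layer for layers 21-35
print(round(mib_per_layer(35, 43), 1))  # -> 19.0 MiB per layer for layers 36-43
```

The last eight layers were far cheaper to host than the middle ones, which is why the final jump to full offload cost only +152 MiB.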


Step 2 — Context Size: 32k → 64k

Restarted with -c 65536. KV cache doubled but VRAM impact was modest:

Context   VRAM (idle)   VRAM (after long prompt)
32k       ~5,030 MiB    ~5,182 MiB
64k       ~6,118 MiB    ~6,646 MiB

Cost of doubling context: +1,088 MiB idle (5,030 → 6,118 MiB) and a further +528 MiB during a 60k-token inference. That still leaves 5 GB+ of VRAM headroom.
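The idle numbers also give a rough per-token KV-cache cost. A back-of-the-envelope sketch using only the measurements above (no model internals assumed):

```python
# Idle VRAM at each context size (MiB), from the table above
idle_32k = 5030
idle_64k = 6118

extra_mib = idle_64k - idle_32k          # cost of the second 32k of context
extra_tokens = 65536 - 32768
kib_per_token = extra_mib * 1024 / extra_tokens

print(extra_mib)                 # 1088 MiB for the extra 32k of context
print(round(kib_per_token, 1))   # ~34 KiB of KV cache per token
```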


Step 3 — Long Prompt Stress Tests

First test (while still at 32k context): a 12,020-token prompt succeeded in 6.9 s

64k context tests:

Prompt tokens   % of 64k ctx   Completion   Total time   Gen speed
20,025          30%            385 tok       7.5 s       51.6 tok/s
40,025          61%            383 tok      14.1 s       27.1 tok/s
60,025          92%            444 tok      20.2 s       22.0 tok/s

Zero crashes across all tests.
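The percentage and slowdown figures are easy to re-derive. A small sketch over the table's numbers:

```python
ctx = 65536
runs = [(20025, 51.6), (40025, 27.1), (60025, 22.0)]  # (prompt tokens, gen tok/s)

base = runs[0][1]
for prompt, speed in runs:
    pct = 100 * prompt / ctx                 # how full the context window was
    change = 100 * (speed / base - 1)        # speed change vs the 20k baseline
    print(f"{prompt:>6} tok  {pct:4.1f}% of ctx  {speed:4.1f} tok/s  ({change:+.0f}%)")
```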


Performance Analysis

Speed vs context depth:

20k prompt → 51.6 tok/s  (baseline)
40k prompt → 27.1 tok/s  (-47% vs 20k) 
60k prompt → 22.0 tok/s  (-57% vs 20k)

This is expected. Prefill attention is O(n²) over the full prompt, and during decoding each new token attends to every token already in the KV cache, so per-token cost grows with context depth and generation slows as the cache fills. Quality did not degrade, though: responses were coherent at all sizes.
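Inverting the speeds makes the trend easier to see: per-token decode time rises roughly in proportion to how many tokens sit in the KV cache. A sketch over the measured values:

```python
# (tokens in KV cache, generation speed in tok/s) from the tests above
runs = [(20025, 51.6), (40025, 27.1), (60025, 22.0)]

for n, tps in runs:
    ms = 1000 / tps  # per-token decode time grows with cached tokens
    print(f"{n:>6} cached tokens -> {ms:5.2f} ms per generated token")
```

It is not perfectly linear because a fixed per-token cost (MLP, sampling) is mixed in with the attention term.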

Prefill throughput (rough estimate from total time):

Prefill is fast because it’s parallelized on GPU; generation is the bottleneck.


Final Server Configuration

# Full GPU offload (-ngl 43) with a 64k context (-c 65536).
# Comments cannot follow a trailing backslash, so flags are annotated here.
llama-server \
  -m gemma-4-E4B-it-Q5_K_M.gguf \
  --flash-attn off \
  -ngl 43 \
  -c 65536 \
  --host 0.0.0.0 \
  --port 8081
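One way to talk to the server above is llama-server's native /completion endpoint (it also serves an OpenAI-compatible /v1/chat/completions). A minimal client sketch; the prompt text is just an example:

```python
import json
import urllib.request

SERVER = "http://localhost:8081"  # matches --host/--port in the config above

def build_payload(prompt: str, n_predict: int = 256) -> dict:
    """Request body for llama-server's native /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict}

def complete(prompt: str) -> str:
    req = urllib.request.Request(
        f"{SERVER}/completion",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]

# Requires the server to be running:
#   print(complete("Summarize the tradeoffs of a 64k context window."))
```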
Resource Usage Headroom

Resource     Current           Headroom
VRAM         ~6,646 MiB used   ~5,600 MiB free (12,282 MiB total)
Context      65,536 tokens     up to 131,072 (model max)
GPU layers   43 / 43           full offload
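Whether the model-max 131,072 context would fit can be roughed out from the measured KV cost (the 32k → 64k idle delta above). This assumes KV memory stays linear in context length and ignores growth in compute buffers, so treat it as a lower bound:

```python
kib_per_token = (6118 - 5030) * 1024 / 32768   # ~34 KiB/token, measured above
extra_tokens = 131072 - 65536                  # tokens added by going to model max
extra_mib = extra_tokens * kib_per_token / 1024

print(round(extra_mib))     # ~2,176 MiB of additional KV cache
print(extra_mib < 5600)     # comfortably inside the ~5,600 MiB free
```

The +528 MiB runtime delta observed at 64k suggests real usage would land somewhat higher, but still within budget.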

Potential next steps if you want more speed:

