Mastering llama.cpp CLI Options | Generated by AI
Question: Provide a comprehensive introduction to llama.cpp command line options.
Answer:
llama.cpp is a highly efficient C/C++ library for running large language models (LLMs) locally, with several main executables that share most parameters but have different purposes:
- llama-cli — interactive chat / text generation in terminal (most commonly used for testing)
- llama-server — OpenAI-compatible HTTP API server + simple web UI
- llama-bench — performance benchmarking
- others (llama-perplexity, llama-embedding, llama-eval-callback, …)
The vast majority of important options are common to llama-cli and llama-server.
1. Model Loading – Most Important Flags
- -m, --model FNAME: path to the GGUF model file (.gguf)
- -hf, --hf-repo REPO plus --hf-file FILE: download a model directly from Hugging Face (e.g. -hf bartowski/Llama-3.1-8B-Instruct-GGUF --hf-file Llama-3.1-8B-Instruct-Q5_K_M.gguf)
- --mmproj FNAME: path to a multimodal projector (for vision-language models such as LLaVA, BakLLaVA, Obsidian, …)
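Put together, the two loading styles look like this (model path, repo, and file names below are illustrative):

```sh
# Run from a local GGUF file
./llama-cli -m ./models/Llama-3.1-8B-Instruct-Q5_K_M.gguf

# Or pull the file straight from Hugging Face; the download is cached
# locally, so later runs reuse it
./llama-cli -hf bartowski/Llama-3.1-8B-Instruct-GGUF \
  --hf-file Llama-3.1-8B-Instruct-Q5_K_M.gguf
```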
2. Context & KV Cache Control
- -c, --ctx-size N: maximum context length in tokens (commonly 4096 by default; 0 = use the model's trained context length)
- --no-context-shift: disable context shifting when the context fills up (newer models often prefer this)
- --rope-scaling TYPE (e.g. yarn) / --rope-scale N / --rope-freq-base N / --rope-freq-scale N: advanced RoPE scaling (important when running with a context longer than the model's original training length)
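As a sketch, a run that asks for the model's full trained window and refuses to shift would look like this (model path is illustrative):

```sh
# -c 0 requests the model's full trained context; --no-context-shift makes
# the run stop rather than silently dropping old tokens when it fills up
./llama-cli -m model.gguf -c 0 --no-context-shift
```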
3. GPU / Acceleration Layers
- -ngl, --n-gpu-layers N: number of layers to offload to the GPU (the single most important speed flag); a large value such as 99 or 999 offloads everything, 0 is CPU-only
- -fa, --flash-attn: enable Flash Attention (usually faster, with lower VRAM usage, on modern GPUs)
- -sm, --split-mode MODE: how to split the model across multiple GPUs (none / layer / row)
- --main-gpu INDEX / --tensor-split "0,0.5,0.5,…": fine-grained multi-GPU splitting control
- --rpc SERVERS: offload computation to other machines via the RPC backend
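A hedged two-GPU sketch (the split ratio is an example; --tensor-split takes relative proportions, not absolute memory sizes):

```sh
# Offload all layers, enable Flash Attention, and split row-wise across
# two GPUs, weighted roughly 2:1 toward GPU 0
./llama-cli -m model.gguf -ngl 99 -fa \
  --split-mode row --main-gpu 0 --tensor-split "2,1"
```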
4. Performance / Threading
- -t, --threads N: number of CPU threads for generation (default ≈ number of physical cores)
- -tb, --threads-batch N: threads used for prompt processing / batching (often the same as -t, or slightly higher)
- -b, --batch-size N: logical batch size (default 2048; 512–4096 is common)
- -ub, --ubatch-size N: physical micro-batch size (often 512–1024; important for very large prompts)
- -cb, --cont-batching / --no-cont-batching: enable/disable continuous batching (usually faster for server / parallel workloads)
- --no-mmap: disable memory mapping (sometimes needed on certain file systems or under WSL)
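On a CPU-heavy machine the threading and batching flags combine like this (thread counts are illustrative; tune them to your core count):

```sh
# 8 threads for token generation, 16 for prompt ingestion, and a larger
# micro-batch so long prompts are processed in fewer passes
./llama-cli -m model.gguf -t 8 -tb 16 -b 2048 -ub 1024
```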
5. Sampling – Controls creativity & quality
- --temp N: temperature (0.0 = near-deterministic, 0.7–1.2 = creative; default ~0.8)
- --top-k N: keep only the K most likely tokens (20–50 is common)
- --top-p N: cumulative probability threshold (0.9–0.95 is common)
- --min-p N: minimum probability relative to the most likely token (0.05–0.1 is a very popular modern choice)
- --typical N / --mirostat 2 --mirostat-tau 5.0 --mirostat-eta 0.1: advanced samplers (Mirostat v2 remains a solid default; note that --tfs was removed in recent builds)
- --repeat-penalty N / --presence-penalty N / --frequency-penalty N: discourage repetition
- the --dry-* family (--dry-multiplier, --dry-base, --dry-sequence-breaker): DRY, a newer repetition-suppression method
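Two contrasting sampling presets, as a sketch (the values follow the common ranges above, not any official recommendation):

```sh
# Conservative / factual: low temperature plus a min-p floor
./llama-cli -m model.gguf --temp 0.3 --min-p 0.1 --repeat-penalty 1.1

# Creative: higher temperature with top-k / top-p trimming the tail
./llama-cli -m model.gguf --temp 1.1 --top-k 50 --top-p 0.95
```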
6. Prompt & Chat Control (especially llama-cli)
- -cnv, --conversation (usually combined with --color): pretty terminal chat mode
- --no-display-prompt: do not echo the prompt back to the terminal
- --chat-template TEMPLATE / --chat-template-file FILE: choose or override the chat template (chatml, llama3, mistral, gemma, …)
- --simple-io: raw input/output with no fancy terminal formatting
- -r, --reverse-prompt STR / --in-prefix STR / --in-suffix STR: custom stop strings / prompt wrapping
- -p, --prompt TEXT / -f, --file FILE: initial non-interactive prompt
- -n, --n-predict N: number of tokens to generate (-1 = keep generating until a stop condition)
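A minimal non-interactive invocation tying these together (prompt text and template choice are illustrative):

```sh
# Force the ChatML template, feed one prompt, and cap the reply at 256 tokens
./llama-cli -m model.gguf --chat-template chatml \
  -p "Summarize the GGUF file format in one sentence." -n 256
```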
7. llama-server Specific (OpenAI compatible API)
- --host 0.0.0.0 --port 8080: listen address and port
- --path PATH: directory of static files to serve (the OpenAI-compatible API itself is exposed under /v1)
- -np, --parallel N: number of slots, i.e. maximum concurrent requests (--slots additionally exposes the /slots monitoring endpoint)
- --embedding: enable the embedding endpoint
- --reasoning-format deepseek / --jinja: special handling for reasoning models / full Jinja template support
- --api-key KEY / --api-key-file FILE and --ssl-key-file / --ssl-cert-file: authentication & HTTPS
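Once the server is up, any OpenAI-style client can talk to it; a minimal curl sketch (host, port, and payload are illustrative):

```sh
# Start the server (in one terminal)
./llama-server -m model.gguf --host 127.0.0.1 --port 8080

# Query the OpenAI-compatible chat endpoint (in another)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"temperature":0.7}'
```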
8. Quick Reference – Popular 2025–2026 Combinations
Fast local chat (good GPU):
./llama-cli -m model-Q5_K_M.gguf -ngl 99 -fa --temp 0.7 --min-p 0.05 --top-k 40 --top-p 0.9 -c 32768 --color -cnv
Max speed server (RTX 4090 / 5090 class):
./llama-server -m model-Q6_K.gguf --n-gpu-layers 999 --flash-attn --ubatch-size 1024 -c 32768 --temp 0.6 --min-p 0.1 --cont-batching --host 0.0.0.0
Very long context CPU-only:
./llama-cli -m model-Q4_K_M.gguf -ngl 0 -c 131072 --rope-scaling yarn --rope-scale 16 --no-mmap --threads 24
Options evolve quickly — always run ./llama-cli --help or ./llama-server --help for the most up-to-date list on your build.