Mastering llama.cpp CLI Options | Generated by AI


Question: Provide a comprehensive introduction to llama.cpp command line options.

Answer:

llama.cpp is a highly efficient C/C++ library for running large language models (LLMs) locally. It ships several main executables that share most parameters but serve different purposes: llama-cli for interactive and one-shot inference, llama-server for an OpenAI-compatible HTTP API, llama-bench for benchmarking, and llama-quantize for converting models to lower-precision formats.

The vast majority of important options are common to llama-cli and llama-server.

1. Model Loading – Most Important Flags
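The key loading flag is -m/--model, which points at a GGUF file. A minimal sketch (paths and model names are illustrative, not from the original):

```shell
# -m / --model selects the GGUF file to load; -p runs a one-shot prompt
./llama-cli -m ./models/llama-3-8b-Q4_K_M.gguf -p "Hello"

# --mlock pins the model in RAM to avoid swapping;
# --no-mmap loads the whole file instead of memory-mapping it
./llama-cli -m ./models/llama-3-8b-Q4_K_M.gguf --mlock
```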

2. Context & KV Cache Control
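The context window is set with -c/--ctx-size, and recent builds can quantize the KV cache to reduce memory. A hedged sketch (cache-type support and exact behavior vary by build; quantized V-cache typically requires flash attention):

```shell
# -c / --ctx-size sets the context window in tokens
# -ctk / -ctv (--cache-type-k / --cache-type-v) quantize the KV cache, e.g. to q8_0
./llama-cli -m model.gguf -c 16384 -fa -ctk q8_0 -ctv q8_0
```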

3. GPU / Acceleration Layers
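GPU offload is controlled by -ngl/--n-gpu-layers; multi-GPU placement has its own flags. A sketch (the 3,1 split ratio is an illustrative assumption):

```shell
# Offload all layers to the GPU (a large value like 99 is a common "everything" shorthand)
./llama-cli -m model.gguf -ngl 99

# Multi-GPU: -ts / --tensor-split divides tensors across cards (here 3:1),
# --main-gpu picks the device for small tensors and scratch buffers
./llama-cli -m model.gguf -ngl 99 --tensor-split 3,1 --main-gpu 0
```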

4. Performance / Threading
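Threading and batching are tuned separately for generation and prompt processing. A sketch (the specific values are illustrative starting points, not recommendations from the original):

```shell
# -t / --threads sets generation threads (physical core count is a common starting point);
# -tb / --threads-batch sets threads for batch/prompt processing
./llama-cli -m model.gguf -t 8 -tb 16

# -b / --batch-size and -ub / --ubatch-size control how prompts are chunked for processing
./llama-cli -m model.gguf -b 2048 -ub 512
```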

5. Sampling – Controls creativity & quality
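The main sampling knobs compose freely; lower temperature means more deterministic output, --min-p prunes tokens below a probability floor, and --seed makes runs reproducible. A sketch with illustrative values:

```shell
# A moderately creative but stable sampling configuration
./llama-cli -m model.gguf \
  --temp 0.7 --top-k 40 --top-p 0.9 --min-p 0.05 \
  --repeat-penalty 1.1 --seed 42
```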

6. Prompt & Chat Control (especially llama-cli)
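Prompts can come from the command line, a file, or an interactive chat session. A sketch (the chatml template name is one of the built-in templates; availability varies by build):

```shell
# One-shot prompt with -p; use -f to read the prompt from a file instead
./llama-cli -m model.gguf -p "Explain KV caching in one paragraph."

# -cnv starts interactive conversation mode;
# --chat-template overrides the template embedded in the model
./llama-cli -m model.gguf -cnv --chat-template chatml
```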

7. llama-server Specific (OpenAI compatible API)
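llama-server exposes an OpenAI-compatible HTTP API. A sketch of serving and querying it (host, port, and slot count are illustrative):

```shell
# Serve on all interfaces; -np / --parallel allows 4 concurrent request slots
./llama-server -m model.gguf --host 0.0.0.0 --port 8080 -np 4 -ngl 99

# Query it like the OpenAI chat completions API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hi"}]}'
```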

Fast local chat (good GPU):

./llama-cli -m model-Q5_K_M.gguf -ngl 99 -fa --temp 0.7 --min-p 0.05 --top-k 40 --top-p 0.9 -c 32768 --color -cnv

Max speed server (RTX 4090 / 5090 class):

./llama-server -m model-Q6_K.gguf --n-gpu-layers 999 -fa --ubatch-size 1024 -c 32768 --temp 0.6 --min-p 0.1 --cont-batching --host 0.0.0.0

Very long context CPU-only:

./llama-cli -m model-Q4_K_M.gguf -ngl 0 -c 131072 --rope-scaling yarn --rope-scale 16 --no-mmap --threads 24

Options evolve quickly — always run ./llama-cli --help or ./llama-server --help for the most up-to-date list on your build.
