LLM Prefill Benchmark Explanation

Generated by AI

2026.07

Question: What does “prefill” mean in the context of LLM inference benchmarks like “LLM prefill (8192x8192): 1,632 TFLOPS”?

Answer:

Prefill is the first phase of autoregressive inference — processing the entire input prompt in one forward pass before generation starts. It’s distinct from decode (generating tokens one at a time after prefill).

Why 8192x8192 specifically

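The label most naturally reads as seq_len x seq_len: an 8192-token prompt, for which each attention head’s score matrix is 8192×8192. On that reading, the benchmark measures throughput for a single forward pass over an 8192-token sequence. A bare “8192x8192” could also denote a matrix-multiply shape, so the benchmark’s own documentation is the place to confirm it.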
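To get a feel for that size, here is a minimal sketch of how large one head's score matrix is for this prompt length. The 2-byte element size is an assumption (BF16 or FP16 scores), and fused attention kernels typically avoid materializing the full matrix, so treat this as a scale check rather than a real memory footprint.

```python
seq_len = 8192
bytes_per_elem = 2  # assumption: scores stored in BF16/FP16

# One attention head, one sequence: a seq_len x seq_len score matrix.
elements = seq_len * seq_len
mib = elements * bytes_per_elem / 2**20

print(f"{elements:,} scores per head, {mib:.0f} MiB if fully materialized")
# prints: 67,108,864 scores per head, 128 MiB if fully materialized
```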

Prefill vs Decode — the key distinction you already care about (KV cache work)

Prefill:
  - Input: all N prompt tokens at once
  - Compute: N tokens processed in parallel (one matmul pass)
  - Bottleneck: COMPUTE-bound (large matmuls, high arithmetic intensity)
  - Output: KV cache for all N tokens + first generated token
  - This is why it's measured in TFLOPS: you're compute-bound,
    so FLOP/s is the right metric

Decode:
  - Input: 1 new token at a time (using cached K,V from prefill)
  - Compute: batch=1 (or small batch) matvec, not matmul
  - Bottleneck: MEMORY-BANDWIDTH-bound (loading weights + KV cache
    dominates, arithmetic intensity is low)
  - This is why decode is usually measured in tokens/sec, not TFLOPS
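The compute-bound versus bandwidth-bound split can be made concrete with a rough arithmetic-intensity estimate (FLOP per byte moved) for a single weight matrix. This is a sketch under simplifying assumptions: one square 4096x4096 projection, 2-byte elements, each operand read or written exactly once, and no KV-cache traffic. The function name and shapes are illustrative, not taken from any benchmark.

```python
def arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOP per byte for y = x @ W, with x of shape (n_tokens, d_in)."""
    flop = 2 * n_tokens * d_in * d_out  # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (
        d_in * d_out          # weights W
        + n_tokens * d_in     # input activations x
        + n_tokens * d_out    # output activations y
    )
    return flop / bytes_moved

prefill = arithmetic_intensity(n_tokens=8192, d_in=4096, d_out=4096)
decode = arithmetic_intensity(n_tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill:.1f} FLOP/byte")  # prints: prefill: 1638.4 FLOP/byte
print(f"decode:  {decode:.2f} FLOP/byte")   # prints: decode:  1.00 FLOP/byte
```

Comparing these values with a GPU's ratio of peak FLOP/s to memory bandwidth indicates which side of the roofline each phase sits on: prefill reuses each weight across thousands of tokens, while decode loads the whole matrix to do about one FLOP per byte.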

Why the distinction matters practically

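For an N-token prompt, self-attention costs O(N²·d) for QK^T and softmax·V, while the projections and FFN cost O(N·d²). The attention term is why prefill cost grows quadratically once prompts get long (roughly when N is comparable to or larger than d), and the large matmuls throughout are where you actually saturate the GPU’s compute.

One caveat on the number itself: MI300X’s peak is 1,307 TFLOPS BF16 dense, and you’re seeing 1,632 TFLOPS, which implies MFU above 100%. That’s suspicious unless it’s counting sparse/structured FLOPS, using a lower-precision format, running on different hardware, or quoting a vendor peak-marketing number, so it’s worth checking the source.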
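The utilization check is a one-line division. The sketch below uses only the two figures already quoted here and assumes both refer to the same hardware and precision, which is exactly the assumption in doubt.

```python
reported_tflops = 1632.0  # figure from the question
peak_tflops = 1307.0      # MI300X dense BF16 peak cited above

ratio = reported_tflops / peak_tflops
print(f"{ratio:.1%} of peak")  # prints: 124.9% of peak

if ratio > 1.0:
    print("Above 100%: the two numbers are probably not directly comparable.")
```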

Rough FLOP-count estimate for prefill of one 8192-token sequence on a model with d_model and n_layers (ignoring embeddings, the LM head, and any savings from causal masking):

def prefill_flops(seq_len, d_model, n_layers, d_ff_mult=4):
    """Approximate total FLOPs for one prefill pass of a dense transformer."""
    # Per-layer multiply-accumulate (MAC) counts; converted to FLOPs at the end.
    attn_proj = 4 * seq_len * d_model**2       # Q, K, V, O projections
    attn_scores = 2 * seq_len**2 * d_model     # QK^T and softmax*V
    d_ff = d_ff_mult * d_model
    ffn = 2 * seq_len * d_model * d_ff         # up + down projections
    per_layer = attn_proj + attn_scores + ffn
    return per_layer * n_layers * 2            # 1 MAC = 2 FLOPs (multiply + add)

# e.g. a 7B-ish model, seq_len=8192
flops = prefill_flops(seq_len=8192, d_model=4096, n_layers=32)
print(f"{flops/1e12:.1f} TFLOPs total for one prefill pass")

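Given a GPU’s peak throughput in FLOP/s, flops / peak_flops gives you a lower bound on prefill latency (convert TFLOPS to FLOP/s by multiplying by 1e12). Real runs land above that bound because achieved utilization is below peak. This is what you’d use to reason about your MI300X’s expected time-to-first-token when you’re benchmarking your own GPT-2 760M or larger training/inference runs.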
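As a sketch of that division with the units made explicit: the helper name, the 100 TFLOP workload, and the 50% utilization figure below are all illustrative assumptions, not measurements. Only the 1,307 TFLOPS peak comes from the discussion above.

```python
def min_prefill_latency_s(total_flop, peak_tflops, utilization=1.0):
    """Latency in seconds for total_flop at a given fraction of peak throughput."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    achieved_flop_per_s = peak_tflops * 1e12 * utilization  # TFLOPS -> FLOP/s
    return total_flop / achieved_flop_per_s

total_flop = 100e12  # hypothetical prefill cost: 100 TFLOP

ideal = min_prefill_latency_s(total_flop, peak_tflops=1307)
at_half = min_prefill_latency_s(total_flop, peak_tflops=1307, utilization=0.5)

print(f"{ideal * 1e3:.1f} ms at peak, {at_half * 1e3:.1f} ms at 50% utilization")
# prints: 76.5 ms at peak, 153.0 ms at 50% utilization
```

Substituting the output of a FLOP estimate for total_flop, and a utilization measured on your own runs, turns this into a rough time-to-first-token prediction.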

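The 1,632 TFLOPS number you quoted is a throughput (a rate), not a fixed cost. Dividing it by the hardware’s peak for the same precision tells you how close that specific 8192×8192 prefill workload got to peak, which is the usual way compute-bound inference performance is reported. Because it exceeds the MI300X dense BF16 peak cited above, confirm which hardware, precision, and FLOP-counting convention the source used before comparing it with your own runs.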

References:

MI300X specifications - AMD
