nano-vLLM Minimal vLLM Reimplementation | Generated by AI
Here’s what nano-vllm is:
nano-vllm — A Minimal vLLM Re-implementation
It’s a from-scratch, ~1200-line Python reimplementation of vLLM — the high-throughput LLM inference engine. Author: Xingkai Yu. MIT license. Currently targets Qwen3 models.
The goal: show that the core ideas of vLLM (PagedAttention, continuous batching, prefix caching, CUDA graphs, tensor parallelism) can be implemented cleanly in a tiny, readable codebase — and still hit comparable throughput to the real vLLM.
Architecture (6 key components)
1. LLMEngine (engine/llm_engine.py) — The orchestrator
- Owns the Scheduler, ModelRunner, and Tokenizer
- `generate()` loop: add requests → step until done, reporting prefill/decode throughput
- `step()`: scheduler picks seqs → model_runner runs forward → scheduler postprocesses tokens
- Supports tensor parallelism via `torch.multiprocessing.spawn`: rank 0 drives, ranks 1..N run a shared-memory event loop
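The generate/step loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `Seq`, `ToyScheduler`, `ToyEngine`, and `fake_model` are hypothetical names, the "model" is a placeholder function rather than a GPU forward pass, and tokenization, throughput reporting, and tensor parallelism are left out.

```python
from collections import deque


class Seq:
    """One request: prompt token ids plus the tokens generated so far."""

    def __init__(self, prompt, max_tokens):
        self.prompt = list(prompt)
        self.output = []
        self.max_tokens = max_tokens


class ToyScheduler:
    """Hypothetical stand-in that schedules every unfinished sequence each step."""

    def __init__(self):
        self.running = deque()

    def add(self, seq):
        self.running.append(seq)

    def is_finished(self):
        return not self.running

    def schedule(self):
        return list(self.running)

    def postprocess(self, seqs, tokens):
        for seq, token in zip(seqs, tokens):
            seq.output.append(token)
            if len(seq.output) >= seq.max_tokens:
                self.running.remove(seq)  # finished: drop from the running queue


class ToyEngine:
    def __init__(self, run_model):
        self.scheduler = ToyScheduler()
        self.run_model = run_model  # callable: list of Seq -> one token per Seq

    def step(self):
        seqs = self.scheduler.schedule()          # 1. scheduler picks sequences
        tokens = self.run_model(seqs)             # 2. model produces one token each
        self.scheduler.postprocess(seqs, tokens)  # 3. scheduler records the tokens

    def generate(self, prompts, max_tokens):
        seqs = [Seq(p, max_tokens) for p in prompts]
        for seq in seqs:
            self.scheduler.add(seq)
        while not self.scheduler.is_finished():
            self.step()
        return [seq.output for seq in seqs]


def fake_model(seqs):
    # Placeholder for the real forward pass: "predict" the current length.
    return [len(s.prompt) + len(s.output) for s in seqs]


engine = ToyEngine(fake_model)
outputs = engine.generate([[1, 2, 3], [5]], max_tokens=2)
```

The point of the sketch is the division of labour: the engine itself holds no batching logic, it only alternates between asking the scheduler for work and handing results back.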
2. Scheduler (engine/scheduler.py) — Continuous batching
- Two queues: `waiting` (prefill) and `running` (decode)
- Prefill scheduling: respects `max_num_batched_tokens` and `max_num_seqs`, supports chunked prefill for the first seq
- Decode scheduling: evicts (preempts) running sequences when KV cache is full (classic vLLM preemption)
- Postprocess: appends generated tokens, checks EOS/max_tokens, deallocates finished sequences
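A minimal sketch of the two-queue policy above, assuming a simple counter of free KV blocks in place of a real block manager. The class and field names are hypothetical, chunked prefill and EOS checks are omitted, and the eviction order (most recently admitted sequence first) is an assumption for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)  # identity comparison, so deque.remove() targets the exact object
class Seq:
    prompt_len: int
    num_generated: int = 0
    num_blocks: int = 0  # KV cache blocks currently held

    def __len__(self):
        return self.prompt_len + self.num_generated


class ToyScheduler:
    def __init__(self, num_blocks, block_size=4, max_num_seqs=8, max_num_batched_tokens=64):
        self.free_blocks = num_blocks
        self.block_size = block_size
        self.max_num_seqs = max_num_seqs
        self.max_num_batched_tokens = max_num_batched_tokens
        self.waiting = deque()
        self.running = deque()

    def add(self, seq):
        self.waiting.append(seq)

    def preempt(self, seq):
        # Give the blocks back and retry this sequence before newer requests.
        self.free_blocks += seq.num_blocks
        seq.num_blocks = 0
        self.waiting.appendleft(seq)

    def schedule(self):
        scheduled, budget = [], self.max_num_batched_tokens
        # Prefill: admit waiting sequences while token and block budgets allow.
        while self.waiting and len(scheduled) < self.max_num_seqs:
            seq = self.waiting[0]
            need = -(-len(seq) // self.block_size)  # ceiling division
            if len(seq) > budget or need > self.free_blocks:
                break
            self.waiting.popleft()
            self.free_blocks -= need
            seq.num_blocks = need
            budget -= len(seq)
            self.running.append(seq)
            scheduled.append(seq)
        if scheduled:
            return scheduled, True
        # Decode: one token per running sequence, preempting when blocks run out.
        while self.running and len(scheduled) < self.max_num_seqs:
            seq = self.running.popleft()
            needs_block = len(seq) > seq.num_blocks * self.block_size
            while needs_block and self.free_blocks == 0:
                if self.running:
                    self.preempt(self.running.pop())  # evict the newest sequence
                else:
                    self.preempt(seq)                 # nothing else left to evict
                    break
            else:
                if needs_block:
                    self.free_blocks -= 1
                    seq.num_blocks += 1
                scheduled.append(seq)
        self.running.extendleft(reversed(scheduled))
        return scheduled, False

    def postprocess(self, seqs, max_tokens):
        for seq in seqs:
            seq.num_generated += 1
            if seq.num_generated >= max_tokens:
                self.running.remove(seq)
                self.free_blocks += seq.num_blocks
                seq.num_blocks = 0


sched = ToyScheduler(num_blocks=2)
a, b = Seq(prompt_len=4), Seq(prompt_len=4)
sched.add(a)
sched.add(b)
batch, is_prefill = sched.schedule()    # both prompts fit: prefill a and b
sched.postprocess(batch, max_tokens=16)
batch2, is_prefill2 = sched.schedule()  # cache is full: b is preempted so a can grow
```

In the usage at the bottom, two 4-token prompts fill both blocks during prefill. On the next step each sequence needs a second block, so one is pushed back to the waiting queue and will be prefilled again once memory frees up.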
3. BlockManager (engine/block_manager.py) — PagedAttention KV cache
- This is the core vLLM innovation, re-implemented here
- Fixed-size KV cache blocks (default 256 tokens/block) allocated from a pool
- `Block` objects track `ref_count` and `hash` for prefix caching
- `allocate()`: finds cached prefix blocks (hash-matched) and only allocates new blocks for the remainder
- `hash_blocks()`: xxhash-based hashing of token chunks for automatic prefix caching
- Preemption: deallocates blocks, puts seq back in waiting queue
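The hash-matched allocation above can be illustrated with a small, self-contained block pool. This is a sketch under assumptions, not the project's code: `ToyBlockManager` and its methods are hypothetical, `hashlib.sha256` stands in for xxhash so the example needs only the standard library, and the block size is shrunk to 4 tokens for readability. Each full block's hash is chained to the hash of the block before it, so a hit means the whole prefix matches.

```python
import hashlib


class Block:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0
        self.hash = None  # set only for full blocks


class ToyBlockManager:
    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_ids = list(range(num_blocks))
        self.hash_to_id = {}

    @staticmethod
    def chunk_hash(tokens, prefix_hash):
        # Chain in the previous block's hash so equal chunks with different
        # prefixes do not collide.
        h = hashlib.sha256()
        h.update(prefix_hash or b"")
        h.update(",".join(map(str, tokens)).encode())
        return h.digest()

    def allocate(self, token_ids):
        """Return (block_table, num_cached_tokens) for a new sequence."""
        table, cached, prefix, miss = [], 0, None, False
        for start in range(0, len(token_ids), self.block_size):
            chunk = token_ids[start:start + self.block_size]
            full = len(chunk) == self.block_size
            h = self.chunk_hash(chunk, prefix) if full else None
            block_id = self.hash_to_id.get(h) if (full and not miss) else None
            if block_id is not None:
                # Cache hit: share the existing block.
                block = self.blocks[block_id]
                if block.ref_count == 0:
                    self.free_ids.remove(block_id)  # revive a freed block
                cached += self.block_size
            else:
                # Cache miss: take a fresh block; later blocks cannot hit either.
                miss = True
                block_id = self.free_ids.pop(0)
                block = self.blocks[block_id]
                if block.hash is not None and self.hash_to_id.get(block.hash) == block_id:
                    del self.hash_to_id[block.hash]  # old contents are overwritten
                block.hash = h
                if h is not None:
                    self.hash_to_id[h] = block_id
            block.ref_count += 1
            table.append(block_id)
            prefix = h
        return table, cached

    def deallocate(self, table):
        for block_id in reversed(table):
            block = self.blocks[block_id]
            block.ref_count -= 1
            if block.ref_count == 0:
                self.free_ids.append(block_id)  # hash is kept for later reuse


mgr = ToyBlockManager(num_blocks=8)
table_a, cached_a = mgr.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
table_b, cached_b = mgr.allocate([1, 2, 3, 4, 5, 6, 7, 8, 99])
```

The second request shares its first two blocks with the first one and only needs one new block for its tail, which is why reference counts rather than ownership flags are tracked per block.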
4. ModelRunner (engine/model_runner.py) — GPU execution
- Loads Qwen3 model weights via safetensors with packed module mapping (q/k/v → fused qkv_proj, gate/up → fused gate_up_proj)
- Allocates a single contiguous KV cache tensor: `[2, num_layers, num_blocks, block_size, num_kv_heads, head_dim]`
- CUDA Graph capture for decode: pre-captures graphs at batch sizes [1,2,4,8,16,32,…,512] for zero-overhead replay
- Warmup pass to measure peak memory, then calculates how many KV cache blocks fit in remaining GPU memory
- Tensor parallelism via NCCL + SharedMemory IPC
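Two of the steps above are simple arithmetic and can be sketched without a GPU: sizing the KV cache from the memory left after warmup, and choosing which captured graph to replay for a given decode batch. The function names, the exact memory formula, and the sample numbers are assumptions for illustration; the real runner reads these values from the model config and from CUDA memory statistics.

```python
from bisect import bisect_left


def num_kv_blocks(total_bytes, peak_bytes, utilization, num_layers,
                  block_size, num_kv_heads, head_dim, dtype_bytes):
    """How many KV cache blocks fit in the memory budget left after warmup."""
    # Bytes for one block across all layers; the leading 2 covers K and V,
    # matching the [2, num_layers, num_blocks, ...] layout.
    block_bytes = 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes
    budget = int(total_bytes * utilization) - peak_bytes
    return max(budget // block_bytes, 0)


def pick_graph_batch_size(batch_size, captured_sizes):
    """Smallest captured batch size that can hold the batch, or None if too large."""
    i = bisect_left(captured_sizes, batch_size)
    return captured_sizes[i] if i < len(captured_sizes) else None


# Illustrative numbers only: 8 GiB card, 2 GiB peak during warmup, 2-byte dtype.
blocks = num_kv_blocks(
    total_bytes=8 * 2**30, peak_bytes=2 * 2**30, utilization=0.9,
    num_layers=28, block_size=256, num_kv_heads=8, head_dim=128, dtype_bytes=2,
)

# A short hypothetical list of captured sizes; a batch of 3 is padded up to 4.
graph_bs = pick_graph_batch_size(3, [1, 2, 4, 8, 16])
```

Rounding the batch up to a captured size is what makes replay possible: a CUDA graph has fixed tensor shapes, so smaller batches run in a slightly larger graph with the unused rows ignored.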
5. Attention (layers/attention.py) — FlashAttention + Triton KV store
- Triton kernel `store_kvcache_kernel`: writes K/V into the paged cache using slot_mapping (no Python loop)
- Prefill: `flash_attn_varlen_func` with variable-length sequences and optional block_table for prefix cache
- Decode: `flash_attn_with_kvcache` with paged KV cache
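The slot_mapping idea used by the store kernel can be shown in plain Python. This is a sketch, not the Triton kernel: the helper names are hypothetical, the cache is a flat list, and the loop here is exactly what the real kernel avoids by writing all slots in parallel on the GPU. A token at position `p` of a sequence lives in block `block_table[p // block_size]` at offset `p % block_size`.

```python
def slot_mapping(block_table, positions, block_size):
    """Map token positions in a sequence to flat slots in the paged cache."""
    return [
        block_table[p // block_size] * block_size + p % block_size
        for p in positions
    ]


def store_kv(cache, slots, values):
    """Scatter per-token values into the cache (the kernel does this in parallel)."""
    for slot, value in zip(slots, values):
        cache[slot] = value


block_size = 4
block_table = [5, 2]  # logical block 0 -> physical block 5, logical 1 -> physical 2
slots = slot_mapping(block_table, range(6), block_size)

cache = [None] * (8 * block_size)  # 8 physical blocks
store_kv(cache, slots, ["t0", "t1", "t2", "t3", "t4", "t5"])
```

Because the mapping is computed once per step, the attention layers never need to know which sequence a token belongs to when writing K/V; they only see a list of destination slots.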
6. Model (models/qwen3.py) — Qwen3ForCausalLM
- Full Qwen3 transformer: QKV parallel linear, RoPE (with `@torch.compile`), QK-norm (RMSNorm on Q/K heads), SiLU gated MLP
- `ParallelLMHead`: vocabulary-parallel output; gathers logits across TP ranks only at rank 0
- Weight loading handles packed modules (fused QKV, fused gate/up)
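Packed-module loading comes down to renaming: a checkpoint stores separate q/k/v and gate/up weights, while the model holds fused parameters, so each checkpoint name is mapped to a fused parameter name plus a shard id saying which slice to fill. The sketch below assumes a mapping table and Hugging Face style weight names; `PACKED_MODULES` and `resolve_weight` are hypothetical names, and the actual copy into the fused tensor is not shown.

```python
# Hypothetical mapping: checkpoint module name -> (fused module name, shard id).
PACKED_MODULES = {
    "q_proj": ("qkv_proj", "q"),
    "k_proj": ("qkv_proj", "k"),
    "v_proj": ("qkv_proj", "v"),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}


def resolve_weight(name):
    """Return (parameter name in the model, shard id or None) for a checkpoint name."""
    parts = name.split(".")
    for i, part in enumerate(parts):
        if part in PACKED_MODULES:
            fused, shard_id = PACKED_MODULES[part]
            parts[i] = fused
            return ".".join(parts), shard_id
    return name, None  # not a packed module: load as-is


fused_name, shard = resolve_weight("model.layers.0.self_attn.k_proj.weight")
plain_name, no_shard = resolve_weight("model.layers.0.self_attn.o_proj.weight")
```

Matching on whole dotted components rather than substrings keeps unrelated names such as `down_proj` from being rewritten by accident.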
Supporting layers
- RotaryEmbedding: precomputed cos/sin cache, `@torch.compile` on forward
- RMSNorm: fused add+norm variant for residual connections, `@torch.compile`
- SiluAndMul: `@torch.compile` on the SiLU gating
- Sampler: `@torch.compile`, Gumbel-max trick for sampling (exponential noise + argmax)
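The sampler's "exponential noise + argmax" trick can be demonstrated with the standard library alone. Dividing each probability by an independent Exp(1) draw and taking the argmax selects index i with probability p_i, so no cumulative sum or multinomial call is needed. This is a pure-Python sketch of the idea, not the compiled PyTorch version; `sample_token` and the small clamp on the noise are assumptions for illustration.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Draw one index from softmax(logits / temperature) via exponential noise."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    # Unnormalized softmax; the normalizing constant does not affect the argmax.
    weights = [math.exp(x - m) for x in scaled]
    # argmax_i p_i / E_i with E_i ~ Exp(1) picks i with probability p_i.
    # The clamp guards against a zero draw.
    scores = [w / max(rng.expovariate(1.0), 1e-10) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
logits = [0.0, math.log(3.0)]  # softmax gives probabilities 0.25 and 0.75
draws = [sample_token(logits, 1.0, rng) for _ in range(20000)]
freq_of_1 = sum(draws) / len(draws)
```

Over many draws the frequency of index 1 should sit close to 0.75. The same formulation vectorizes to one division and one argmax per row of logits, which is part of why it suits a compiled GPU sampler.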
Key design choices
- Greedy sampling is banned (`temperature > 1e-10` enforced), so only stochastic sampling is available
- Block size must be a multiple of 256
- Default 90% GPU memory utilization
- `enforce_eager=True` disables CUDA graphs (for debugging)
Benchmark (from README)
RTX 4070 Laptop, Qwen3-0.6B, 256 sequences, input and output lengths of 100-1024 tokens each:
| Engine | Tokens | Time | Throughput |
|---|---|---|---|
| vLLM | 133,966 | 98.37s | 1361 tok/s |
| nano-vllm | 133,966 | 93.41s | 1434 tok/s |
nano-vllm essentially matches vLLM on this workload and comes out about 5% ahead in throughput (1434 vs. 1361 tok/s), with 100x less code.
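The throughput column follows directly from the other two columns, which makes the comparison easy to check. The snippet below only recomputes figures from the table above; the variable names are arbitrary.

```python
# (total tokens, wall-clock seconds) taken from the benchmark table.
runs = {"vLLM": (133_966, 98.37), "nano-vllm": (133_966, 93.41)}

throughput = {name: tokens / seconds for name, (tokens, seconds) in runs.items()}
speedup = throughput["nano-vllm"] / throughput["vLLM"]

for name, tps in throughput.items():
    print(f"{name}: {tps:.1f} tok/s")
print(f"relative throughput: {speedup:.3f}x")
```

Since both engines produced the same number of tokens, the ratio reduces to the ratio of run times, roughly a 5% advantage for nano-vllm on this single workload and GPU.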
TL;DR
This is a teaching-quality vLLM clone that implements PagedAttention + continuous batching + prefix caching + CUDA graphs + tensor parallelism in ~1200 lines. Great reference for understanding how vLLM actually works under the hood. Currently supports Qwen3 only (hardcoded model class).