vLLM: Fast LLM Inference Guide


vLLM is an open-source library designed for fast and memory-efficient inference and serving of large language models (LLMs). Developed initially at the Sky Computing Lab at UC Berkeley, it has grown into a community-driven project with contributions from academia and industry. vLLM addresses key challenges in LLM deployment, such as high latency, memory fragmentation, and low throughput, making it ideal for production environments. It supports seamless integration with Hugging Face models and provides an OpenAI-compatible API for easy adoption.

Key Features

vLLM stands out for its performance and flexibility:

  • PagedAttention, which manages attention key/value cache memory efficiently and minimizes fragmentation
  • Continuous batching of incoming requests to keep the GPU fully utilized
  • Optimized CUDA kernels, including FlashAttention and FlashInfer integration
  • Quantization support (GPTQ, AWQ, INT8, FP8) and speculative decoding
  • Tensor and pipeline parallelism for distributed inference
  • Streaming outputs, prefix caching, and multi-LoRA serving
  • Seamless Hugging Face model integration and an OpenAI-compatible API server
  • Support for NVIDIA GPUs as well as AMD and Intel accelerators, TPUs, and CPUs

These features enable vLLM to achieve state-of-the-art serving throughput while remaining easy to use.

Prerequisites

  • OS: Linux
  • Python: 3.9 – 3.12 (the examples below use 3.12)
  • GPU: for the default CUDA build, an NVIDIA GPU with compute capability 7.0 or higher (e.g., V100, T4, RTX 20xx series, A100, L4, H100); other hardware is covered by platform-specific builds

Installation

vLLM can be installed via pip; using uv (a fast Python package and environment manager) is recommended for the smoothest setup:

  1. Install uv following its documentation.
  2. Create a virtual environment and install vLLM:

    uv venv --python 3.12 --seed
    source .venv/bin/activate
    uv pip install vllm --torch-backend=auto
    
    • --torch-backend=auto auto-selects PyTorch based on your CUDA driver.
    • For specific backends (e.g., CUDA 12.6): --torch-backend=cu126.

Alternatively, use uv run for one-off commands without a permanent environment:

   uv run --with vllm vllm --help

For Conda users:

   conda create -n myenv python=3.12 -y
   conda activate myenv
   pip install --upgrade uv
   uv pip install vllm --torch-backend=auto

For non-NVIDIA setups (e.g., AMD/Intel), refer to the official installation guide for platform-specific instructions, including CPU-only builds.

vLLM auto-selects an attention backend (FLASH_ATTN, FLASHINFER, or XFORMERS); override it with the VLLM_ATTENTION_BACKEND environment variable if needed, as shown below. Note that FlashInfer is not included in the pre-built wheels and must be installed manually.
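
For example, here is a minimal sketch that pins the backend from Python before the engine is created (setting the variable in your shell before launching vLLM works just as well):

import os

# Must be set before the engine is constructed; valid values include
# FLASH_ATTN, FLASHINFER, and XFORMERS.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")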

Quick Start

Offline Batched Inference

Use vLLM to generate text from a list of prompts without running a server. Example script (basic.py):

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # Downloads from Hugging Face by default
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Online Serving (OpenAI-Compatible API)

Launch a server with:

vllm serve Qwen/Qwen2.5-1.5B-Instruct

This starts an OpenAI-compatible server at http://localhost:8000. Customize the address with --host and --port.

Query via curl (completions endpoint):

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

Or chat completions:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Using Python (OpenAI client):

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a"
)
print("Completion result:", completion)

Enable API key auth with --api-key <key> or VLLM_API_KEY.
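
The chat endpoint can be queried the same way from Python. A minimal sketch (if you enabled an API key, pass it in place of "EMPTY"):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print("Chat response:", chat_response.choices[0].message.content)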

Supported Models

vLLM supports a vast array of generative and pooling models, either through native implementations or through the Hugging Face Transformers backend. Key categories include:

  • Decoder-only language models (e.g., Llama, Qwen, Mistral, Gemma, DeepSeek)
  • Mixture-of-experts models (e.g., Mixtral, DeepSeek-V3)
  • Multimodal models, including vision-language and audio models (e.g., LLaVA, Qwen2-VL, Whisper)
  • Pooling models such as embedding, classification, and reward models
  • Encoder-decoder models (e.g., BART)

Full support for most of these includes LoRA adapters, pipeline parallelism (PP), and V1 engine compatibility. For the complete list (over 100 architectures), see the supported models documentation. Custom models can be integrated with minimal changes.
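
As a brief illustration of the LoRA support mentioned above, the offline engine can attach an adapter per request. This is a minimal sketch; the base model and the adapter path ./my-lora-adapter are placeholders rather than part of the original guide:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora=True reserves capacity for serving adapters alongside the base weights
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["Summarize the plot of Hamlet in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
    # adapter name, integer id, and path of the LoRA weights
    lora_request=LoRARequest("my_adapter", 1, "./my-lora-adapter"),
)
print(outputs[0].outputs[0].text)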

Deployment Options

Docker Deployment

Use the official vllm/vllm-openai image for easy serving:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-1.5B-Instruct

Other options include Kubernetes, AWS SageMaker, and direct integration with serving frameworks such as Ray Serve.

Performance Tuning

To optimize throughput and latency:

  • Tune gpu_memory_utilization (default 0.9) to control how much GPU memory is reserved for model weights and KV cache
  • Cap max_model_len when long contexts are not needed, freeing KV-cache space for more concurrent sequences
  • Adjust max_num_seqs and max_num_batched_tokens to trade per-request latency for throughput
  • Use tensor parallelism (tensor_parallel_size) and pipeline parallelism to scale across multiple GPUs
  • Enable prefix caching when many requests share a common prompt prefix
  • Use quantized weights (e.g., FP8, AWQ, GPTQ) to fit larger models or batches into memory

For detailed per-accelerator configurations (NVIDIA, AMD, and Intel GPUs, TPUs, and CPUs), refer to the hardware-specific guides.
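
As a starting point, most of these knobs can be set directly when constructing the offline engine. A minimal sketch (the values are illustrative, not recommendations):

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    gpu_memory_utilization=0.90,   # fraction of GPU memory reserved for weights + KV cache
    max_model_len=8192,            # cap context length to leave more room for the KV cache
    max_num_seqs=256,              # upper bound on sequences batched concurrently
    tensor_parallel_size=1,        # shard the model across this many GPUs
    enable_prefix_caching=True,    # reuse KV cache across requests sharing a prompt prefix
)

The same options are exposed as flags on vllm serve (e.g., --gpu-memory-utilization, --max-model-len, --tensor-parallel-size).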

Advanced Usage

Explore the user guide for details on engine configuration, additional features, and integrations.
