vLLM: Efficient LLM Serving Engine

Invention of vLLM

vLLM (short for “virtual LLM”) is an open-source library for high-throughput, memory-efficient inference and serving of large language models (LLMs). It was created in early 2023 by researchers at UC Berkeley’s Sky Computing Lab, the successor to RISELab. The project began as a specialized inference engine optimized for NVIDIA A100 GPUs and a limited set of models, addressing key challenges in LLM serving such as KV-cache memory fragmentation and low throughput. Its core technique, PagedAttention, manages the attention key-value (KV) cache in fixed-size blocks, analogous to paging in operating-system virtual memory, which largely eliminates fragmentation and wasted GPU memory.
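As a quick illustration (not taken from the sources above), here is a minimal sketch of offline batch inference with vLLM’s Python API. The model name facebook/opt-125m and the sampling settings are placeholder choices, and exact behavior may vary between vLLM versions:

from vllm import LLM, SamplingParams

# A small batch of prompts; PagedAttention stores the KV cache for all
# sequences in fixed-size blocks, so memory is allocated on demand rather
# than pre-reserved per request.
prompts = [
    "The capital of France is",
    "In one sentence, explain what an LLM is:",
]

# Placeholder model and sampling settings, chosen only for illustration.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
llm = LLM(model="facebook/opt-125m")

# generate() schedules and batches the prompts for high throughput.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)

Because the KV cache is allocated in small blocks as sequences grow, many such requests can be batched together without reserving the maximum context length for each one, which is where much of the throughput gain comes from.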

Key early milestones:

The GitHub repository (vllm-project/vllm) was created during the initial development push in the first half of 2023, and the project was officially announced in June 2023 alongside its first open-source release.

The accompanying paper, “Efficient Memory Management for Large Language Model Serving with PagedAttention,” appeared on arXiv in September 2023 and was presented at SOSP 2023.

Rise in Popularity

vLLM gained significant traction in 2024, evolving from a niche research tool into the de facto standard for open-source LLM serving. Its popularity grew rapidly thanks to fast-moving feature additions (e.g., quantization, speculative decoding, multi-modal support), broadened hardware support (NVIDIA and AMD GPUs, Google TPUs, and more), and production adoption by companies such as Amazon (which used it to power Rufus during Prime Day 2024) and LinkedIn.
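For production-style serving, vLLM exposes an OpenAI-compatible HTTP server. The sketch below assumes such a server is already running locally on port 8000 (for example, started with the vllm serve command) and that the served model is Qwen/Qwen2.5-1.5B-Instruct; both are illustrative choices rather than details from the sources above:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
# The API key is unused by default but the client requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Give me one fun fact about GPUs."}],
    max_tokens=64,
)
print(response.choices[0].message.content)

Because the endpoint mirrors the OpenAI API, existing client code can usually be pointed at a vLLM deployment by changing only the base URL and model name.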

Key growth indicators from 2024: by mid-year, the project had roughly 20,000 GitHub stars and was frequently discussed in AI communities for outperforming alternative serving engines in throughput. Later that year, vLLM officially joined the PyTorch ecosystem (see the PyTorch integration announcement below).

Momentum continued into 2025. Today (October 2025), the repository has over 55,000 stars and supports nearly 100 model architectures, making vLLM a cornerstone for scalable AI deployments.

References

arXiv Paper on PagedAttention
vLLM GitHub Repository
vLLM 2024 Retrospective Blog
PyTorch Integration Announcement

