AMD Adopts SGLang for High-Performance Inference | Generated by AI

Home 2026.05

Question: What is SGLang, and why does it appear on the AMD Developer Cloud?

Answer:

What is SGLang?

SGLang (short for Structured Generation Language) is an open-source, high-performance serving framework for large language models (LLMs) and multimodal models. Developed by UC Berkeley and hosted by the LMSYS organization, it combines a Python-embedded frontend language with a highly optimized backend runtime.


Core Architecture

SGLang is built on two fundamental pillars:

1. Backend — RadixAttention

Unlike traditional inference approaches that waste 60–80% of GPU memory on KV cache allocation, SGLang introduces RadixAttention — a radix tree-based prefix caching system that automatically discovers and reuses shared prefixes across requests. This dramatically improves memory efficiency and throughput.

2. Frontend — Structured Programming Language (DSL)

The flexible frontend language provides intuitive and powerful LLM programming interfaces, supporting chainable generation calls, advanced prompt engineering, complex control flow, multimodal input, parallel execution, and external system interaction.


Key Features

SGLang’s fast runtime provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.

It also has extensive hardware support, running on NVIDIA GPUs (GB200/B300/H100/A100), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.


Model Support

SGLang supports mainstream generative models such as Llama, Gemma, Mistral, Qwen, DeepSeek, and LLaVA, as well as embedding models and reward models, with a simple mechanism for extending to new models.


Industry Scale & Adoption

SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Stanford, UC Berkeley, and Tsinghua University. It is currently running on over 400,000 GPUs worldwide.


Why Does It Appear on AMD Developer Cloud?

AMD has deeply integrated SGLang as a first-class inference framework on their platform:


Performance Comparison

SGLang delivers 29% higher throughput on H100 GPUs compared to vLLM (16,215 vs 12,553 tokens/second) with lower latency (79ms vs 103ms time-to-first-token). Organizations deploying DeepSeek R1 or V3 achieve 2–3x better throughput on SGLang compared to generic inference engines.


Summary Table

Aspect Detail
Full Name Structured Generation Language
Origin UC Berkeley / LMSYS
Primary Use LLM & multimodal model inference serving
Key Innovation RadixAttention (prefix caching)
Hardware NVIDIA, AMD, Intel, TPU, and more
AMD Integration Pre-built Docker Quick Start on AMD Developer Cloud
License Open Source

References:


Back Donate