AMD Adopts SGLang for High-Performance Inference

AMD Adopts SGLang for High-Performance Inference | Generated by AI

Home 2026.05

Question: What is SGLang, and why does it appear on the AMD Developer Cloud?

Answer:

What is SGLang?

SGLang (short for Structured Generation Language) is an open-source, high-performance serving framework for large language models (LLMs) and multimodal models. Developed by UC Berkeley and hosted by the LMSYS organization, it combines a Python-embedded frontend language with a highly optimized backend runtime.

Core Architecture

SGLang is built on two fundamental pillars:

1. Backend — RadixAttention

Unlike traditional inference approaches that waste 60–80% of GPU memory on KV cache allocation, SGLang introduces RadixAttention — a radix tree-based prefix caching system that automatically discovers and reuses shared prefixes across requests. This dramatically improves memory efficiency and throughput.

2. Frontend — Structured Programming Language (DSL)

The flexible frontend language provides intuitive and powerful LLM programming interfaces, supporting chainable generation calls, advanced prompt engineering, complex control flow, multimodal input, parallel execution, and external system interaction.

Key Features

SGLang’s fast runtime provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.

It also has extensive hardware support, running on NVIDIA GPUs (GB200/B300/H100/A100), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.

Model Support

SGLang supports mainstream generative models such as Llama, Gemma, Mistral, Qwen, DeepSeek, and LLaVA, as well as embedding models and reward models, with a simple mechanism for extending to new models.

Industry Scale & Adoption

SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Stanford, UC Berkeley, and Tsinghua University. It is currently running on over 400,000 GPUs worldwide.

Why Does It Appear on AMD Developer Cloud?

AMD has deeply integrated SGLang as a first-class inference framework on their platform:

The AMD Developer Cloud offers Quick Start Packages including SGLang — pre-built Docker containers where Jupyter is launched inside the container to give users direct access to the SGLang AI/ML software stack, with example notebooks included to help you get started quickly.
AMD’s ROCm 7 introduces distributed inference in collaboration with the open-source ecosystem including SGLang, vLLM, and llm-d. AMD embraces an open strategy, co-developing shared interfaces and primitives to enable efficient distributed inference on AMD platforms.
AMD provides SGLang-based Docker containers optimized for the MI300X GPU, which developers can use to serve models like Qwen 3.5 with minimal setup on AMD Developer Cloud droplets.
AMD Developer Cloud gives developers raw access to AMD Instinct GPUs to run inference, training, fine-tuning, or any custom workload using frameworks they already know — including vLLM, SGLang, PyTorch, and more.

Performance Comparison

SGLang delivers 29% higher throughput on H100 GPUs compared to vLLM (16,215 vs 12,553 tokens/second) with lower latency (79ms vs 103ms time-to-first-token). Organizations deploying DeepSeek R1 or V3 achieve 2–3x better throughput on SGLang compared to generic inference engines.

Summary Table

Aspect	Detail
Full Name	Structured Generation Language
Origin	UC Berkeley / LMSYS
Primary Use	LLM & multimodal model inference serving
Key Innovation	RadixAttention (prefix caching)
Hardware	NVIDIA, AMD, Intel, TPU, and more
AMD Integration	Pre-built Docker Quick Start on AMD Developer Cloud
License	Open Source

References:

Back Donate