Optimizing Multi-SLO LLM Serving

Overview

The paper “SLOs-Serve: Optimized Serving of Multi-SLO LLMs” introduces SLOs-Serve, a system for efficiently serving large language models (LLMs) in multi-stage applications where each stage (e.g., prefill for input processing, decode for token generation) and each application (e.g., chatbots, coding assistants) has its own service level objectives (SLOs). These SLOs capture user-facing latency requirements, such as time-to-first-token (TTFT) for prefill and time-per-output-token (TPOT) for decode. Traditional serving systems such as vLLM or Sarathi-Serve prioritize throughput but often violate these fine-grained SLOs when resources are shared, especially under bursts or mixed workloads.
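
As a concrete illustration of these two metrics (not code from the paper), the sketch below computes TTFT and TPOT from per-request timestamps; the record fields and helper names are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    """Hypothetical per-request timing record; field names are illustrative."""
    arrival_s: float        # request enters the serving queue (seconds)
    first_token_s: float    # first output token is emitted
    finish_s: float         # last output token is emitted
    output_tokens: int      # number of tokens produced during decode

def ttft(r: RequestTiming) -> float:
    """Time-to-first-token: queueing plus prefill latency seen by the user."""
    return r.first_token_s - r.arrival_s

def tpot(r: RequestTiming) -> float:
    """Time-per-output-token: average gap between consecutive decode tokens."""
    if r.output_tokens <= 1:
        return 0.0
    return (r.finish_s - r.first_token_s) / (r.output_tokens - 1)

# Example: a 101-token response that starts streaming after 0.4 seconds.
r = RequestTiming(arrival_s=0.0, first_token_s=0.4, finish_s=5.4, output_tokens=101)
print(ttft(r), tpot(r))  # ~0.4 s TTFT, ~0.05 s/token TPOT
```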

Key Challenges and Contributions

The authors identify several challenges in multi-SLO serving: different applications and stages impose different latency targets (summarized in the table below), prefill and decode contend for the same shared resources, and bursty or mixed workloads push throughput-oriented schedulers past their latency targets.

SLOs-Serve’s contributions include an SLO-customized scheduler that allocates shared resources per stage and per application to meet these targets, resilience during traffic bursts, and 2x+ efficiency gains over throughput-oriented baselines.

| Application | Prefill SLO | Decode SLO | Example |
|---|---|---|---|
| Summarization | Tight (e.g., 3x slowdown max) | Loose (100 ms TPOT) | Document processing |
| Coding | Loose | Tight (50 ms TPOT) | Code generation |
| Chatbot | Loose | Loose | Interactive queries |
| Tool-calling | Tight (loops) | Tight (loops), loose (final) | Agentic workflows |
| Reasoning | Tight (thinking) | Tight (thinking), loose (response) | Chain-of-thought |
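
One way to make these per-application targets machine-readable is to encode them as scheduler inputs. The snippet below is a minimal sketch of such a configuration, assuming a hypothetical `SLOProfile` structure (not the paper's API); the numeric values mirror the table above, and `None` stands in for the “loose” (unspecified) entries.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SLOProfile:
    """Per-application SLO targets; illustrative structure only."""
    max_prefill_slowdown: Optional[float]  # prefill SLO: max slowdown vs. running alone
    max_tpot_ms: Optional[float]           # decode SLO: max time per output token (ms)

# Values mirror the table above; None marks a loose/unspecified target.
SLO_PROFILES = {
    "summarization": SLOProfile(max_prefill_slowdown=3.0, max_tpot_ms=100.0),
    "coding":        SLOProfile(max_prefill_slowdown=None, max_tpot_ms=50.0),
    "chatbot":       SLOProfile(max_prefill_slowdown=None, max_tpot_ms=None),
}
```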

System Design

The design explores the core trade-off of batched serving: larger batches increase throughput but risk violating per-request latency SLOs (the paper visualizes this as SLO-feasible regions).
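
As a simplified illustration of this trade-off (not SLOs-Serve's actual algorithm), the sketch below splits a fixed per-iteration token budget: decode requests get one token each first, so their TPOT targets are not starved, and the remaining budget is spent on chunked prefill for waiting requests, which bounds how long they wait for a first token. The data structures and the fixed budget are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PendingRequest:
    """Illustrative request state; not the paper's data structures."""
    request_id: int
    remaining_prefill_tokens: int  # prompt tokens not yet processed
    in_decode: bool                # True once the request is generating output

def plan_iteration(pending: List[PendingRequest], token_budget: int) -> Dict[str, object]:
    """Greedy, SLO-aware split of one iteration's token budget.

    Decode requests come first (one token each) to protect TPOT; the rest of
    the budget goes to chunked prefill, which limits TTFT for queued requests.
    """
    decode_ids: List[int] = []
    prefill_chunks: Dict[int, int] = {}
    budget = token_budget

    # 1. Reserve one token per decoding request so generation keeps advancing.
    for r in pending:
        if r.in_decode and budget > 0:
            decode_ids.append(r.request_id)
            budget -= 1

    # 2. Spend the remaining budget on prefill chunks for waiting requests.
    for r in pending:
        if not r.in_decode and budget > 0:
            chunk = min(r.remaining_prefill_tokens, budget)
            prefill_chunks[r.request_id] = chunk
            budget -= chunk

    return {"decode": decode_ids, "prefill_chunks": prefill_chunks}

# Example: two decoding requests plus one new 1,000-token prompt, 512-token budget.
pending = [
    PendingRequest(1, 0, True),
    PendingRequest(2, 0, True),
    PendingRequest(3, 1000, False),
]
print(plan_iteration(pending, token_budget=512))
# {'decode': [1, 2], 'prefill_chunks': {3: 510}}
```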

Evaluation

SLOs-Serve was evaluated on six scenarios (chatbot, coding, summarization, mixed, tool-calling, and reasoning) using real traces (Azure LLM workloads) and datasets (ShareGPT, HumanEval, etc.). Models: OPT-7B/13B/30B and ToolLlama-7B. Hardware: 4×A100 GPUs for the main experiments, scaling up to 8×H100.

Figures show SLOs-Serve maintains p99 latencies near targets during spikes, while baselines degrade.
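
To make the reported metric concrete, the sketch below shows one common way to compute a p99 latency and the fraction of requests meeting an SLO from per-request measurements; it is an illustration of the methodology, not code or data from the paper.

```python
import math
from typing import List

def p99(values: List[float]) -> float:
    """99th-percentile latency via the nearest-rank method."""
    ordered = sorted(values)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def slo_attainment(latencies_ms: List[float], slo_ms: float) -> float:
    """Fraction of requests whose latency meets the SLO target."""
    return sum(l <= slo_ms for l in latencies_ms) / len(latencies_ms)

# Hypothetical TTFT samples (ms) collected during a burst, against a 500 ms SLO.
ttft_ms = [120.0, 310.0, 480.0, 505.0, 150.0]
print(p99(ttft_ms))                         # 505.0
print(slo_attainment(ttft_ms, slo_ms=500))  # 0.8
```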

Conclusions

SLOs-Serve advances LLM serving by meeting multiple per-stage SLOs in diverse, bursty environments while unlocking 2x+ efficiency gains. Future work could extend the approach to unknown decode lengths or additional stages. The system demonstrates that SLO-customized scheduling is key to production-grade LLM deployments.
