Llama Server Live on MI300X

Generated by AI

2026.05

Server is live! Here’s the full status:

llama-server: running on port 8080

Model: Qwen3.5-122B-A10B-UD-IQ1_M (MoE, 122B parameters total, ~10B active per token)

File size: 37 GB (IQ1_M quantization)

VRAM: 196,288 MiB total / 194,606 MiB free, plenty of headroom
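As a rough sanity check on the file size, dividing it by the parameter count gives the average storage per weight. This assumes decimal gigabytes (10^9 bytes) and ignores the small amount of GGUF metadata:

$$
\frac{37 \times 10^{9}\,\text{bytes} \times 8\,\text{bits/byte}}{122 \times 10^{9}\,\text{weights}} \approx 2.43\ \text{bits per weight}
$$

That is above IQ1_M's nominal 1.75 bits per weight. One plausible reading, offered as an assumption since the per-tensor types are not listed here, is that the `UD` (Unsloth Dynamic) recipe keeps some tensors at higher precision. Either way, 37 GB of weights fits easily in the 194,606 MiB (about 190 GiB) reported free.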

Performance

| Metric | Throughput |
|---|---|
| Prompt processing | 77.4 tokens/s |
| Token generation | 91.2 tokens/s |
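To reproduce numbers like these from a client, one option is to read the per-request timing block that llama-server attaches to its responses. The sketch below assumes the response JSON contains llama.cpp's non-standard `timings` object with `prompt_n`/`prompt_ms` and `predicted_n`/`predicted_ms` fields; these names come from recent llama.cpp builds and may differ in yours, and `throughput` is a hypothetical helper name.

```python
import json


def throughput(timings: dict) -> dict:
    """Convert llama-server `timings` (token counts + milliseconds) into tokens/s.

    Assumes the llama.cpp field names prompt_n/prompt_ms and
    predicted_n/predicted_ms; a missing or zero-length phase yields 0.0.
    """

    def rate(count_key: str, ms_key: str) -> float:
        count = timings.get(count_key, 0)
        ms = timings.get(ms_key, 0.0)
        # Avoid division by zero when a phase did not run.
        return count / (ms / 1000.0) if ms > 0 else 0.0

    return {
        "prompt_tps": rate("prompt_n", "prompt_ms"),
        "generation_tps": rate("predicted_n", "predicted_ms"),
    }


if __name__ == "__main__":
    # Illustrative values only, not measurements from this server.
    sample = json.loads(
        '{"timings": {"prompt_n": 100, "prompt_ms": 1000.0,'
        ' "predicted_n": 500, "predicted_ms": 2500.0}}'
    )
    print(throughput(sample["timings"]))
```

Prompt tokens are processed in parallel batches, so prompt processing is normally much faster than generation. A prompt figure below the generation figure usually means the test prompt was only a handful of tokens long; a longer prompt gives a more representative number.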

Server Config

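| Setting | Value |
|---|---|
| Context | 32,768 tokens across 4 parallel slots |
| GPU layers | 99 (all layers offloaded to the MI300X) |
| CPU threads | 16 |
| Memory | mlock enabled (model locked in memory, no swapping) |
| API | OpenAI-compatible at `http://165.245.137.138:8080/v1` |
| PID | 292080 |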
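The settings in this table map onto standard llama-server command-line flags. The actual launch command is not shown, so the sketch below is a reconstruction under assumptions: the GGUF filename is a guess, `--host 0.0.0.0` is assumed because the default of 127.0.0.1 would accept only local connections, and `llama_server_argv` is a hypothetical helper that only builds and prints the command.

```python
import shlex

# Hypothetical filename: the status report only gives the model name.
MODEL_PATH = "Qwen3.5-122B-A10B-UD-IQ1_M.gguf"


def llama_server_argv(model_path: str = MODEL_PATH) -> list[str]:
    """Build a llama-server command line mirroring the Server Config table."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", "32768",    # Context: 32,768 tokens
        "--parallel", "4",        # 4 slots
        "--n-gpu-layers", "99",   # offload every layer to the GPU
        "--threads", "16",        # CPU threads
        "--mlock",                # lock model memory so it is never swapped out
        "--host", "0.0.0.0",      # assumption: listen on all interfaces
        "--port", "8080",
    ]


if __name__ == "__main__":
    # Print a shell-ready command; pass the list to subprocess.run() to launch it.
    print(shlex.join(llama_server_argv()))
```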

API Usage

```bash
curl http://165.245.137.138:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen3.5-122B-A10B-UD-IQ1_M",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "max_tokens": 500
  }'
```
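The same request can be sent from Python using only the standard library. This is a sketch: the base URL and model name are copied from the curl example, the helper names `build_request` and `send` are hypothetical, and the request is only sent when the script is run directly.

```python
import json
import urllib.request

BASE_URL = "http://165.245.137.138:8080/v1"
MODEL = "Qwen3.5-122B-A10B-UD-IQ1_M"


def build_request(prompt: str, max_tokens: int = 500) -> urllib.request.Request:
    """Build the same POST /chat/completions request as the curl example."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 300.0) -> dict:
    """Send the request and decode the JSON reply."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    reply = send(build_request("Your prompt here"))
    print(reply["choices"][0]["message"].get("content"))
```

The generous default timeout is a choice for long thinking replies, not a server requirement; lower it if you want fast failures.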

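The model loaded with reasoning (thinking) enabled, which is Qwen3.5's default, so each reply carries the chain of thought in `reasoning_content` before the final answer in `content`. The thinking tokens count toward `max_tokens`, so a limit as low as 500 can cut the answer short. Generating 91 tokens/s with a 122B MoE is a strong result: only about 10B parameters are active per token, and every layer runs on the MI300X.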
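On the client side, the thinking and the answer can be separated as in the sketch below. It assumes the llama-server behaviour described above (`message.reasoning_content` holds the chain of thought, `message.content` the answer). As a fallback it also handles replies where the thinking arrives inline in `content` between `<think>` and `</think>` tags, which is what Qwen-style models emit when the server does not extract reasoning. `split_reply` is a hypothetical helper name.

```python
import re

# Matches the first inline <think>...</think> block, across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reply(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from an OpenAI-style chat completion dict."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    if not reasoning:
        # Fallback: thinking left inline in content as <think>...</think>.
        match = THINK_RE.search(answer)
        if match:
            reasoning = match.group(1)
            answer = THINK_RE.sub("", answer, count=1)
    return reasoning.strip(), answer.strip()
```

Checking `reasoning_content` first and only then parsing tags keeps the helper usable whichever way the server is configured.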
