Running Qwen3.5-122B on MI300X | Generated by AI

2026.05

Question: Can the model unsloth/Qwen3.5-122B-A10B-MTP-GGUF run on an AMD MI300X GPU with llama.cpp?

Answer:

Let’s break this down carefully with numbers:


1. Model Specs

Estimated memory footprint:

The MI300X has 192 GB of HBM3 VRAM, so in theory a single GPU can hold the entire model in memory.
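
As a rough cross-check of the fit claim, the weight footprint can be estimated from the parameter count and the average bits per weight of the chosen quantization. The sketch below is illustrative only: the 122B total comes from the model name, while the bits-per-weight values are approximate assumptions (real GGUF files mix tensor types, so actual file sizes differ), and the helper names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, the unit VRAM sizes are usually quoted in


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return n_params * bits_per_weight / 8 / GB


def fits_in_vram(footprint_gb: float, vram_gb: float = 192.0) -> bool:
    """True if the weights alone fit; KV cache and buffers are NOT included."""
    return footprint_gb < vram_gb


N_PARAMS = 122e9  # total parameters implied by the model name

# ASSUMPTION: approximate average bits per weight for common GGUF quant types.
ASSUMED_BPW = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}

if __name__ == "__main__":
    for name, bpw in ASSUMED_BPW.items():
        size = weight_footprint_gb(N_PARAMS, bpw)
        print(f"{name}: ~{size:.0f} GB, fits in 192 GB: {fits_in_vram(size)}")
```

For reference, a weight size of ~90–95 GB for 122B parameters corresponds to roughly 5.9–6.2 bits per weight, so a 4-bit quantization would land noticeably lower.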


2. Token Context Considerations

Estimated total VRAM usage:

Model weights (~90–95 GB) + KV cache & buffers (~50 GB) ≈ 140–145 GB
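
The budget above is simple addition, but it is worth making the remaining headroom explicit. A minimal sketch using only the figures quoted in this answer (the function name is hypothetical, and the ~50 GB KV cache and buffer figure is the estimate above, not a measured value):

```python
VRAM_GB = 192.0  # MI300X capacity quoted above


def vram_budget(weights_gb: float, kv_and_buffers_gb: float,
                vram_gb: float = VRAM_GB) -> dict:
    """Sum the two components and report what is left over."""
    total = weights_gb + kv_and_buffers_gb
    return {
        "total_gb": total,
        "headroom_gb": vram_gb - total,
        "fits": total <= vram_gb,
    }


if __name__ == "__main__":
    # Low and high ends of the weight estimate, with ~50 GB KV cache and buffers.
    for weights in (90.0, 95.0):
        print(weights, vram_budget(weights, 50.0))
```

With these inputs the total is 140–145 GB, which leaves roughly 47–52 GB of the 192 GB unused.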

3. Throughput (tg128) Expectations


4. Practical Recommendations

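  1. Use a 4-bit K-quant such as Q4_K_M → reduces VRAM footprint (A10B-MTP in the model name means ~10B active parameters with multi-token prediction; it is not a quantization level)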
  2. Keep context length reasonable (≤16K tokens) for smooth operation
  3. Enable flash attention in llama.cpp → reduces memory pressure, improves speed
  4. Multi-instance / batching → the MI300X has enough spare VRAM to serve 2–3 concurrent shorter sequences, which raises aggregate throughput
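
The recommendations above can be sketched as a `llama-server` command line. The Python helper below only builds and prints the argument list; it does not launch anything. Several things here are assumptions: the model path is a placeholder, the helper name is hypothetical, `--flash-attn on` follows recent llama.cpp builds (older builds treat `-fa` as a bare switch), and the total context is multiplied by the slot count on the assumption that the server divides `--ctx-size` evenly across `--parallel` slots. Check `llama-server --help` for your build before relying on any of it.

```python
import shlex


def build_llama_server_args(model_path: str, ctx_per_slot: int = 16384,
                            n_parallel: int = 2) -> list[str]:
    """Assemble llama-server arguments for full GPU offload (illustrative)."""
    if ctx_per_slot < 1 or n_parallel < 1:
        raise ValueError("ctx_per_slot and n_parallel must be positive")
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "999",   # large value: offload every layer to the GPU
        # ASSUMPTION: total context is shared evenly between parallel slots.
        "--ctx-size", str(ctx_per_slot * n_parallel),
        "--flash-attn", "on",      # ASSUMPTION: recent builds take on|off|auto
        "--parallel", str(n_parallel),
    ]


if __name__ == "__main__":
    # Placeholder path; substitute the GGUF file you actually downloaded.
    args = build_llama_server_args("/path/to/model.gguf", 16384, 2)
    print(shlex.join(args))
```

Keeping the per-slot context at 16K and scaling the total with the slot count keeps each concurrent sequence within the context limit suggested in item 2.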

✅ Conclusion: Yes, it is feasible to run unsloth/Qwen3.5-122B-A10B-MTP-GGUF on a single MI300X with llama.cpp, but expect moderate throughput (~20–25 t/s) and plan context length accordingly.





