Running Qwen3.5-122B on MI300X | Generated by AI
Question: Can the model unsloth/Qwen3.5-122B-A10B-MTP-GGUF run on an AMD MI300X GPU with llama.cpp?
Answer:
Let’s break this down carefully with numbers:
1. Model Specs
- Qwen3.5-122B → 122 billion parameters
- A10B-MTP → Likely means a mixed precision quantization (possibly 10-bit or custom) + MTP (multi-task pretraining)
- GGUF format → Compatible with
llama.cpp
Estimated memory footprint:
- Dense FP16: 122B × 2 bytes ≈ 244 GB
- With Q4_K / 4-bit quantization: ~30–35% of FP16 → ~73–85 GB
- With A10B-MTP (10-bit + compression): might reduce to ~90–95 GB or slightly higher depending on KV cache and context length
The MI300X has 192 GB VRAM, so in theory it can fit this model in GPU memory with a single instance.
2. Token Context Considerations
-
llama.cpprequires additional VRAM for:- KV cache → grows with
context length × hidden size × #layers - Intermediate activations / working memory
- KV cache → grows with
-
With typical 4K–8K context, you probably need ~30–50 GB extra VRAM.
Estimated total VRAM usage:
Model weights (~90–95 GB) + KV cache & buffers (~50 GB) ≈ 140–145 GB
- Leaves ~50 GB headroom → fine for 8K–16K context
- For very long contexts (>50K tokens), you’ll hit memory limits
3. Throughput (tg128) Expectations
- MI300X token generation is memory bandwidth-limited
-
Rough estimate based on 70B-class benchmarks (40–45 GB weights → 45 t/s):
- Model is ~2× size → throughput roughly halves → 20–25 tokens/sec
- So, realistic throughput for Qwen3.5-122B-A10B-MTP on MI300X: 20–25 t/s per single GPU instance
4. Practical Recommendations
- Use Q4_K or A10B-MTP quantization → reduces VRAM footprint
- Keep context length reasonable (≤16K tokens) for smooth operation
- Enable flash attention in llama.cpp → reduces memory pressure, improves speed
- Multi-instance / batching → MI300X has enough VRAM to run 2–3 concurrent smaller sequences for throughput
✅ Conclusion: Yes, it is feasible to run unsloth/Qwen3.5-122B-A10B-MTP-GGUF on a single MI300X with llama.cpp, but expect moderate throughput (~20–25 t/s) and plan context length accordingly.
References:
If you want, I can make a VRAM map + token throughput estimate table for different context lengths specifically for this 122B GGUF model on MI300X, so you can plan usage precisely. Do you want me to do that?