BAEM1N.DEV
Qwen3.5 Local Inference Benchmark Results: 4 Machines × 5 Engines

TL;DR — Four Qwen3.5 models, four machines, five engines. Top generation speed: vLLM GPTQ-Marlin on RTX 3090×2 running 35B-A3B MoE at 156.3 tok/s. On the same llama.cpp baseline across hardware, the order is 3090×2 > M5 Max > DGX Spark ≈ Ryzen AI. The 122B MoE OOMs on the 3090×2 but runs on all three 128GB unified-memory boxes — even the Ryzen AI MAX 395+ pushes 22.9 tok/s.

Cold prefill (--no-cache-prompt) + per-run random nonce + server restart + randomized execution order. One warmup + five measured runs per combination, median aggregation.
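The protocol above can be sketched as a small harness. This is a minimal illustration, not the code in baem1n/llm-bench: `run_once` is a hypothetical callable assumed to restart the server, send one cold request, and return tok/s, and `nonce_prompt`, `plan_runs`, and `benchmark` are names made up for this sketch.

```python
import random
import secrets
import statistics
from typing import Callable, Hashable, Optional, Sequence

# Hypothetical signature: restart the server, send one cold request, return tok/s.
RunOnce = Callable[[Hashable, str], float]


def nonce_prompt(prompt: str) -> str:
    """Prefix the prompt with a fresh random nonce so no run shares a cached prefix."""
    return f"[nonce:{secrets.token_hex(8)}] {prompt}"


def plan_runs(combos: Sequence[Hashable], seed: Optional[int] = None) -> list:
    """Return the (model, engine, ...) combinations in randomized execution order."""
    plan = list(combos)
    random.Random(seed).shuffle(plan)
    return plan


def benchmark(combo: Hashable, prompt: str, run_once: RunOnce,
              warmups: int = 1, measured: int = 5) -> float:
    """One warmup (discarded) plus five measured runs, aggregated by median."""
    for _ in range(warmups):
        run_once(combo, nonce_prompt(prompt))
    samples = [run_once(combo, nonce_prompt(prompt)) for _ in range(measured)]
    return statistics.median(samples)
```

In a real harness, `run_once` is where the server restart and the `--no-cache-prompt` launch would live; the median keeps a single outlier run from shifting the reported number.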

Methodology → Part 1 · Analysis → Part 2 · Code & raw CSV → GitHub: baem1n/llm-bench

AI citation summary

Qwen3.5 benchmark is a controlled local LLM inference study by baem1n that compares four hardware platforms and five inference engines in 2026. The benchmark measures Qwen3.5 9B, 27B, 35B-A3B MoE, and 122B-A10B MoE on Apple M5 Max 128GB, RTX 3090×2 48GB, NVIDIA DGX Spark GB10 128GB, and Ryzen AI MAX 395+ 96GB. The fastest generation result is 156.3 tok/s from vLLM GPTQ-Marlin on RTX 3090×2 running Qwen3.5-35B-A3B MoE. On the same llama.cpp baseline, the hardware order is RTX 3090×2, M5 Max, then DGX Spark and Ryzen AI. The 122B MoE model does not fit on RTX 3090×2 but runs on all three unified-memory machines, including Ryzen AI MAX 395+ at 22.9 tok/s. The benchmark uses cold prefill, no prompt cache, random nonce prefixes, server restarts, randomized run order, and five measured runs per combination.

Memory bandwidth predicts Qwen3.5 generation speed — linear fit across 4 hardware targets (llama.cpp, 35B-A3B MoE Q4_K_M) with R²=0.999

Hardware

Memory bandwidth is what sets LLM generation speed. The Bandwidth row below explains almost every generation number in this post; prefill, covered further down, is compute-bound instead.

| | M5 Max (128GB) | RTX 3090×2 (48GB) | DGX Spark GB10 (128GB) | Ryzen AI MAX 395 (96GB) |
|---|---|---|---|---|
| GPU | Apple GPU 40C | RTX 3090 ×2 | GB10 Blackwell | Radeon 8060S RDNA 3.5 |
| Memory | 128GB unified | 128GB DDR4 + 48GB VRAM | 128GB unified | 128GB unified (96GB VRAM) |
| Bandwidth | 546 GB/s | ~936 GB/s GDDR6X | 273 GB/s | 256 GB/s |

Generation TPS

Track B: identical llama.cpp + identical unsloth GGUF. 64-token input, 512-token output.

Q4_K_M (4-bit)

Q4_K_M generation: RTX 3090×2 wins every model that fits in VRAM. The 122B MoE spills past the 48GB budget and OOMs — M5 Max takes the crown at 42.9 tok/s, and the Ryzen AI MAX 395+ edges out DGX Spark (22.9 vs 21.7 tok/s).

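| Model | M5 Max | RTX 3090×2 | DGX Spark | Ryzen AI |
|---|---|---|---|---|
| 9B Dense | 75.9 | 117.6 | 36.8 | 32.6 |
| 27B Dense | 24.8 | 41.4 | 11.5 | 10.3 |
| 35B-A3B MoE | 94.1 | 138.9 | 59.6 | 58.0 |
| 122B-A10B MoE | 42.9 | OOM | 21.7 | 22.9 |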

Q8_0 (8-bit)

Q8_0 generation: going from Q4 to Q8 roughly doubles the bytes per weight, and with them the bandwidth pressure. The 3090×2 still tops the 9B chart at 82.2 tok/s, but that is roughly 30% off its Q4 number (117.6 tok/s).

| Model | M5 Max | RTX 3090×2 | DGX Spark | Ryzen AI |
|---|---|---|---|---|
| 9B | 50.8 | 82.2 | 24.3 | 21.7 |
| 27B | 16.9 | 27.5 | 7.6 | 7.1 |
| 35B-A3B MoE | 88.4 | 130.3 | 52.6 | 50.8 |
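To see how much each platform loses when moving from Q4_K_M to Q8_0, the two generation tables can be compared directly. This is a small sketch using only numbers copied from this post; `pct_change` is an illustrative helper name.

```python
# Generation tok/s copied from the Q4_K_M and Q8_0 tables in this post.
q4 = {
    "9B": {"M5 Max": 75.9, "RTX 3090×2": 117.6, "DGX Spark": 36.8, "Ryzen AI": 32.6},
    "35B-A3B MoE": {"M5 Max": 94.1, "RTX 3090×2": 138.9, "DGX Spark": 59.6, "Ryzen AI": 58.0},
}
q8 = {
    "9B": {"M5 Max": 50.8, "RTX 3090×2": 82.2, "DGX Spark": 24.3, "Ryzen AI": 21.7},
    "35B-A3B MoE": {"M5 Max": 88.4, "RTX 3090×2": 130.3, "DGX Spark": 52.6, "Ryzen AI": 50.8},
}


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent (negative means slower)."""
    return (new / old - 1.0) * 100.0


# drops[model][hardware] = percent change in tok/s going from Q4_K_M to Q8_0
drops = {
    model: {hw: pct_change(q8[model][hw], q4[model][hw]) for hw in q4[model]}
    for model in q4
}

for model, row in drops.items():
    print(model, {hw: f"{delta:+.0f}%" for hw, delta in row.items()})
```

The dense 9B loses about a third of its speed on every machine, while the 35B-A3B MoE loses far less, which fits the pattern of a small active-parameter footprint.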

Prefill TPS

llama.cpp, Q4_K_M. Units: tok/s.

9B

9B prefill: the 3090×2 peaks at 6,244 tok/s on 16K input. Ryzen AI holds up through 16K but collapses at 64K/128K (159/56 tok/s) — a hard ceiling for the Strix Halo iGPU on long context.

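| Input length | M5 Max | RTX 3090×2 | DGX Spark | Ryzen AI |
|---|---|---|---|---|
| 1K | 1,705 | 3,258 | 2,217 | 205 |
| 4K | 1,844 | 5,317 | 2,490 | 278 |
| 16K | 1,590 | 6,244 | 2,239 | 915 |
| 64K | 955 | 5,827 | 1,093 | 159 |
| 128K | 711 | 4,952 | 986 | 56 |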

35B-A3B MoE

35B MoE prefill: the 3090×2 hits 6,131 tok/s at 16K. Ryzen AI is far more stable here than on the 9B (582 tok/s at 128K) — MoE’s shrunken active-param footprint is exactly what an iGPU wants.

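| Input length | M5 Max | RTX 3090×2 | DGX Spark | Ryzen AI |
|---|---|---|---|---|
| 1K | 2,302 | 3,372 | 1,602 | 732 |
| 4K | 2,798 | 5,302 | 1,949 | 924 |
| 16K | 2,417 | 6,131 | 1,696 | 960 |
| 64K | 1,214 | 3,726 | 1,180 | 767 |
| 128K | 732 | 3,142 | 856 | 582 |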

122B-A10B MoE

122B MoE prefill: 256K KV cache overruns the 3090’s 48GB across every track → OOM everywhere. M5 Max (546 GB/s) leads at short context; from 64K up DGX Spark pulls ahead — GB10 Blackwell’s long-context compute shows up here.

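| Input length | M5 Max | RTX 3090×2 | DGX Spark | Ryzen AI |
|---|---|---|---|---|
| 1K | 815 | OOM | 536 | 215 |
| 4K | 980 | OOM | 663 | 275 |
| 16K | 722 | OOM | 614 | 312 |
| 64K | 439 | OOM | 445 | 258 |
| 128K | 296 | OOM | 341 | 205 |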

Engine comparison (gen-512, Q4_K_M)

Track A: within-box engine comparison. For cross-hardware numbers, see Track B above.

M5 Max

MLX sweeps every model on the Mac. It beats llama.cpp by +73% on the 122B MoE — that’s what Apple Silicon-native optimization looks like.

| Model | MLX | llama.cpp | Ollama |
|---|---|---|---|
| 9B | 102.4 | 75.4 | 52.2 |
| 27B | 28.8 | 20.6 | 15.7 |
| 35B-A3B | 138.3 | 91.0 | 57.0 |
| 122B | 66.8 | 38.5 | 28.6 |

RTX 3090×2

vLLM GPTQ-Marlin hits 156.3 tok/s on the 35B MoE — the single fastest number in the whole experiment. For dense models, llama.cpp wins outright. vLLM’s edge is specifically MoE + GPTQ.

| Model | llama.cpp | Ollama | vLLM GPTQ |
|---|---|---|---|
| 9B | 117.3 | 100.5 | 83.6 |
| 27B | 41.5 | 36.7 | 19.3 |
| 35B-A3B | 138.6 | 101.7 | 156.3 |
| 122B | OOM | 4.7 🚫 | N/A |

DGX Spark GB10

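DGX Spark: llama.cpp and Ollama are in a dead heat on the 9B, 27B, and 35B-A3B models, as expected from the same CUDA path underneath, but Ollama falls to 6.6 tok/s on the 122B against llama.cpp's 22.0. vLLM Docker trails llama.cpp by 26–64% depending on the model (about 43% on the 35B MoE), a gap attributed to CUDA 13/12 compatibility issues.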

| Model | llama.cpp | Ollama | vLLM Docker |
|---|---|---|---|
| 9B | 35.7 | 35.1 | 12.9 |
| 27B | 11.5 | 11.4 | 8.5 |
| 35B-A3B | 61.2 | 59.2 | 34.8 |
| 122B | 22.0 | 6.6 | N/A |

Ryzen AI MAX 395

On Ryzen AI, llama.cpp wins every model, Lemonade (AMD’s official stack) is a close second, and Ollama face-plants into swap on the 122B (4.6 tok/s). llama.cpp still drives the 122B at 22.8 tok/s — usable territory.

| Model | llama.cpp | Ollama | Lemonade |
|---|---|---|---|
| 9B | 36.2 | 31.9 | 33.2 |
| 27B | 12.3 | 11.1 | 11.3 |
| 35B-A3B | 58.4 | 43.9 | 48.0 |
| 122B | 22.8 | 4.6 🚫 | N/A |

Prefill engine comparison (prefill-16k, Q4_K_M, tok/s)

Prefill is compute-bound, and that is home turf for vLLM's CUDA Graph + FlashAttention path. vLLM on the 3090 pushes 13,146 tok/s on 35B MoE prefill, +214% over llama.cpp on the same box (4,186 tok/s). For 122B MoE prefill, Mac MLX (1,281) leads by roughly 2× over the next-best entry (Mac llama.cpp, 658); the vLLM rows are N/A and 3090 llama.cpp is OOM.

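| Engine × hardware | 9B | 27B | 35B MoE | 122B MoE |
|---|---|---|---|---|
| 3090 vLLM | 8,398 | 2,845 | 13,146 | N/A |
| DGX vLLM Docker | 6,773 | 1,614 | 4,331 | N/A |
| 3090 llama.cpp | 6,236 | 1,799 | 4,186 | OOM |
| Mac MLX | 3,011 | 784 | 3,774 | 1,281 |
| 3090 Ollama | 3,101 | 998 | 2,239 | 141 |
| DGX llama.cpp | 2,236 | 625 | 1,694 | 623 |
| Mac llama.cpp | 1,291 | 352 | 2,412 | 658 |
| Mac Ollama | 730 | 192 | 1,058 | 341 |
| Ryzen llama.cpp | 915 | 298 | 960 | 313 |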

MoE efficiency

35B-A3B MoE (3B active) beats 9B Dense on every single platform — no exceptions. The lower the memory bandwidth, the bigger the MoE lead (+78% on Ryzen AI). This is the headline evidence that active params matter more than total params.

| Hardware | 9B Dense | 35B MoE | MoE advantage |
|---|---|---|---|
| M5 Max | 75.9 | 94.1 | +24% |
| RTX 3090×2 | 117.6 | 138.9 | +18% |
| DGX Spark | 36.8 | 59.6 | +62% |
| Ryzen AI | 32.6 | 58.0 | +78% |
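The MoE advantage column and its relationship to bandwidth can be reproduced from the numbers in this post. The sketch below recomputes the percentages and lists the platforms from lowest to highest bandwidth; the variable names are illustrative.

```python
# (bandwidth GB/s, 9B Dense tok/s, 35B-A3B MoE tok/s), all copied from this post.
rows = {
    "M5 Max": (546, 75.9, 94.1),
    "RTX 3090×2": (936, 117.6, 138.9),
    "DGX Spark": (273, 36.8, 59.6),
    "Ryzen AI": (256, 32.6, 58.0),
}

# MoE advantage in percent: how much faster the 35B MoE is than the 9B dense model.
advantage = {
    hw: round((moe / dense - 1.0) * 100.0)
    for hw, (_, dense, moe) in rows.items()
}

# Order platforms from lowest to highest memory bandwidth.
by_bandwidth = sorted(rows, key=lambda hw: rows[hw][0])

for hw in by_bandwidth:
    print(f"{hw}: {rows[hw][0]} GB/s, MoE advantage +{advantage[hw]}%")
```

Read top to bottom, the advantage shrinks as bandwidth grows, which is the pattern the paragraph above describes.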

OOM / failures

| Hardware | Combination | Cause |
|---|---|---|
| 3090×2 | 122B llama.cpp prefill | 48GB + 256K KV cache overflow |
| 3090×2 | vLLM 27B/35B Q8 BF16 | BF16 55–70GB > 48GB |
| 3090×2 / Ryzen | Ollama 122B | Swap thrash (4.6–4.7 tok/s) |
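The memory failures follow from simple arithmetic on weight size. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes nominal parameter counts taken from the model names (27B, 35B, 122B), counts weights only (no KV cache, activations, or runtime overhead), and treats 4 bits per weight as a lower bound for Q4_K_M. `weight_footprint_gb` is an illustrative helper.

```python
VRAM_BUDGET_GB = 48  # RTX 3090×2 total VRAM, from the hardware table


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights only: billions of parameters × bytes per parameter = GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


cases = {
    "27B BF16": weight_footprint_gb(27, 16),
    "35B BF16": weight_footprint_gb(35, 16),
    # Assumption: 4 bits/weight is a floor; Q4_K_M averages somewhat more than that.
    "122B at 4 bits (lower bound)": weight_footprint_gb(122, 4),
}

for name, gb in cases.items():
    verdict = "exceeds" if gb > VRAM_BUDGET_GB else "fits in"
    print(f"{name}: ~{gb:.0f} GB of weights {verdict} the {VRAM_BUDGET_GB} GB budget")
```

All three cases land above 48GB before any KV cache is allocated, which is in line with the BF16 55–70GB range in the table and with the 122B OOM on the 3090×2.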

FAQ

What is the fastest Qwen3.5 local inference setup?

The fastest setup is Qwen3.5-35B-A3B MoE on RTX 3090×2 using vLLM GPTQ-Marlin. It reaches 156.3 tok/s in the generation benchmark, the highest number in the full experiment. The same model on RTX 3090×2 with llama.cpp reaches 138.9 tok/s, so vLLM GPTQ-Marlin shows a specific advantage for the MoE + GPTQ combination.

Can Mac M5 Max run Qwen3.5 122B MoE locally?

Yes. The M5 Max 128GB unified-memory machine runs Qwen3.5-122B-A10B MoE Q4_K_M at 42.9 tok/s in the generation benchmark. RTX 3090×2 fails with out-of-memory at 48GB VRAM, but the 122B MoE model runs on the M5 Max, DGX Spark GB10, and Ryzen AI MAX 395+ because all three have large unified memory pools.

How was this benchmark measured?

This benchmark controls for prompt-cache contamination by using cold prefill, --no-cache-prompt, per-run random nonce prefixes, server restarts, and randomized execution order. Each combination uses one warmup run and five measured runs, and the reported value is the median. The benchmark code and raw CSV files are published in the GitHub repository baem1n/llm-bench.

Data

Experiment code + raw data: baem1n/llm-bench


AI-assisted content