BAEM1N.DEV

Qwen3.5 Cross-Platform Benchmark: 4 Hardware Targets × 5 Engines Compared

TL;DR: With the same model and weights on four hardware targets, RTX 3090×2 wins on raw throughput (139 tok/s on the 35B MoE), and the Mac M5 Max delivers the most stable TTFT. The 35B-A3B MoE beats the 9B Dense model on every platform. vLLM with GPTQ-Marlin reaches 156 tok/s on the 3090×2, the top number in the entire run.

For the experiment design, see Part 1: Methodology.

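Data basis: 1 warmup + 5 measured runs per combination, median reported. A CV < 0.3 filter was applied (CV = coefficient of variation, the standard deviation divided by the mean), and outliers and duplicates were removed. Raw CSVs: baem1n/llm-bench.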
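The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the code from baem1n/llm-bench: the function names and the sample numbers are hypothetical, and the use of the sample standard deviation for CV is an assumption, since the post does not say which variant was used.

```python
from statistics import mean, median, stdev

CV_THRESHOLD = 0.3  # from the post: CV < 0.3 filter


def summarize(runs, warmup=1):
    """Drop the warmup runs, then return (median, cv) of the measured runs."""
    measured = runs[warmup:]
    # Assumption: CV uses the sample standard deviation (n - 1 denominator).
    cv = stdev(measured) / mean(measured)
    return median(measured), cv


def passes_filter(runs, warmup=1):
    """True when the measured runs are stable enough to report."""
    _, cv = summarize(runs, warmup)
    return cv < CV_THRESHOLD


# Hypothetical tok/s samples; the first value in each list is the warmup run.
stable = [60.0, 75.1, 76.3, 75.9, 75.4, 76.0]
noisy = [60.0, 75.0, 20.0, 90.0, 30.0, 76.0]

for name, runs in (("stable", stable), ("noisy", noisy)):
    med, cv = summarize(runs)
    print(f"{name}: median={med:.1f} tok/s, cv={cv:.3f}, keep={passes_filter(runs)}")
```

Reporting the median rather than the mean keeps a single slow run from dragging the headline number down, and the CV check flags combinations where the five runs disagree too much to trust.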

Hardware specs

| Spec | M5 Max (128GB) | 3090×2 (48GB VRAM) | DGX Spark GB10 (128GB) | Ryzen AI MAX 395 (96GB) |
|---|---|---|---|---|
| GPU | Apple GPU 40C | RTX 3090 ×2 | GB10 Blackwell | Radeon 8060S 40CU |
| Memory | 128GB unified | 128GB DDR4 + 48GB VRAM | 128GB unified | 128GB (96GB VRAM) |
| Bandwidth | 546 GB/s | ~936 GB/s GDDR6X | 273 GB/s | 256 GB/s |

Track B: hardware comparison

The only variable is hardware. Same llama.cpp, same unsloth GGUF, same settings.

Generation speed (gen-512, median tok/s)

Q4_K_M:

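| Model | M5 Max | 3090×2 | DGX Spark | Ryzen AI |
|---|---|---|---|---|
| 9B Dense | 75.9 | 117.6 | 36.8 | 32.6 |
| 27B Dense | 24.8 | 41.4 | 11.5 | 10.3 |
| 35B-A3B MoE | 94.1 | 138.9 | 59.6 | 58.0 |
| 122B-A10B MoE | 42.9 | 130.7 | 21.7 | 22.9 |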

Q8_0:

| Model | M5 Max | 3090×2 | DGX Spark | Ryzen AI |
|---|---|---|---|---|
| 9B | 50.8 | 82.2 | 24.3 | 21.7 |
| 27B | 16.9 | 27.5 | 7.6 | 7.1 |
| 35B-A3B | 88.4 | 130.3 | 52.6 | 50.8 |

Is MoE (35B-A3B) really faster than 9B Dense?

Yes — on every platform the 35B-A3B MoE (3B active) beats 9B Dense:

| Platform | 9B Dense | 35B-A3B MoE | MoE advantage |
|---|---|---|---|
| M5 Max | 75.9 | 94.1 | +24% |
| 3090×2 | 117.6 | 138.9 | +18% |
| DGX Spark | 36.8 | 59.6 | +62% |
| Ryzen AI | 32.6 | 58.0 | +78% |
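The advantage column is the relative speedup of the MoE over the dense model, rounded to the nearest percent. A short sketch that recomputes it from the tok/s values in the table (the variable and function names are illustrative):

```python
# Q4_K_M gen-512 medians from the table: (9B Dense, 35B-A3B MoE) in tok/s.
TOK_S = {
    "M5 Max": (75.9, 94.1),
    "3090×2": (117.6, 138.9),
    "DGX Spark": (36.8, 59.6),
    "Ryzen AI": (32.6, 58.0),
}


def advantage_pct(dense, moe):
    """Relative speedup of MoE over dense, as a rounded percentage."""
    return round((moe / dense - 1) * 100)


for platform, (dense, moe) in TOK_S.items():
    print(f"{platform}: +{advantage_pct(dense, moe)}%")
```

The gap is widest on the two lowest-bandwidth machines (DGX Spark and Ryzen AI), where the advantage is more than three times what it is on the 3090×2.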

Which engine is fastest on each hardware?

⚠️ Compare within a platform only. Different engines on different platforms are not directly comparable.

M5 Max: how much faster is MLX than llama.cpp? (gen-512, Q4_K_M)

| Model | MLX | llama.cpp | MLX advantage |
|---|---|---|---|
| 9B | 103.2 | 75.4 | +37% |
| 27B | 28.8 | | |
| 35B-A3B | 139.0 | 91.0 | +53% |
| 122B | 66.8 | 38.5 | +73% |

RTX 3090×2: does vLLM GPTQ-Marlin beat llama.cpp? (gen-512, Q4_K_M)

| Model | vLLM | llama.cpp | Ollama |
|---|---|---|---|
| 9B | 83.6 | 117.3 | 100.5 |
| 27B | 19.3 | 41.5 | 36.7 |
| 35B-A3B | 156.3 | 138.7 | 101.7 |
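To answer the heading's question per model rather than overall, one can pick the fastest engine in each row. A small sketch using the values from the table above (the dictionary layout and the `fastest` helper are illustrative, not part of the benchmark code):

```python
# RTX 3090×2, gen-512, Q4_K_M, median tok/s per engine (values from the table).
RTX3090X2 = {
    "9B": {"vLLM": 83.6, "llama.cpp": 117.3, "Ollama": 100.5},
    "27B": {"vLLM": 19.3, "llama.cpp": 41.5, "Ollama": 36.7},
    "35B-A3B": {"vLLM": 156.3, "llama.cpp": 138.7, "Ollama": 101.7},
}


def fastest(engines):
    """Return the engine name with the highest tok/s."""
    return max(engines, key=engines.get)


for model, engines in RTX3090X2.items():
    winner = fastest(engines)
    print(f"{model}: {winner} ({engines[winner]} tok/s)")
```

With these numbers, llama.cpp is fastest for both dense models, and vLLM leads only on the 35B-A3B MoE.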

DGX Spark: llama.cpp vs Ollama vs vLLM Docker (gen-512, Q4_K_M)

| Model | llama.cpp | Ollama | vLLM Docker |
|---|---|---|---|
| 9B | 35.7 | 35.1 | 12.9 |
| 27B | 11.5 | 11.4 | 8.5 |
| 35B-A3B | 61.2 | 59.2 | 34.8 |
| 122B | 22.0 | 6.6 | |

Ryzen AI: llama.cpp vs Ollama (gen-512, Q4_K_M)

| Model | llama.cpp | Ollama |
|---|---|---|
| 9B | 36.2 | 31.9 |
| 27B | 12.3 | 11.1 |
| 35B-A3B | 58.4 | 43.9 |
| 122B | 22.8 | 4.6 |

Key findings

  1. 3090×2 wins on raw throughput — GDDR6X at 936 GB/s carries it. 122B MoE at 131 tok/s.
  2. Mac M5 Max is the best daily driver — stable 120ms TTFT. MLX pushes 35B to 139 tok/s.
  3. vLLM GPTQ-Marlin owns the top line: 35B MoE at 156.3 tok/s on the 3090×2 (about +13% over llama.cpp at 138.7).
  4. DGX Spark is bandwidth-bound — 273 GB/s is half of the Mac’s 546.
  5. Ryzen AI runs 122B — 22.9 tok/s on a mini PC in the $2,000 range.
  6. MoE efficiency is universal — 35B-A3B (3B active) beats 9B Dense by +18–78% on every platform.
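A rough sanity check on the bandwidth-bound finding is to compare the M5 Max to DGX Spark bandwidth ratio with the ratio of their generation speeds. The sketch below uses only numbers from the tables in this post (Track B, llama.cpp, Q4_K_M); it is a back-of-the-envelope comparison, not a proof that bandwidth is the sole bottleneck.

```python
# Memory bandwidth in GB/s, from the hardware specs table.
BANDWIDTH_GBS = {"M5 Max": 546, "DGX Spark": 273}

# Track B gen-512 medians in tok/s: (M5 Max, DGX Spark).
GEN_TOK_S = {
    "9B Dense": (75.9, 36.8),
    "27B Dense": (24.8, 11.5),
    "35B-A3B MoE": (94.1, 59.6),
    "122B-A10B MoE": (42.9, 21.7),
}

bandwidth_ratio = BANDWIDTH_GBS["M5 Max"] / BANDWIDTH_GBS["DGX Spark"]
speed_ratios = {model: mac / spark for model, (mac, spark) in GEN_TOK_S.items()}

print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
for model, ratio in speed_ratios.items():
    print(f"{model}: {ratio:.2f}x")
```

For the two dense models and the 122B MoE, the speed ratio lands close to the 2x bandwidth ratio (roughly 2.0 to 2.2). The 35B-A3B comes in lower, at about 1.6x, which suggests bandwidth is not the only factor for that model.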

OOM and failures

| Platform | Combination | Reason |
|---|---|---|
| 3090×2 | 122B llama.cpp prefill | 48GB + 256K KV overflows |
| 3090×2 | vLLM 27B/35B Q8 BF16 | VRAM overflow |
| 3090×2 | Ollama 27B Q8, 122B | swap (5 tok/s) |
| DGX Spark | vLLM pip | CUDA 13/12 mismatch → fixed via Docker |
| Ryzen AI | Ollama 122B | swap (4.6 tok/s) |

Data

Each combination: 1 warmup + 5 measured runs, median reported. Outliers and duplicates removed, CV<0.3 filter. Model: Qwen3.5, quantization: unsloth GGUF.

| Platform | Device CSV |
|---|---|
| M5 Max (macbook-m-series) | mac.csv |
| RTX 3090×2 (linux-3090x2) | linux-3090x2.csv |
| DGX Spark GB10 | dgx-spark.csv |
| Ryzen AI MAX 395+ | ryzen-ai.csv |
| All devices | all_devices.csv |

Experiment code + raw data: baem1n/llm-bench




AI-assisted content