
Local LLM Inference Benchmark: Experimental Design Across 4 Hardware Platforms and 5 Engines


TL;DR: Track B pins the engine, weights, and settings to isolate pure hardware differences across four machines. Track A flips that around and compares engines within a single platform. The design blocks every common trap — prompt cache pollution, prefix reuse, context policy drift, and execution-order bias.


Why run this experiment

“How fast can I run Qwen3.5-35B on a MacBook?” — answering that honestly takes a surprisingly careful experiment.

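It seems like you should just spin up llama-server and measure tokens/sec. In practice, several traps skew that number: prompt cache pollution, prefix reuse, context policy drift, and execution-order bias.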

This post shares a benchmark design that identifies and blocks every one of these traps.


Four hardware platforms

| ID | Machine | Memory | GPU/Accelerator | Notes |
| --- | --- | --- | --- | --- |
| macbook-m-series | MacBook Pro 14 (M5 Max) | 128GB unified | Apple GPU (40 cores) | 546 GB/s bandwidth |
| linux-5950x-3090x2 | Ryzen 9 5950X + RTX 3090 ×2 | 128GB DDR4 + 48GB VRAM | CUDA (Ampere) | Discrete GPU, PCIe |
| dgx-spark | NVIDIA DGX Spark (GB10) | 128GB unified | Blackwell GPU | 273 GB/s, CUDA 13 |
| ryzen-ai-max-395 | HP Z2 Mini G1a (Strix Halo) | 128GB unified (96GB VRAM) | Radeon 8060S (Vulkan/ROCm) | iGPU, 256 GB/s |

All four machines carry 128GB of memory, enough to run even the 122B MoE model.


Models

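| Model | Architecture | Total params | Active params | Context |
| --- | --- | --- | --- | --- |
| Qwen3.5-9B | Dense | 9B | 9B | 256K |
| Qwen3.5-27B | Dense | 27B | 27B | 256K |
| Qwen3.5-35B-A3B | MoE | 35B | ~3B | 256K |
| Qwen3.5-122B-A10B | MoE | 122B | ~10B | 256K |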

Quantization: Q4_K_M (4-bit) and Q8_0 (8-bit), all unsloth GGUF.


Two tracks

Track B — Hardware comparison

Variable: hardware only. Engine, weights, and settings all pinned.

| Item | Value |
| --- | --- |
| Engine | llama.cpp (identical version) |
| Weights | unsloth GGUF (Q4_K_M, Q8_0) |
| Settings | flash_attn=on, batch=512, ubatch=512, no-cache-prompt |
| Context | Model native (256K); record as failure on OOM |

This is the track that answers questions like “running the same model, how many times slower is a Mac than a DGX?”

Track A — Engine comparison (within a platform)

Variable: engine only. Hardware pinned.

Every available backend on each platform:

| Platform | Backends |
| --- | --- |
| Mac | llama.cpp, Ollama, MLX |
| 3090 | llama.cpp, Ollama, vLLM |
| DGX Spark | llama.cpp, Ollama, vLLM |
| Ryzen AI | llama.cpp, Ollama, Lemonade |

Interpretation scope: Track A is strictly a within-platform comparison. Putting the MLX numbers from the Mac next to the vLLM numbers from Linux and calling it an “engine comparison” is not valid.


Measurement tracks

Generation — output throughput

| Track ID | Input | Output |
| --- | --- | --- |
| gen-512 | 64 tok | 512 tok |
| gen-2048 | 64 tok | 2,048 tok |
| gen-4096 | 64 tok | 4,096 tok |
| gen-8192 | 64 tok | 8,192 tok |

Prefill — input processing throughput

| Track ID | Input | Output |
| --- | --- | --- |
| prefill-1k | 1,024 tok | 10 tok |
| prefill-4k | 4,096 tok | 10 tok |
| prefill-16k | 16,384 tok | 10 tok |
| prefill-64k | 65,536 tok | 10 tok |
| prefill-128k | 131,072 tok | 10 tok |
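For a harness, the two track tables can be written down as plain data. The following is a minimal sketch, not the benchmark's actual code: the `Track` class and its field names are hypothetical, and only the track IDs and token counts come from the tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """One measurement track (hypothetical structure, for illustration)."""
    track_id: str       # e.g. "gen-512" or "prefill-4k"
    input_tokens: int   # target prompt length in tokens
    output_tokens: int  # number of tokens to generate


# Generation tracks: short fixed prompt, growing output length.
GEN_TRACKS = [Track(f"gen-{n}", 64, n) for n in (512, 2048, 4096, 8192)]

# Prefill tracks: growing prompt length, fixed 10-token output.
PREFILL_TRACKS = [
    Track(f"prefill-{k}k", k * 1024, 10) for k in (1, 4, 16, 64, 128)
]

ALL_TRACKS = GEN_TRACKS + PREFILL_TRACKS
```

Keeping the tracks as data rather than as hard-coded loops makes it straightforward to shuffle them and to log the track ID next to every result.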

Experimental integrity

1. Fully disable prompt cache

llama.cpp’s --cache-prompt (on by default) and --slot-prompt-similarity (default 0.10) wreck prefill numbers.

In an early run, llama.cpp 128K prefill showed TTFT of 0.21s and prefill_tps of 574,324 tok/s. That wasn’t prefill — it was KV cache reuse performance.

How to kill it:

--no-cache-prompt              # disable prompt KV cache
--slot-prompt-similarity 0     # disable prefix reuse

vLLM: --no-enable-prefix-caching
SGLang: --disable-radix-cache
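One way to keep these flags from being forgotten is to build the server command line in a single place. This is a sketch under assumptions: the helper name `llama_server_argv`, the default port, and the reading of 256K as 262,144 tokens are illustrative choices rather than details from the benchmark code. The flash attention flag is left out on purpose because its syntax differs between llama.cpp versions.

```python
def llama_server_argv(model_path: str, port: int = 8080,
                      ctx_size: int = 262144) -> list[str]:
    """Build a llama-server command line with prompt caching disabled.

    Hypothetical helper: the port and ctx_size defaults are assumptions.
    """
    return [
        "llama-server",
        "--model", model_path,
        "--port", str(port),
        "--ctx-size", str(ctx_size),       # native context, assumed 256 * 1024
        "--batch-size", "512",
        "--ubatch-size", "512",
        "--no-cache-prompt",               # disable prompt KV cache
        "--slot-prompt-similarity", "0",   # disable prefix reuse
    ]
```

The returned list can be passed directly to `subprocess.Popen`, which avoids shell quoting issues with model paths.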

2. Regenerate the prompt per run (nonce prefix)

Each measurement run gets a fresh random nonce injected at the head of the prompt:

[run:8eovt3an7ge9lbtj96n55f57reqz92gd] The history of computing...

This guarantees that no two runs share a prompt prefix, so a server has nothing from an earlier run to reuse.
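A nonce in the format shown above could be generated roughly like this. The function names are hypothetical, and the 32-character lowercase alphanumeric alphabet is inferred from the single example, not from the benchmark source.

```python
import secrets
import string

# Alphabet inferred from the example nonce: lowercase letters and digits.
NONCE_ALPHABET = string.ascii_lowercase + string.digits


def make_nonce(length: int = 32) -> str:
    """Return a fresh random nonce for one measurement run."""
    return "".join(secrets.choice(NONCE_ALPHABET) for _ in range(length))


def with_nonce(prompt_body: str) -> str:
    """Prefix the prompt so that it differs from its very first tokens."""
    return f"[run:{make_nonce()}] {prompt_body}"
```

The nonce has to sit at the head of the prompt: prefix matching compares prompts from the first token onward, so a nonce at the end would leave the whole body reusable. Note that the nonce itself adds a few tokens to the input.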

3. Cold prefill (server restart)

The server process is restarted between prefill tracks. That fully resets KV cache, CUDA context, and allocator state from the previous track.
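The restart can be wrapped in a context manager so that every prefill track gets its own process. This is a minimal sketch with a hypothetical `fresh_server` helper; a real harness would also wait until the server reports ready before sending requests, which is not shown here.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def fresh_server(argv, shutdown_timeout=30.0):
    """Start a server process for one track and always stop it afterwards."""
    proc = subprocess.Popen(argv)
    try:
        yield proc
    finally:
        proc.terminate()                  # ask the process to exit
        try:
            proc.wait(timeout=shutdown_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()                   # force it if it does not exit in time
            proc.wait()
```

Usage would look like `with fresh_server(argv): run_track(...)`, where `run_track` stands for whatever function executes one track. Because the process is gone when the block exits, no state from one track can leak into the next.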

4. Enforce native context

Use the model’s native context window (Qwen3.5: 256K) as-is. On OOM, don’t shrink the context; record it as a failure. That’s how you get an honest answer to “can this hardware actually run 27B with a 256K context?”

5. Randomize execution order

Backend, model, and track order are all shuffled per run. Fixed order introduces bias from thermals, allocator state, and cache warmth.
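A seeded shuffle keeps the order random but reproducible, provided the seed is logged with the results. The sketch below assumes a nested shuffle (backends first, then models within a backend, then tracks within a model); whether the benchmark nests the shuffle this way or shuffles the flat list of combinations is not stated, so treat the structure as an assumption.

```python
import random


def shuffled_plan(backends, models, tracks, seed):
    """Return (backend, model, track) tuples in a reproducible random order."""
    rng = random.Random(seed)  # private RNG, does not touch global state
    plan = []
    for backend in rng.sample(backends, len(backends)):
        for model in rng.sample(models, len(models)):
            for track in rng.sample(tracks, len(tracks)):
                plan.append((backend, model, track))
    return plan
```

Nesting keeps model loads and backend switches grouped while still varying which backend, model, and track go first on each run.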

6. Log OOM and failures

Model load failures, context overflows, server crashes — everything lands in CSV as skip:load_fail, skip:ctx_exceeded, or failed. Knowing which combinations fail is as informative as the ones that succeed.
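As an illustration, a result writer can refuse any status it does not know, so failures cannot be silently dropped or misspelled. The column names and the `ok` status are assumptions for this sketch; only `skip:load_fail`, `skip:ctx_exceeded`, and `failed` come from the text above.

```python
import csv

# Hypothetical column layout; the real CSV schema may differ.
FIELDS = ["backend", "model", "track", "status", "detail"]

# "ok" is an assumed name for a successful run.
STATUSES = {"ok", "skip:load_fail", "skip:ctx_exceeded", "failed"}


def write_results(fh, rows):
    """Write result rows to an open text file, rejecting unknown statuses."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        if row["status"] not in STATUSES:
            raise ValueError(f"unknown status: {row['status']!r}")
        writer.writerow(row)
```

Writing skipped and failed combinations as ordinary rows means the analysis step sees the full experiment grid, not only the cells that produced numbers.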


Measurement protocol

| Item | Value |
| --- | --- |
| Warmup | 1 run (separate prompt, excluded from results) |
| Measurement | 5 runs, median aggregation |
| Inter-run wait | 5s |
| Inter-track wait | 60s |
| Inter-model wait | 120s |
| Inter-backend wait | 60s |
| Thermal guard | 60s cooldown above 85°C |
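The per-track part of this protocol can be sketched as a small loop. The callbacks `run_once` and `read_temp_c` are hypothetical stand-ins for the real request and temperature sensor code, and repeating the cooldown until the reading drops below the threshold is an assumption about how the thermal guard behaves.

```python
import statistics
import time


def measure_track(run_once, read_temp_c=None, sleep=time.sleep,
                  runs=5, inter_run_wait=5.0,
                  max_temp_c=85.0, cooldown=60.0):
    """Run one warmup plus `runs` measured runs and return the median.

    run_once(warmup) must return one numeric sample, e.g. tokens/sec.
    read_temp_c() returns the current temperature, or None to skip the guard.
    """
    run_once(True)  # warmup run, result discarded
    samples = []
    for i in range(runs):
        # Thermal guard: cool down while the reading is above the limit.
        while read_temp_c is not None and read_temp_c() > max_temp_c:
            sleep(cooldown)
        samples.append(run_once(False))
        if i < runs - 1:
            sleep(inter_run_wait)
    return statistics.median(samples)
```

The median is used instead of the mean so that a single slow run, for example one hit by a background task, does not move the reported number.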

Key metrics


Known limitations

  1. Weight format differences: In Track A, MLX (mlx 4-bit) and llama.cpp (GGUF Q4_K_M) are at the same nominal quantization level but have different implementations. This is closer to an engine+weights-package comparison than a pure engine comparison.

  2. Ollama’s structural TTFT disadvantage: Ollama pre-allocates the full context, so KV cache allocation time lands inside TTFT. At 256K context, that overhead runs into tens of seconds.

  3. Approximate input token counts: We target exact tokenizer-based counts, but fall back to a character-count approximation (3.8 chars/token) if the tokenizer fails to load.

  4. Output quality not validated: hit_rate is a length-completion metric, not a quality metric. Repetition and loops score well on hit_rate too.
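The fallback in limitation 3 could look like the sketch below. The `tokenize` callable is a hypothetical stand-in for whatever tokenizer the harness loads, and rounding to the nearest integer is an assumption; only the 3.8 chars/token ratio comes from the text.

```python
CHARS_PER_TOKEN = 3.8  # approximation ratio stated in limitation 3


def count_tokens(text, tokenize=None):
    """Return (token_count, is_exact).

    Uses the tokenizer when one is available and working, otherwise falls
    back to a character-count estimate.
    """
    if tokenize is not None:
        try:
            return len(tokenize(text)), True
        except Exception:
            pass  # tokenizer unusable, fall through to the estimate
    return round(len(text) / CHARS_PER_TOKEN), False
```

Returning the `is_exact` flag alongside the count lets the results record which rows were built from estimated prompt lengths.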


Next

Part 2: Performance results across 4 platforms and 5 engines digs into the actual measurement data.


Code and data

The full benchmark code and raw CSVs are open source:


AI-assisted content