Local LLM Inference Benchmark: Experimental Design Across 4 Hardware Platforms and 5 Engines

TL;DR: Track B pins the engine, weights, and settings to isolate pure hardware differences across four machines. Track A flips that around and compares engines within a single platform. The design blocks every common trap — prompt cache pollution, prefix reuse, context policy drift, and execution-order bias.

Table of contents

Why run this experiment
Four hardware platforms
Models
Two tracks
- Track B — Hardware comparison
- Track A — Engine comparison (within a platform)
Measurement tracks
- Generation — output throughput
- Prefill — input processing throughput
Experimental integrity
Measurement protocol
- Key metrics
Known limitations
Next
Code and data

Why run this experiment

“How fast can I run Qwen3.5-35B on a MacBook?” — answering that honestly takes a surprisingly careful experiment.

It seems like you should just spin up llama-server and measure tokens/sec. In practice:

- Prompt caching can inflate prefill numbers by 10x or more.
- TTFT is defined differently by each backend, so comparing the values each one reports is meaningless.
- Different weight formats (GGUF vs MLX) turn an engine comparison into an engine+weights-package comparison.
- Context window size affects gen_tps through KV cache occupancy.

This post shares a benchmark design that identifies and blocks every one of these traps.

Four hardware platforms

ID	Machine	Memory	GPU/Accelerator	Notes
`macbook-m-series`	MacBook Pro 14 (M5 Max)	128GB unified	Apple GPU (40 cores)	546 GB/s bandwidth
`linux-5950x-3090x2`	Ryzen 9 5950X + RTX 3090 ×2	128GB DDR4 + 48GB VRAM	CUDA (Ampere)	Discrete GPU, PCIe
`dgx-spark`	NVIDIA DGX Spark (GB10)	128GB unified	Blackwell GPU	273 GB/s, CUDA 13
`ryzen-ai-max-395`	HP Z2 Mini G1a (Strix Halo)	128GB unified (96GB VRAM)	Radeon 8060S (Vulkan/ROCm)	iGPU, 256 GB/s

All four machines carry 128GB of memory, enough to run even the 122B MoE model.

Models

Model	Architecture	Total params	Active params	Context
Qwen3.5-9B	Dense	9B	9B	256K
Qwen3.5-27B	Dense	27B	27B	256K
Qwen3.5-35B-A3B	MoE	35B	~3B	256K
Qwen3.5-122B-A10B	MoE	122B	~10B	256K

Quantization: Q4_K_M (4-bit) and Q8_0 (8-bit), all unsloth GGUF.

Two tracks

Track B — Hardware comparison

Variable: hardware only. Engine, weights, and settings all pinned.

Item	Value
Engine	llama.cpp (identical version)
Weights	unsloth GGUF (Q4_K_M, Q8_0)
Settings	flash_attn=on, batch=512, ubatch=512, no-cache-prompt
Context	Model native (256K) — record as failure on OOM
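
As a minimal sketch, the pinned settings in the table above could be turned into a llama-server command line like this. The function name `track_b_argv` and the port are hypothetical, the value 262,144 assumes "256K" means 256 × 1024 tokens, and the `--flash-attn on` spelling assumes a recent llama.cpp build (older builds take the flag without a value). The slot-similarity flag is included because it belongs to the same cache-disabling setup.

```python
# Assumption: "256K" native context means 256 * 1024 tokens.
NATIVE_CTX = 256 * 1024


def track_b_argv(model_path: str, port: int = 8080) -> list[str]:
    """Build a llama-server argv with every Track B setting pinned.

    Hypothetical helper, not taken from the benchmark repository.
    """
    return [
        "llama-server",
        "--model", model_path,
        "--port", str(port),
        "--ctx-size", str(NATIVE_CTX),
        "--flash-attn", "on",          # flash_attn=on
        "--batch-size", "512",         # batch=512
        "--ubatch-size", "512",        # ubatch=512
        "--no-cache-prompt",           # no prompt KV cache reuse
        "--slot-prompt-similarity", "0",
    ]
```

Because only `model_path` varies, every machine in Track B receives the same argument list, which is the point of pinning.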

This is the track that answers “how many times faster or slower is the same model on a Mac than on a DGX Spark?”

Track A — Engine comparison (within a platform)

Variable: engine only. Hardware pinned.

Every available backend on each platform:

Platform	Backends
Mac	llama.cpp, Ollama, MLX
3090	llama.cpp, Ollama, vLLM
DGX Spark	llama.cpp, Ollama, vLLM
Ryzen AI	llama.cpp, Ollama, Lemonade

Interpretation scope: Track A is strictly a within-platform comparison. Putting the MLX numbers from the Mac next to the vLLM numbers from Linux and calling it an “engine comparison” is not valid.
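
One way to enforce that scope in analysis code is to rank engines only inside a platform group. The sketch below assumes result rows are dicts with hypothetical `platform`, `backend`, and `gen_tps` keys. The backend matrix is copied from the table above.

```python
from collections import defaultdict

# Backend matrix from the Track A table.
BACKENDS = {
    "Mac": ["llama.cpp", "Ollama", "MLX"],
    "3090": ["llama.cpp", "Ollama", "vLLM"],
    "DGX Spark": ["llama.cpp", "Ollama", "vLLM"],
    "Ryzen AI": ["llama.cpp", "Ollama", "Lemonade"],
}


def rank_engines(rows: list[dict]) -> dict[str, list[tuple[str, float]]]:
    """Rank backends by gen_tps separately within each platform.

    Rows from different platforms never end up in the same ranking.
    """
    by_platform: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for row in rows:
        platform, backend = row["platform"], row["backend"]
        if backend not in BACKENDS[platform]:
            raise ValueError(f"{backend} is not a Track A backend on {platform}")
        by_platform[platform].append((backend, row["gen_tps"]))
    return {
        platform: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for platform, pairs in by_platform.items()
    }
```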

Measurement tracks

Generation — output throughput

Track ID	Input	Output
gen-512	64 tok	512 tok
gen-2048	64 tok	2,048 tok
gen-4096	64 tok	4,096 tok
gen-8192	64 tok	8,192 tok

Prefill — input processing throughput

Track ID	Input	Output
prefill-1k	1,024 tok	10 tok
prefill-4k	4,096 tok	10 tok
prefill-16k	16,384 tok	10 tok
prefill-64k	65,536 tok	10 tok
prefill-128k	131,072 tok	10 tok
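
The two tables above map naturally onto a small config structure. This is a sketch; the `Track` class and its field names are illustrative, not the repository's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    track_id: str
    input_tokens: int   # target prompt length
    max_tokens: int     # requested output length


# Generation tracks: short fixed input, growing output.
GEN_TRACKS = [Track(f"gen-{n}", 64, n) for n in (512, 2048, 4096, 8192)]

# Prefill tracks: growing input, 10 output tokens.
PREFILL_TRACKS = [
    Track(f"prefill-{n // 1024}k", n, 10)
    for n in (1024, 4096, 16384, 65536, 131072)
]
```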

Experimental integrity

1. Fully disable prompt cache

llama.cpp’s --cache-prompt (on by default) and --slot-prompt-similarity (default 0.10) let the server reuse KV cache from earlier requests, which wrecks prefill numbers.

In an early run, llama.cpp 128K prefill showed TTFT of 0.21s and prefill_tps of 574,324 tok/s. That wasn’t prefill — it was KV cache reuse performance.

How to kill it:

--no-cache-prompt              # disable prompt KV cache
--slot-prompt-similarity 0     # disable prefix reuse

vLLM: --no-enable-prefix-caching
SGLang: --disable-radix-cache
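
A runner can refuse to start a measurement if the cache-disabling flags are missing from the server command. The sketch below only uses the flags listed above. The helper names and the lowercase engine keys are hypothetical.

```python
# Each entry is a group of consecutive argv items that must appear together.
CACHE_OFF = {
    "llama.cpp": [("--no-cache-prompt",), ("--slot-prompt-similarity", "0")],
    "vllm": [("--no-enable-prefix-caching",)],
    "sglang": [("--disable-radix-cache",)],
}


def _has_group(argv: list[str], group: tuple[str, ...]) -> bool:
    """True if `group` appears as a contiguous run inside argv."""
    n = len(group)
    return any(tuple(argv[i:i + n]) == group for i in range(len(argv) - n + 1))


def missing_cache_flags(engine: str, argv: list[str]) -> list[tuple[str, ...]]:
    """Return the cache-off flag groups that argv does not contain."""
    return [g for g in CACHE_OFF[engine] if not _has_group(argv, g)]
```

An empty return value means the command line disables every cache listed for that engine. Anything else is a reason to abort before measuring.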

2. Regenerate the prompt per run (nonce prefix)

Each measurement run gets a fresh random nonce injected at the head of the prompt:

[run:8eovt3an7ge9lbtj96n55f57reqz92gd] The history of computing...

This guarantees:

- warmup and measurement see different prompts
- consecutive runs within a track see different prompts
- no prefix sharing across tracks
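
A nonce prefix in the format shown above could be generated as follows. This is a sketch: the 32-character lowercase alphanumeric alphabet is inferred from the example nonce, and `with_nonce` is a hypothetical name.

```python
import secrets
import string

NONCE_ALPHABET = string.ascii_lowercase + string.digits
NONCE_LEN = 32  # same length as the example nonce


def with_nonce(base_prompt: str) -> str:
    """Prefix the prompt with a fresh random run tag."""
    nonce = "".join(secrets.choice(NONCE_ALPHABET) for _ in range(NONCE_LEN))
    return f"[run:{nonce}] {base_prompt}"
```

Since the tag sits at the very start of the prompt, no two runs share a prefix, so prefix-matching caches have nothing to reuse. The tag itself adds a few tokens, so input token counts are best taken after it is attached.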

3. Cold prefill (server restart)

The server process is restarted between prefill tracks. That fully resets the KV cache, the GPU context (CUDA on the NVIDIA machines), and allocator state left over from the previous track.
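
A restart between tracks can be sketched with the standard library alone. The helper names are hypothetical, and the health URL is an assumption: llama-server exposes a `/health` endpoint that returns 200 once the model is loaded, while other backends may need a different readiness probe.

```python
import subprocess
import time
import urllib.error
import urllib.request


def stop_server(proc: subprocess.Popen, grace_s: float = 10.0) -> None:
    """Terminate the server, escalating to kill if it does not exit in time."""
    if proc.poll() is not None:
        return  # already exited
    proc.terminate()
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


def wait_ready(url: str, timeout_s: float = 600.0, poll_s: float = 1.0) -> bool:
    """Poll a health URL until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or still loading the model
        time.sleep(poll_s)
    return False


def restart_server(proc, argv: list[str], health_url: str) -> subprocess.Popen:
    """Stop the old process (if any) and start a fresh one for the next track."""
    if proc is not None:
        stop_server(proc)
    new_proc = subprocess.Popen(argv)
    if not wait_ready(health_url):
        stop_server(new_proc)
        raise RuntimeError("server did not become ready")
    return new_proc
```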

4. Enforce native context

Use the model’s native context window (Qwen3.5: 256K) as-is. On OOM, don’t shrink the context; record it as a failure. That’s how you get an honest answer to “can this hardware actually run 27B with a 256K context?”

5. Randomize execution order

Backend, model, and track order are all shuffled per run. Fixed order introduces bias from thermals, allocator state, and cache warmth.
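
A seeded shuffle keeps the order random but replayable. The sketch below assumes a nested shuffle (backends first, then models within a backend, then tracks within a model), which is one reading of "shuffled per run"; the actual runner may shuffle differently. `shuffled_plan` is a hypothetical name.

```python
import random


def shuffled_plan(backends, models, tracks, seed: int):
    """Return every (backend, model, track) combination in shuffled order.

    Each level is shuffled independently, so a model is loaded once per
    backend rather than once per track. Log the seed to replay the order.
    """
    rng = random.Random(seed)
    plan = []
    for backend in rng.sample(backends, k=len(backends)):
        for model in rng.sample(models, k=len(models)):
            for track in rng.sample(tracks, k=len(tracks)):
                plan.append((backend, model, track))
    return plan
```

Drawing a new seed for each benchmark run and writing it next to the results makes it possible to check later whether an outlier correlates with position in the order.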

6. Log OOM and failures

Model load failures, context overflows, server crashes — everything lands in CSV as skip:load_fail, skip:ctx_exceeded, or failed. Knowing which combinations fail is as informative as the ones that succeed.
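
Failure logging can go through the same CSV writer as successful runs. In this sketch the three failure labels come from the text above, while the `ok` status, the column names, and the helper names are assumptions.

```python
import csv

FIELDS = ["backend", "model", "track", "status", "detail"]
# "ok" is an assumed label for successful runs; the others are from the post.
STATUSES = {"ok", "skip:load_fail", "skip:ctx_exceeded", "failed"}


def open_log(fileobj) -> csv.DictWriter:
    """Create a writer and emit the header row."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    return writer


def log_result(writer, backend, model, track, status, detail=""):
    """Write one row; unknown status labels are rejected, not silently stored."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    writer.writerow({
        "backend": backend, "model": model, "track": track,
        "status": status, "detail": detail,
    })
```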

Measurement protocol

Item	Value
Warmup	1 run (separate prompt, excluded from results)
Measurement	5 runs, median aggregation
Inter-run wait	5s
Inter-track wait	60s
Inter-model wait	120s
Inter-backend wait	60s
Thermal guard	60s cooldown above 85°C
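
The per-track part of this protocol can be sketched as a small loop. `measure_once` and `read_temp_c` are hypothetical callables supplied by the caller, and repeating the cooldown until the temperature drops is an assumption (the table only says "60s cooldown above 85°C").

```python
import statistics
import time
from typing import Callable


def run_track(
    measure_once: Callable[[], float],
    warmup_runs: int = 1,
    measured_runs: int = 5,
    inter_run_wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Run warmup plus measured runs and return the median of the measured ones."""
    for _ in range(warmup_runs):
        measure_once()  # warmup result is discarded
        sleep(inter_run_wait_s)
    samples = []
    for i in range(measured_runs):
        samples.append(measure_once())
        if i < measured_runs - 1:
            sleep(inter_run_wait_s)
    return statistics.median(samples)


def thermal_guard(
    read_temp_c: Callable[[], float],
    limit_c: float = 85.0,
    cooldown_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Sleep in cooldown_s steps while the reported temperature exceeds limit_c."""
    while read_temp_c() > limit_c:
        sleep(cooldown_s)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.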

Key metrics

- Gen TPS: generation tokens/sec, measured from the first token (the end of TTFT) through the last token.
- TTFT: time to first token (ms), measured client-side.
- Prefill TPS: input_tokens / TTFT_seconds, using the same client-side definition for every backend.
- Hit Rate: output_tokens / max_tokens, the generation completion rate.
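
These definitions reduce to a few lines of arithmetic on client-side timestamps. In the sketch below the `RunTiming` fields are hypothetical, and excluding the first token from the Gen TPS count is an assumption: the definitions above do not say whether the token that ends TTFT is counted.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    input_tokens: int
    output_tokens: int
    max_tokens: int
    ttft_s: float   # request sent -> first token received (client-side)
    total_s: float  # request sent -> last token received (client-side)


def metrics(t: RunTiming) -> dict[str, float]:
    decode_s = t.total_s - t.ttft_s
    # Assumption: the first token belongs to prefill, so it is not counted here.
    decode_tokens = t.output_tokens - 1
    gen_tps = decode_tokens / decode_s if decode_s > 0 and decode_tokens > 0 else 0.0
    return {
        "ttft_ms": t.ttft_s * 1000.0,
        "prefill_tps": t.input_tokens / t.ttft_s,
        "gen_tps": gen_tps,
        "hit_rate": t.output_tokens / t.max_tokens,
    }
```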

Known limitations

- Weight format differences: In Track A, MLX (mlx 4-bit) and llama.cpp (GGUF Q4_K_M) are at the same nominal quantization level but have different implementations. This is closer to an engine+weights-package comparison than a pure engine comparison.
- Ollama’s structural TTFT disadvantage: Ollama pre-allocates the full context, so KV cache allocation time lands inside TTFT. At 256K context, that overhead runs into tens of seconds.
- Approximate input token counts: We target exact tokenizer-based counts, but fall back to a character-count approximation (3.8 chars/token) if the tokenizer fails to load.
- Output quality not validated: hit_rate is a length-completion metric, not a quality metric. Repetition and loops score well on hit_rate too.
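
The token-count fallback described in the list above can be sketched as follows. The `tokenizer.encode` interface is an assumption (any object whose `encode` returns a sequence of token ids), and passing `None` stands in for a tokenizer that failed to load.

```python
CHARS_PER_TOKEN = 3.8  # fallback ratio stated above


def count_tokens(text: str, tokenizer=None) -> tuple[int, bool]:
    """Return (token_count, is_exact).

    With a tokenizer the count is exact; without one it is estimated from
    the character count, and is_exact is False so the row can be flagged.
    """
    if tokenizer is not None:
        return len(tokenizer.encode(text)), True
    return round(len(text) / CHARS_PER_TOKEN), False
```

Returning the `is_exact` flag alongside the count lets the results record which prefill numbers rest on an approximate input length.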

Next

Part 2, “Performance results across 4 platforms and 5 engines”, digs into the actual measurement data.

Code and data

The full benchmark code and raw CSVs are open source:

Code: baem1n/llm-bench
Runner: src/runner.py (orchestrator, v3)
Raw CSV (consolidated per device): results/consolidated/
Reproduce: uv run python -m src.runner --config config.yaml