Boundary Labs / Inference Research / NVIDIA Blackwell SM_120

Inference Benchmarks

Complete record of inference optimization research conducted on consumer NVIDIA Blackwell hardware. Systematic autoresearch loops, controlled baselines, logged experiments. All numbers sourced from production hardware and logged with receipts in the public inference-research repo.

~130 tok/s peak (Ornith-35B GGUF, llama.cpp, 2026-06-25)

~101 tok/s warm (AEON Ornith NVFP4, vLLM 0.23, 2026-06-27)

215+ experiments logged

14 model families tested

Research Hardware

Unless noted otherwise, inference experiments run on the cha0tiktower inference node: a single machine with two consumer-tier Blackwell GPUs, no NVLink, PCIe interconnect only. Tensor parallelism runs over PCIe x8+x4 Gen5/Gen4. Full specs at stack.html.

RTX 5060 Ti

GPU (×2)

Blackwell SM_120 · 16 GB GDDR7 each

32 GB

VRAM (TP=2 combined)

GDDR7 · PCIe interconnect

CUDA 13.0.3

Runtime

NVFP4-native · Kernel 6.17

vLLM + llama.cpp

Inference stacks

vLLM 0.19.2rc1 · llama.cpp b5189

SM_120 constraint context: NVIDIA Blackwell consumer GPUs (SM_120) have limited ecosystem support compared to datacenter-class A100/H100. This research specifically characterizes the SM_120 performance envelope — fast CUDA kernel paths, NVFP4 quantization behavior, and TP=2 over PCIe — to establish what is achievable on first-generation Blackwell consumer hardware.

The Consumer Blackwell Envelope

Two RTX 5060 Ti cards serve a 27B in production — Qwen3.6-27B INT4, ~97 tokens a second warm. What gets it there isn't the compute and it isn't VRAM bandwidth — it's two switches most people never flip and one wall NVIDIA built on purpose.

The wall: consumer cards can't do GPU-to-GPU P2P. NVIDIA disables peer access in the driver, so every tensor-parallel all-reduce detours through host RAM instead of going card to card. We measured it — 20.1% of the decode budget, gone. 128 all-reduces a token, 25 microseconds each, pure latency. Not a bandwidth problem. PCIe moves about 250 MB/s during decode, a few percent of even the x4 slot. You don't need a workstation board. You need the lockout gone.
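The latency arithmetic behind that claim can be checked in a few lines. This is a back-of-envelope sketch using only the figures quoted above. Two things are assumptions rather than measurements from this system: the theoretical PCIe 4.0 x4 bandwidth, and reading the 20.1% share as consisting entirely of the 128 × 25 µs all-reduce cost.

```python
# Figures quoted in the text above.
ALLREDUCES_PER_TOKEN = 128
ALLREDUCE_LATENCY_S = 25e-6      # 25 microseconds each
DECODE_SHARE = 0.201             # measured share of the decode budget
DECODE_PCIE_TRAFFIC_BPS = 250e6  # ~250 MB/s observed during decode

# Assumption (not measured on this system): theoretical PCIe 4.0 x4
# bandwidth, 16 GT/s x 4 lanes with 128b/130b encoding, in bytes/s.
PCIE_GEN4_X4_BPS = 16e9 * 4 * (128 / 130) / 8

allreduce_time_s = ALLREDUCES_PER_TOKEN * ALLREDUCE_LATENCY_S
# Assumption: the all-reduce latency accounts for the whole 20.1% share.
implied_step_s = allreduce_time_s / DECODE_SHARE
link_utilization = DECODE_PCIE_TRAFFIC_BPS / PCIE_GEN4_X4_BPS

print(f"all-reduce latency per token: {allreduce_time_s * 1e3:.1f} ms")
print(f"implied decode step time:     {implied_step_s * 1e3:.1f} ms")
print(f"x4 link utilization:          {link_utilization:.1%}")
```

Roughly 3 ms of pure latency per token against a link that is only about 3% utilized is what makes this a latency problem rather than a bandwidth problem.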

The FP4 silicon is real — the recipe decides whether you touch it. SM_120 has native FP4 tensor cores and vLLM's kernel runs on them. But it only fires for 4-bit activations. The higher-quality W4A16 quants fall back to emulation and eat a 1.7× penalty. The hardware isn't the ceiling. The quantization recipe is.

The two switches: MTP speculative decoding buys +59%. CUDA graphs buy another +60–75%. Turn both off and the box runs at roughly 40% of its number. Most published consumer-Blackwell benchmarks never turned them on — which is why they're low.

The cards aren't the bottleneck. The interconnect and the loader are. Nobody else is measuring this, so here are the receipts.

Measurement note: the lever gains (+59% spec, +60–75% graphs) are single-stream temp-0.3 600-token profiling runs — a distinct condition from the 97 t/s warm greedy production figure. Both are real; they measure different things. Dated 2026-07-10; receipts in the inference-research repo.
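To see how the two switches relate to the "roughly 40%" figure, the sketch below compounds the quoted gains. It assumes the two gains multiply independently, which the source does not state; the measured interaction could differ.

```python
MTP_GAIN = 0.59             # +59% from MTP speculative decoding
GRAPH_GAINS = (0.60, 0.75)  # +60-75% from CUDA graphs

def fraction_with_both_off(mtp_gain: float, graph_gain: float) -> float:
    """Throughput with both levers off, as a fraction of both-on throughput.

    Assumes the gains compound multiplicatively (an assumption, see above).
    """
    return 1.0 / ((1.0 + mtp_gain) * (1.0 + graph_gain))

for graph_gain in GRAPH_GAINS:
    frac = fraction_with_both_off(MTP_GAIN, graph_gain)
    print(f"graphs +{graph_gain:.0%}: both off = {frac:.1%} of the tuned figure")
```

Under that assumption the untuned box lands at about 36–39% of the tuned number, in line with the "roughly 40%" quoted above.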

Current Production Deployments

Inference configurations on cha0tiktower as of 2026-07-08. All clients route through tower:8010 local proxy; Genesis (Qwen3.6-27B INT4) is the current production backend, verified live. Switch command: proxy-switch genesis|nemotron|aeon|ornith|openrouter.

Configuration	Quantization	Generation speed	Context	Framework	VRAM	Status
Qwen3.6-27B GPTQ INT4 + MTP n=3 (Genesis)	AutoRound INT4	~97 t/s warm (measured 2026-07-08)	65K	vLLM + Genesis patches	~28 GB (both GPUs)	active — production (restored 2026-07-03)
Ornith-1.0-35B-AEON-Ultimate NVFP4 (Uncensored)	NVFP4A16 (FP4 experts, BF16 attn)	~101 t/s warm / ~124 t/s peak	131K	vLLM 0.23 + SM_120 Marlin patch + fp8 KV	~23 GB (fp8 KV, both GPUs)	trial ended 2026-07-03
Nemotron 3 Nano 30B (MoE hybrid)	Q4_K_M	117.34 t/s med / 117.60 peak	32K	llama.cpp cuda128-clean	24.9 GB (both GPUs)	standby — dated peak 2026-05-20
Qwen3.6-27B PRISM-PRO-DQ (llama.cpp, MTP n=2)	DQ (GGUF)	39.30 t/s	32K	llama.cpp	~28 GB	research eval only — 2026-05-19
AEON (Qwen 3-27B NVFP4)	NVFP4 (NVIDIA ModelOpt)	~69 t/s	122K	vLLM 0.19.2rc1	~20 GB	stopped
Llama 3.3 70B	Q4_0 / 60 GPU layers	55.6 t/s	65K	llama.cpp	~34 GB (OOM ceiling)	historical — VRAM limit
Gemma 4 26B (CPU baseline)	Q4_K_M (CPU-only)	10.4–11.7 t/s	32K	llama.cpp	—	pre-tower era

2026-07-03 backend switch: Ornith production trial ended; Genesis (Qwen3.6-27B INT4, vLLM) restored as the default production backend, confirmed live at ~97 t/s warm on 2026-07-08. The Ornith campaign numbers below stand as dated results.

2026-06-27 backend switch: Ornith-1.0-35B-AEON-Ultimate-Uncensored NVFP4 became the production backend. Deployed to fleet proxy (:8010) on 2026-06-26 as GGUF Q4_K_M (llama.cpp, layer-split dual GPU, ~127–130 t/s peak), then upgraded 2026-06-27 to the AEON NVFP4 variant on vLLM 0.23. Key finding: --kv-cache-dtype fp8 is required for the full 131K context window — AEON's original DGX recipe uses BF16 KV (vision tower forces it), which capped context at ~32K on tower. With --language-model-only --kv-cache-dtype fp8, full 131K context loads cleanly at 338K KV token budget. LangChain tool-use eval (84.8%) was measured on the GGUF variant and re-run on the NVFP4 variant 2026-06-29 — overall held at 84.8% (56/66), with a slightly shifted per-task profile (typewriter_1 19/20 vs 20/20, relational 7/8 vs 6/8) (receipt).

2026-05-20 backend switch: Genesis (Qwen3.6-27B vLLM GPTQ) replaced by Nemotron 3 Nano 30B Q4_K_M via llama.cpp. Nemotron is a Mamba/SSM hybrid MoE architecture — 30B total parameters, ~3B active per token. New peak: 117.60 t/s. The cuda128-clean llama.cpp build is required; the cuda13 build crashes with a CUDA MMQ error (ggml_cuda_mul_mat_q: invalid argument) on first generation after the prompt eval on SM_120. This is a Blackwell SM_120 + cuda13 MMQ kernel incompatibility — not a model issue.

2026-06-06 genesis restoration: vLLM genesis brought back to production alongside Nemotron after resolving three blocking issues: (1) flash_attn.ops module missing from vllm-env — installed flash-attn 2.8.3 with CUDA 13.0 headers; (2) nemotron.service Restart=always kept seizing port 8022 after every stop — fixed with a Restart=no systemd drop-in override; (3) TimeoutStartSec absent from service file — model killed at 90s before completing load. Config upgrades applied: fp8 KV cache (halves VRAM footprint, enables 65K context), qwen3_coder tool-call parser (fixes silent drop bug), explicit SM_120 FlashInfer routing env vars, FlashInfer workspace reduced 413→256 MiB to clear OOM on first MTP inference. Stable at 88 t/s warm. Both Nemotron (117 t/s MoE) and Genesis (88 t/s dense) now available as production backends via proxy:8010.

vLLM 0.20.2 upgrade incident (2026-05-10): Upgrade from 0.19.2rc1 attempted after the DFlash investigation confirmed it does not apply to Genesis (Genesis uses self-MTP, not an external draft model). vLLM 0.20.2 failed with an NCCL crash on multi-GPU init; SM_120 TP=2 is unsupported in 0.20.x. Rolled back to 0.19.2rc1 via tarball restore. Complication: a kernel upgrade to 6.17.0-23 during the reboot locked out the NVIDIA driver (modules were only built for 6.17.0-22). Fixed by pinning the GRUB default. Post-recovery genesis throughput: 71.4 t/s (11.6% below the pre-upgrade 80.8 t/s). Re-optimization pending.

Quality Evaluation

Speed is necessary but not sufficient. The quality suites below validate the production backends across capability dimensions.

Quant quality — NVFP4 W4A16 vs INT4 (2026-07-10)

The higher-quality quant is measurably higher-quality, and now there's a scorecard. Fifteen domain-expert scenarios (logistics, traffic, restaurants), scored blind by two judges. The NVFP4 W4A16 build beat the INT4 production model 4.13 to 3.73. It's the only model of three tested that didn't scramble the menu-engineering matrix, and the only one that refused to promise a severe-allergy client a "guaranteed nut-free" meal from a shared kitchen. A third the speed. That's the trade: latency for judgment. Both judges agreed on direction (in-session Claude +0.40, Haiku 4.5 automated +0.20). Scorecard.

Qwen3.6-27B INT4 Genesis — 29-Test Suite (2026-06-06)

Comprehensive coding/math/instruction/tool/knowledge suite run against vLLM genesis with thinking disabled (enable_thinking=false) for latency. MTP n=3 speculative decoding active. Throughput averaged across 29 varied prompts.

89.7%

overall (26/29)

3 benchmark bugs corrected

100%

instruction following

5/5 probes — perfect

100%

tool calling

4/4 — select, args, no-op

63.9

tok/s avg warm

short: 68 · medium: 59.7

Category pass rates

Instruction Following: 5/5 (100%)
Tool Calling: 4/4 (100%)
Knowledge: 4/4 (100%)
Coding (HumanEval-style): 9/10 (90%)
Math / Reasoning: 4/6 (67%)

Math gap is intentional: Published Qwen3.6-27B benchmarks (AIME 2026: 94.1%, GPQA Diamond: 87.8%) are measured with thinking enabled. Running with enable_thinking=false for production latency sacrifices deep reasoning chains. The 2 remaining math failures (GSM8K arithmetic, sum-of-multiples) are model errors in no-think mode — not hardware or quantization artifacts. For coding, tool calling, and instruction following the model is production-ready.

Nemotron 3 Nano 30B — 25-Probe Suite (2026-05-20)

A curated quality suite validated the then-production Nemotron backend (2026-05-20 campaign; now standby) across five capability dimensions. Each response scored 1–5 by Claude Haiku acting as judge, using per-probe rubrics with expected outputs. Run against the proxy at 117 t/s.

4.84

overall score / 5.00

25 probes · 5 categories

5.00

reasoning

all 5 probes perfect

5.00

factuality

myths debunked · arithmetic correct

4.60

coding / agent tasks

one token-limit truncation

Category scores (1–5 scale)

Reasoning: 5.00 / 5
Factuality: 5.00 / 5
Instruction Following: 5.00 / 5
Coding: 4.60 / 5
Agent Tasks: 4.60 / 5

Probe	Category	Score	Result
r1–r5 (all)	Reasoning	5/5 each	Syllogism, bat/ball puzzle, sequences, fallacy detection, pill timing — all perfect.
f1–f5 (all)	Factuality	5/5 each	Great Wall myth, 347×28 arithmetic, Canberra capital, 10% brain myth, Berlin Wall year — all correct.
i1–i4	Instruction Following	5/5 each	Exact list count, raw JSON, two-sentence summary, forbidden-word avoidance — all perfect.
i5	Instruction Following	5/5	Exact structured format (DIFFICULTY / TIME / STEPS). Transient empty response on first query; re-queried correctly.
c1	Coding	3/5	Vowel-counter function correct; docstring quality below rubric threshold.
c2–c5	Coding	5/5 each	Bug identification, palindrome with all normalizations, max() one-liner, Fibonacci explanation — all perfect.
a1–a3, a5	Agent Tasks	5/5 each	JSON extraction, classification array, professional status update, graceful error — all correct.
a4	Agent Tasks	3/5	5-step Python project setup: steps 1–4 correct; step 5 truncated at max_tokens limit (512).

Methodology: 25 hand-crafted probes across 5 categories. Each probe includes expected behavior and a 1–5 scoring rubric. Responses scored by claude-haiku-4-5-20251001 acting as judge. The two 3/5 scores are both partial-credit results (correct output, minor defect) — zero probes failed outright. The one transient empty response (i5) resolved correctly on immediate retry. Source: randomchaos7800-hub/model-eval.
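The aggregation from per-probe scores to the category and overall figures can be reproduced directly from the table. This is a minimal sketch that assumes the overall score is the unweighted mean over all 25 probes; the per-probe values are the ones listed above, and the dictionary layout is purely illustrative.

```python
from statistics import mean

# Per-probe judge scores (1-5) as listed in the probe table.
SCORES = {
    "Reasoning":             [5, 5, 5, 5, 5],   # r1-r5
    "Factuality":            [5, 5, 5, 5, 5],   # f1-f5
    "Instruction Following": [5, 5, 5, 5, 5],   # i1-i5
    "Coding":                [3, 5, 5, 5, 5],   # c1 scored 3/5
    "Agent Tasks":           [5, 5, 5, 3, 5],   # a4 scored 3/5
}

category_means = {name: mean(vals) for name, vals in SCORES.items()}
overall = mean(score for vals in SCORES.values() for score in vals)

for name, value in category_means.items():
    print(f"{name:<22} {value:.2f} / 5")
print(f"{'Overall':<22} {overall:.2f} / 5")
```

Because every category has five probes, the mean of the category means and the mean over all probes coincide here.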

Ornith-1.0-35B vs DeepSeek V3.2 — LangChain Tool-Use Evaluation (2026-06-25)

Head-to-head on the four canonical LangChain tool-use tasks from their Benchmarking Agent Tool Use post: two typewriter variants, multiverse math, and relational data. Ornith-1.0-35B (GGUF Q4_K_M) served directly on port 8030 via llama.cpp (build-cuda120-nographs), layer-split across both RTX 5060 Ti 16GB cards, 131k context, flash attention, q4 KV cache. DeepSeek V3.2 via OpenRouter, routed through the tower fleet proxy at :8010 — same hardware, same day, same prompts. Temperature 0. 30-turn hard cap per case. Full agent loop: model calls tool, result fed back, repeat until stop or cap. 66 total cases.

56/66

Ornith overall (84.8%)

local · ~3s avg latency

57/66

DeepSeek V3.2 overall (86.4%)

OpenRouter · ~22s avg latency

Ornith

Typewriter (1-tool) winner

20/20 vs 11/20 — DeepSeek loops

DeepSeek

Multiverse math winner

18/18 vs 10/18 — Ornith reasons instead

Task	Ornith (local :8030)	DeepSeek V3.2 (OR :8010)	Avg latency	What it tests
Typewriter, 1 tool	20/20 (100%)	11/20 (55%)	Ornith 3.6s / DeepSeek 41.5s	Sequential `type_letter` calls — stop at the right count
Typewriter, 26 tools	20/20 (100%)	20/20 (100%)	Ornith 2.8s / DeepSeek 12.3s	One parameterless function per letter — pick in order
Multiverse Math	10/18 (56%)	18/18 (100%)	Ornith 3.6s / DeepSeek 11.1s	Altered arithmetic — must use tools, ignore pretrained math
Relational Data	6/8 (75%)	8/8 (100%)	Ornith 2.7s / DeepSeek 22.6s	Multi-hop DB-style tool chains over three fake tables
Overall	56/66 (84.8%)	57/66 (86.4%)	~3s / ~22s	DeepSeek wins by one point; profiles are inverted

Figure: task pass rate, Ornith vs DeepSeek V3.2 (bar chart of the per-task pass rates listed in the table above).

Typewriter failure analysis (DeepSeek V3.2, 9 failures confirmed from JSON): The 1-tool typewriter test exposes a specific failure mode: not knowing when to stop. DeepSeek looped to the 30-turn cap on seven words: "a" (typed 30 a's), "aa" (30 a's), "house" (typed it five times), "horse" (five times), "school" (five times), "computer" (looped), and "dictionary" (looped). On "information" it went completely off the rails into unrelated text. On "aaaa" it typed five characters instead of four. Average latency on failures: 70s, nearly the full 30-turn ceiling. LangChain's original 2023 blog documented GPT-4 failing on this same task; DeepSeek V3.2 shows the identical pattern. On the 26-tool variant, where each letter has its own named, parameterless function, both models scored 20/20. Selecting from a named set is a different cognitive task than deciding when to stop calling the same function.

Ornith failure analysis (10 failures: 8 multiverse math, 2 relational): All multiverse math failures are multi-step composition or pretrained-knowledge override. Ornith emits long reasoning_content traces before acting. On single-hop problems this still terminates correctly; on 3+ hop chains (ecoli division, 1+2+3+4+5 via add function, chained logs) it sometimes determines the answer from training and skips tool calls entirely. Two relational failures: one llama-server 500 on a quoted entity name ("Frank The Cat", malformed tool-call JSON) and one reasoning trace with no tool calls on a multi-hop weather lookup. Mitigation: force tool-use system prompt, raise max_tokens to ≥1024 on multi-hop tasks, sanitize entity names in tool schemas.

Routing implication: The overall scores are nearly identical (84.8% vs 86.4%) but the profiles are inverted. Ornith for high-frequency sequential tool agents — local, no API cost, 3s latency, no looping on stop conditions. DeepSeek for multi-hop reasoning chains and relational lookups where correctness and tool-rule compliance matter more than speed. Neither model wins everything. Route by task shape, not model religion. Receipts: verdict · DeepSeek run JSON · harness: langchain-brutal-eval.py.
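The "route by task shape" rule can be derived mechanically from the results table. The sketch below is illustrative only: the task keys and the tie-break rule (prefer lower average latency) are assumptions made for this example, not part of the evaluation harness.

```python
# (passes, avg latency in seconds) per model, from the results table above.
# Task keys are hypothetical labels chosen for this example.
RESULTS = {
    "typewriter_1_tool":  {"cases": 20, "ornith": (20, 3.6), "deepseek": (11, 41.5)},
    "typewriter_26_tool": {"cases": 20, "ornith": (20, 2.8), "deepseek": (20, 12.3)},
    "multiverse_math":    {"cases": 18, "ornith": (10, 3.6), "deepseek": (18, 11.1)},
    "relational_data":    {"cases": 8,  "ornith": (6, 2.7),  "deepseek": (8, 22.6)},
}
MODELS = ("ornith", "deepseek")

def pick_backend(task: str) -> str:
    """Prefer the higher pass count; break ties on lower average latency."""
    row = RESULTS[task]
    return max(MODELS, key=lambda m: (row[m][0], -row[m][1]))

routing = {task: pick_backend(task) for task in RESULTS}
totals = {m: sum(row[m][0] for row in RESULTS.values()) for m in MODELS}
total_cases = sum(row["cases"] for row in RESULTS.values())

for task, backend in routing.items():
    print(f"{task:<20} -> {backend}")
print(totals, "of", total_cases)
```

With these inputs both typewriter variants route to Ornith (the 26-tool tie is broken on latency), while multiverse math and relational data route to DeepSeek.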

Ornith-1.0-35B — Speed Suite, HumanEval & Context Scaling (2026-06-25–27)

Throughput benchmarks across task categories, HumanEval code completion, and context scaling from 1K to 32K prompt tokens. Speed suite run on June 25 (GGUF Q4_K_M direct on :8030, dual GPU) and repeated June 27 (production proxy :8010 warm, after NVFP4 upgrade). Context scaling and canonical 512-gen both via prod proxy :8010 on June 27 (NVFP4 variant).

~130

tok/s peak (GGUF, short ctx)

direct :8030 · June 25

~124

tok/s peak (NVFP4 prod, code)

via :8010 warm · June 27

43.9%

HumanEval pass@1

72/164 · markdown fence leakage

131K

max context

fp8 KV + language-model-only

Task category	GGUF direct :8030 (June 25, gen t/s)	NVFP4 prod :8010 (June 27, gen t/s)	TTFT (June 27)
Short math	129.7 t/s	93.9 t/s	55 ms
Medium attention	127.2 t/s	92.3 t/s	117 ms
Code / CSV	126.9 t/s	123.8 t/s	48 ms
Agentic JSON	127.5 t/s	124.1 t/s	110 ms

GGUF vs NVFP4 throughput: GGUF Q4_K_M direct on :8030 runs faster on short-context math/attention tasks (~130 vs ~93 t/s), while NVFP4 on vLLM 0.23 via proxy reaches near parity on code and JSON tasks (about 3 t/s behind, ~124 vs ~127 t/s). Any NVFP4 advantage would be expected at longer outputs and higher concurrency, where vLLM's scheduler amortizes batch setup cost differently than llama.cpp single-stream; this suite did not measure that. Both variants use the same hardware (dual RTX 5060 Ti) and context window (131K).

Context Scaling — Ornith NVFP4 via :8010 (2026-06-27)

Throughput at increasing prompt lengths. Each point: 128 completion tokens, median of multiple runs.

Prompt size (tokens)	Gen t/s (median)	TTFT (ms, median)	Elapsed
~1K (1,227 tok)	87.2 t/s	440 ms	1.75 s
~4K (5,307 tok)	84.0 t/s	1,747 ms	3.45 s
~8K (10,827 tok)	23.1 t/s	3,494 ms	8.09 s
~16K (21,627 tok)	21.6 t/s	4,918 ms	9.12 s
~32K (43,227 tok)	46.2 t/s	9,825 ms	7.48 s

Context scaling pattern: Generation throughput holds well at 1K–4K (84–87 t/s) then drops sharply at 8K–16K (21–23 t/s), with a partial recovery at 32K (46 t/s). The 8K–16K trough is likely a KV cache scheduling boundary in vLLM 0.23 on the SM_120 hybrid architecture (GatedDeltaNet attention + fp8 KV interaction). The 32K recovery may reflect larger prefill batches amortizing decode overhead. Requires further investigation. For agent workloads with short context, full 87+ t/s throughput is available.
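Two derived quantities help when reading the table: an approximate prefill rate and the elapsed time implied by the other columns. The sketch below uses only the table values. It treats prefill rate as prompt tokens divided by TTFT, which ignores queueing and proxy overhead, so that part is an approximation.

```python
GEN_TOKENS = 128  # completion tokens per data point

# (label, prompt tokens, gen t/s, TTFT ms, reported elapsed s) from the table.
ROWS = [
    ("~1K",   1_227, 87.2,   440, 1.75),
    ("~4K",   5_307, 84.0, 1_747, 3.45),
    ("~8K",  10_827, 23.1, 3_494, 8.09),
    ("~16K", 21_627, 21.6, 4_918, 9.12),
    ("~32K", 43_227, 46.2, 9_825, 7.48),
]

def derive(prompt_tokens: int, gen_tps: float, ttft_ms: float, elapsed_s: float) -> dict:
    ttft_s = ttft_ms / 1000
    return {
        # Approximation: treats all of TTFT as prefill time.
        "prefill_tps": prompt_tokens / ttft_s,
        # Elapsed time if TTFT and generation simply add up.
        "implied_elapsed_s": ttft_s + GEN_TOKENS / gen_tps,
        # Flag rows whose reported elapsed time is shorter than their TTFT.
        "elapsed_below_ttft": elapsed_s < ttft_s,
    }

derived = {label: derive(*values) for label, *values in ROWS}
for label, d in derived.items():
    flag = "  <- elapsed < TTFT" if d["elapsed_below_ttft"] else ""
    print(f"{label:>5}: prefill ~{d['prefill_tps']:.0f} t/s, "
          f"implied elapsed {d['implied_elapsed_s']:.2f} s{flag}")
```

A flagged row suggests the columns are medians taken independently per metric rather than from a single run; that reading is an assumption and worth confirming against the raw logs.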

HumanEval (GGUF Q4_K_M, 2026-06-25): 72/164 (43.9% pass@1). Primary failure mode: markdown code fences in completions (```python prefix in the returned body) — the GGUF chat template emits fences even on raw completion tasks, causing indentation errors and SyntaxError when the harness strips the first line. Functional reasoning is stronger than the score implies. Confirmed by partial-credit review: logic is correct in most failures; extraction artifacts cause the failures, not semantic errors. Receipts: humaneval-results.json.
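One way to remove this extraction artifact is to strip a wrapping markdown fence before the completion reaches the harness. The function below is a hypothetical sketch, not the harness's actual extraction code; it only handles a single fence wrapping the whole completion and preserves body indentation.

```python
FENCE = "`" * 3  # built programmatically so this example nests inside a fence

def strip_markdown_fence(completion: str) -> str:
    """Drop one leading and one trailing fence line, keeping indentation."""
    lines = completion.strip("\n").splitlines()
    if lines and lines[0].lstrip().startswith(FENCE):
        lines = lines[1:]           # opening fence, with or without a language tag
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]          # closing fence
    return "\n".join(lines)

raw = FENCE + "python\n    return a + b\n" + FENCE
print(repr(strip_markdown_fence(raw)))
```

Unfenced completions pass through unchanged apart from leading and trailing newlines, so the same function can run on every response.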

Mixture-of-Experts on Consumer Blackwell

The pitch: sparse activation should let a big MoE model match a small dense model's speed without dense's quantization tax on quality. Six checkpoints, four quantizations, two runtimes, three unrelated base-model families later, the speed half of that pitch is true. The quality half isn't. Every MoE checkpoint served through llama.cpp beat Genesis on raw throughput (the one vLLM GPTQ-Int4 run did not). Every one of them lost to it on quality, by a margin the dense-side quantization spread never came close to.

Model	Runner	Speed	Quality (/5)
GPT-OSS-20B (native MXFP4)	llama.cpp, all-GPU	133.05 t/s	2.20
Nemotron-3-Nano-30B-A3B	llama.cpp, all-GPU	123.4 t/s	2.80
Qwen3-30B-A3B (Q3_K_M)	llama.cpp, all-GPU	121.5 t/s	1.87
Qwen3.6-35B-A3B	llama.cpp, all-GPU	100.54 t/s	2.87*
— Genesis (dense, INT4+MTP) —	vLLM, production	97.4 t/s	3.73
Qwen3-30B-A3B (GPTQ-Int4)	vLLM	41.66–42.41 t/s	2.67

*Qwen3.6-35B-A3B's 2.87 has a real confound — 10 of 15 probe responses truncated at the token budget that run used. Every later phase raised the budget; this one wasn't rerun. Treat with caution.

The starkest result isn't the average, it's the inversion. Qwen3-30B-A3B's fastest configuration (Q3_K_M, everything on-GPU, 121.5 t/s) is also its worst-quality configuration (1.87/5), and that 1.87 is the lowest quality score of the whole investigation. Nothing on the dense side did that. Speed and quantization aggressiveness traded off gently there; here they inverted.

Two scenarios broke every model the same way. A signalized intersection with a "yellow trap" left-turn conflict, and a roundabout redesign that ignores rail-crossing queue-spillback risk — both failed identically across Qwen, NVIDIA/Nemotron, and OpenAI's GPT-OSS. Three unrelated base-model families, four quantizations, two runtimes, one shared failure shape: a long, confident, well-organized answer that fabricates a plausible wrong mechanism and lands backwards on the one constraint that actually mattered. GPT-OSS-20B — the smallest model tested — failed hardest: it flatly asserted the roundabout's rail-crossing interaction was "None (independent)," the most confident wrong answer of the whole run, and separately told an operator to reheat and re-serve temperature-abused soup. That's not a benchmark score. That's a real failure mode, and it doesn't show up on the dense side at this parameter count.

Read on this: four quantizations, two runtimes, and three base-model families all failing the identical scenario the identical way isn't a per-checkpoint bug. It reads as a genuine capability floor at the 20–35B parameter class, and MoE routing doesn't route around it.

One real infrastructure bug turned up, and it's a good story. A vLLM 0.25.0 load of Qwen3-30B-A3B-NVFP4 appeared to hang after weight-loading — idle workers, zero GPU utilization, a shm_broadcast warning firing every 60 seconds. First guess: a TP/expert-parallel synchronization stall. Wrong — retested with Genesis's own production NCCL settings, still hung. Second guess: auto-enabled expert-parallel. Also wrong — the flag defaults off and was never set; the log line that looked like evidence of it prints unconditionally regardless. The actual cause, found by checking ps aux instead of treating the hang as a black box: 91 concurrent nvcc/cicc/ptxas processes. FlashInfer has no precompiled kernel for this SM120+fp8_uint4 combination, so it JIT-compiles on first load — uncapped, on an 8-core box, which drove real RAM/swap thrashing, not a deadlock. Capping MAX_JOBS fixed the thrash, but the compile itself still ran past 12 minutes wall-clock. One line in a launch script, and a lesson permanent enough to make case law: an uncapped JIT compile storm looks exactly like a deadlock until you check what's actually running.
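The check that found the compile storm amounts to counting compiler processes. A minimal sketch follows; the helper names are hypothetical, and it assumes a Linux procps-style `ps` where `-eo comm=` prints one command name per line with no header.

```python
import subprocess
from collections import Counter

COMPILER_NAMES = ("nvcc", "cicc", "ptxas")

def count_compilers(ps_output: str) -> Counter:
    """Count CUDA compiler processes in `ps -eo comm=` style output."""
    counts = Counter()
    for line in ps_output.splitlines():
        name = line.strip().rsplit("/", 1)[-1]  # tolerate full paths
        if name in COMPILER_NAMES:
            counts[name] += 1
    return counts

def running_compilers() -> Counter:
    """Query the live process table (Linux procps assumed)."""
    out = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    ).stdout
    return count_compilers(out)

sample = "python\nnvcc\ncicc\ncicc\nptxas\nbash\n"
print(count_compilers(sample))
```

A large total during an apparent hang points at JIT compilation rather than a deadlock. The cap itself is the MAX_JOBS environment variable named above, set in the launch environment before the server starts; the right value depends on core count and RAM.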

No wall turned up that would justify writing Genesis-style custom patches for the MoE path — llama.cpp served every checkpoint cleanly once configured correctly, and it beat vLLM outright for MoE, the opposite of the dense-model result. Verdict: Genesis stays the production pick. MoE is real, sparse activation is real, the speed is real — on this hardware, for this workload, it isn't a free lunch. It's a different bill.

Full writeup, phase-by-phase raw data, and every receipt: FINAL-REPORT-2026-07-14.md. Dated 2026-07-14.

Autoresearch Optimization Campaigns

Each campaign runs an automated LLM-driven hypothesis loop — generate config variant, benchmark, evaluate stopping criterion (+5 t/s improvement threshold), iterate. Full logs and raw TSV data available at github.com/randomchaos7800-hub/inference-research.
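The loop shape can be sketched as below. This is one plausible reading, not the campaigns' actual code: it assumes the loop stops once the best result exceeds the baseline by the +5 t/s threshold or the variants run out. The function name, the config dictionaries, and the benchmark callable are all hypothetical.

```python
from typing import Callable, Iterable, Optional

def run_campaign(
    baseline_tps: float,
    variants: Iterable[dict],
    benchmark: Callable[[dict], float],
    threshold_tps: float = 5.0,
):
    """Benchmark variants in order; stop once the gain reaches the threshold."""
    best_config: Optional[dict] = None
    best_tps = baseline_tps
    log = []                                   # (config, tokens/s) per experiment
    for config in variants:
        tps = benchmark(config)
        log.append((config, tps))
        if tps > best_tps:
            best_config, best_tps = config, tps
        if best_tps - baseline_tps >= threshold_tps:
            break                              # assumed stopping criterion
    return best_config, best_tps, log

# Stand-in benchmark for demonstration; a real one would launch the server
# with the config and measure generation throughput.
def fake_benchmark(config: dict) -> float:
    return 74.0 + 3.0 * config["batch"]

best, tps, log = run_campaign(74.0, [{"batch": b} for b in (1, 2, 3)], fake_benchmark)
print(best, tps, len(log))
```

In the real campaigns the variant generator is LLM-driven; here it is a fixed list so the control flow stays visible.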

Qwen 3.6-27B Dense — vLLM Passes 1+2

GPTQ-Marlin INT4 · TP=2 · 33 experiments

baseline: 74.35–75.63 t/s

best result: 80.59 t/s

delta: +8.4%

key technique: Marlin atomic-add

VLLM_MARLIN_USE_ATOMIC_ADD=1 adds +6.25 t/s — enables atomic-add reduce in gptq_marlin kernel for small-n decode on TP=2. Single env var, zero model change. KV dtype=auto (fp16) eliminated per-layer dequant overhead (+4.95 t/s in Pass 1).

GPTQ-Marlin · TP=2
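A launch wrapper that applies the two settings from this campaign might look like the sketch below. The model path is a hypothetical placeholder, and the production command also carries Genesis patches and other flags not shown here; only the environment variable and the two flags come from the text above.

```python
import os

MODEL = "path/to/qwen3.6-27b-gptq-int4"  # hypothetical placeholder path

def build_launch(model: str):
    """Return (command, environment) for a TP=2 vLLM server launch."""
    env = dict(os.environ)
    env["VLLM_MARLIN_USE_ATOMIC_ADD"] = "1"   # atomic-add reduce in gptq_marlin
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",
        "--kv-cache-dtype", "auto",           # the Pass 1 KV dtype setting
    ]
    return cmd, env

cmd, env = build_launch(MODEL)
print(" ".join(cmd))
# To actually launch: subprocess.run(cmd, env=env, check=True)
```

Building the command and environment separately keeps the change auditable: the whole optimization is one environment variable and one flag.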

SuperGemma4-26B — llama.cpp Single GPU

Gemma 4 26B · RTX 5060 Ti ×1 · 23 experiments

baseline: 61.7 t/s

best result: 71.1 t/s

delta: +15.2%

key technique: CPU layer offload + threading

Reducing CPU layers 10→4, threads 8→4, parallel=2 at E-core affinity (mask 0x3F5) extracted maximum from Alder Lake heterogeneous core topology. SWA cache contention was the primary bottleneck — eliminating CPU-side routing resolved it.

SWA cache · E-core affinity
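An affinity mask is easier to reason about as a list of logical CPU indices. The sketch below decodes the 0x3F5 mask quoted above. Which indices correspond to E-cores depends on the specific CPU's topology, which is not stated in the source, so the decode makes no claim about that.

```python
def cpus_from_mask(mask: int) -> list:
    """Return the logical CPU indices whose bits are set in an affinity mask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

cpus = cpus_from_mask(0x3F5)
print(cpus)  # [0, 2, 4, 5, 6, 7, 8, 9]

# On Linux the same set could be applied to the current process with
# os.sched_setaffinity(0, cpus); not executed here.
```

The mask selects eight logical CPUs; checking them against the machine's actual core layout (for example with lscpu) confirms the pinning does what is intended.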

AEON — Qwen 3-27B NVFP4

NVIDIA ModelOpt NVFP4 · vLLM · 14 experiments

baseline: 68.86 t/s

best result: 69.75 t/s

delta: +1.3%

key technique: batch=8192 (marginal)

NVFP4 is Blackwell-native quantization (CUDA 12.8+ required), with weights roughly 4× smaller than bf16. At this quantization level the baseline is already near-optimal on SM_120. 4 of 14 runs hit timeout at extreme batch configs. Model runs cleanly at 122K context.

NVFP4 · SM_120-native

Qwen 3.6-27B — Context Scaling (Pass 3)

32K → 128K context · 18 experiments

baseline: 80.65 t/s @ 32K

at 128K context: 80.04 t/s

degradation: −0.8%

context headroom: 4× expansion

fp8 KV cache enables 128K context on 32 GB VRAM with less than 1% throughput degradation. The KV footprint scales linearly with context; fp8 halves the cost vs bf16, making long-context workloads viable without throughput sacrifice.

fp8 KV · 128K context
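The linear scaling and the fp8 halving follow from the standard KV cache size formula. The sketch below uses hypothetical model dimensions for illustration only; they are not the real Qwen3.6-27B configuration. It also assumes every layer holds a full KV cache, which may not be true for hybrid architectures.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Bytes for keys plus values across all attention layers, one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# Hypothetical dimensions for illustration only; NOT the real model config,
# which should be read from the model's config.json.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
GIB = 1024 ** 3

for ctx in (32_768, 131_072):
    bf16 = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, 2)
    fp8 = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, 1)
    print(f"{ctx:>7} tokens: bf16 {bf16 / GIB:.1f} GiB, fp8 {fp8 / GIB:.1f} GiB")
```

With these placeholder dimensions a 4× longer context costs 4× the cache, and fp8 brings the 128K footprint down to what bf16 needs at 64K. Under TP=2 the cache is split across both cards.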

Qwen 3.6-35B-A3B MoE — llama.cpp Dual GPU

131K context · TP=2 · 22 experiments

baseline: 100.24 t/s

best result: 100.24 t/s

delta: 0%

finding: GDN ceiling

22 experiments, zero improvements. MoE at 100 t/s on TP=2 has hit the hardware ceiling: expert routing serializes over PCIe, and no config adjustment can eliminate that bottleneck. The GDN (Gated DeltaNet) recurrence is sequential. This is a fundamental PCIe-vs-NVLink constraint finding.

MoE routing · PCIe bottleneck

GLM-4.7-Flash — vLLM Dual GPU

GLM-4.7B Flash variant · TP=2 · ~15 experiments

baseline: 95.90 t/s

best result: 97.07 t/s

delta: +1.2%

key technique: q5_0 KV quantization

q5_0 KV outperformed both q4_0 and f16 — an atypical result. At this model scale the KV bandwidth is the binding constraint, and q5_0 hits the optimal precision-vs-compression tradeoff. Confirms that KV dtype tuning is model-architecture dependent, not universally monotonic.

KV quant · Flash attention

Llama 3.3 70B — llama.cpp Dual GPU

Q4_0 · TP=2 · 10 experiments

baseline: 55.6 t/s

best result: 55.6 t/s

experiments failed: 10/10 OOM

hardware limit: VRAM ceiling

70B at Q4_0 needs ~34 GB of VRAM with Flash attention overhead, just past the 32 GB ceiling. Every optimization attempt that touched context size or cache dtype triggered OOM. Baseline 55.6 t/s is usable but unoptimizable on this hardware tier. 70B+ requires more VRAM or quantization below Q4.

VRAM limit · 70B boundary

Gemma4 26B AWQ — vLLM Single GPU

AWQ 4-bit · RTX 5060 Ti ×1 · 5 experiments

baseline latency: 32.77 ms/tok

best result: 32.77 ms/tok

timeouts: 2/5

outcome: discontinued

AWQ was tested as an alternative quantization path to GPTQ-Marlin on SM_120. 2 of 5 runs hit timeout at extreme batch configs. Remaining runs returned neutral outcomes with no improvement path visible. GPTQ-Marlin was confirmed as the superior quantization approach for Blackwell.

AWQ · vs Marlin

Nemotron 3 Nano 30B — llama.cpp Format Trials

NVIDIA MoE hybrid · 30B total / ~3B active · 2026-05-20

winner: Q4_K_M, llama.cpp cuda128-clean

median: 117.34 t/s

peak: 117.60 t/s

smoke tests: 4/4 pass

VRAM: 24.9 GB (11.8 + 13.1 GB)

prefill: 1,200–2,500 t/s

Q4_K_M via llama.cpp cuda128-clean is the confirmed path. The cuda13 build fails on SM_120: CUDA MMQ kernel crash (ggml_cuda_mul_mat_q: invalid argument) after the first inference, every time. The vLLM FP8 TP=2 path also failed: a stale flag (--disable-log-requests, removed in newer vLLM) and an NCCL engine init failure. Q4_K_M at 117 t/s set the production peak at the time and ran as the default backend via proxy :8010 from 2026-05-20; Nemotron is now standby (genesis is the current default).

Nemotron · llama.cpp · SM_120

Qwen3 Quantization Shootout — 5 Formats

vLLM 0.21.0 + Genesis patches · TP=2 · 2026-05-16

genesis baseline: 73.35 t/s (27B INT4 + MTP n=3)

Qwen3-14B FP8: 43.54 t/s (clean load)

Qwen3-30B-A3B GPTQ: 41.33 t/s (cpu-offload req'd)

Qwen3-32B GPTQ dense: 5.25 t/s (PCIe ceiling)

NVFP4 (×2 models): was blocked; CUDA 13.0.3 now installed

The dense 32B vs MoE 30B comparison is the key finding. Both models are 16 GB on disk; both require --cpu-offload-gb 1.0 to initialize (Marlin repacking peak creates ~200 MiB initialization overage). But throughput differs by 8×: 5.25 vs 41.33 t/s. Dense 32B: all 32B weights accessed per token, PCIe bus is saturated. MoE 30B-A3B: only ~3B active params per token, PCIe pressure is ~10× lower. NVFP4 was blocked by FlashInfer's CUTLASS SM_120 FP4 kernels requiring CUDA ≥ 12.9; this system ran 12.8 at the time. System upgraded to CUDA 13.0.3 on 2026-06-06 — CUDA version barrier cleared; Qwen3 NVFP4 on SM_120 is now a candidate for re-evaluation.

FP8 · GPTQ-Marlin · MoE vs Dense · NVFP4 blocked
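The "8×" and "~10×" figures in this card are simple ratios of the numbers quoted. A quick check, using the approximate parameter counts given in the text (32B dense, ~3B active for the MoE):

```python
DENSE_TPS, MOE_TPS = 5.25, 41.33            # measured t/s from this card
DENSE_ACTIVE_B, MOE_ACTIVE_B = 32.0, 3.0    # billions of params touched per token

speedup = MOE_TPS / DENSE_TPS
active_ratio = DENSE_ACTIVE_B / MOE_ACTIVE_B

print(f"MoE throughput advantage: {speedup:.1f}x")
print(f"dense/MoE active params:  {active_ratio:.1f}x")
```

The throughput gap (about 7.9×) tracks the active-parameter gap (about 10.7×) reasonably closely, which fits the explanation that per-token weight traffic is the binding constraint here.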

PRISM-PRO 27B — MTP Speculation Sweep

Qwen3.6-27B-PRISM-PRO-DQ · llama.cpp · 15 experiments · 2026-05-19

baseline (MTP n=1): 34.78 t/s

winner (MTP n=2): 39.30 t/s (+13.0%)

no MTP (sanity): 26.25 t/s (−24.5%)

threads_4: 20.94 t/s (−39.8%)

flash-attn: FAIL (server timeout 120s)

MTP n=2 is the ceiling for this model: n=3 gives only 35.13 t/s (+0.35 vs baseline), likely because the third speculative token is rejected too often to pay for itself. The no-MTP sanity check confirms MTP contributes ~8.5 t/s to the baseline score. Flash-attn failed to load — server did not come up within 120s, consistent with the SWA hybrid memory architecture (SSM layers conflict with standard flash-attn kernel path). Wall-clock throughput (37.82 t/s at MTP n=2) is close to token-gen throughput because this model's SSM decode path is not fill-dominated.

MTP · speculative decoding · SSM hybrid
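The percentage deltas in this sweep are all relative to the MTP n=1 baseline. The sketch below recomputes them from the quoted throughputs; the dictionary keys are illustrative labels, not the campaign's actual experiment names.

```python
BASELINE_TPS = 34.78  # MTP n=1

# Throughputs quoted in this card (t/s).
RUNS = {
    "mtp_n2": 39.30,
    "mtp_n3": 35.13,
    "no_mtp": 26.25,
    "threads_4": 20.94,
}

def pct_delta(tps: float, baseline: float = BASELINE_TPS) -> float:
    """Percent change versus the baseline, rounded to one decimal."""
    return round((tps / baseline - 1.0) * 100.0, 1)

deltas = {name: pct_delta(tps) for name, tps in RUNS.items()}
for name, delta in deltas.items():
    print(f"{name:<10} {RUNS[name]:6.2f} t/s  {delta:+.1f}%")
```

The n=3 run comes out at about +1% over baseline, which is the quantitative basis for calling n=2 the ceiling.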

Qwen2.5-7B-Instruct Q4_K_M — MacBook Air M4

Metal backend (MTL,BLAS) · llama.cpp 9270 · 25 experiments · 2026-05-22

baseline (cold): 209.79 pp / 20.91 tg

best tg (t=1, warm): 21.10 t/s

MLX pp throughput: 70.44 t/s, rejected (3× slower)

thermal cost (20 min): −24% pp (209 → 152 t/s)

t=1 is monotonically optimal with full GPU offload (-ngl 99): GPU handles all matrix math, extra CPU threads only add unified memory bus contention. Flash-attn on: +9.2% pp, +10.2% tg. MLX rejected despite being Apple-native — pp 3× slower than llama.cpp, marginal tg gain not worth it. Qwen3-8B (MLX) rejected: 80% slower pp, 12% slower tg. Metal flash-attn V-cache quantization hard-blocked (V quant incompatible with Metal flash-attn kernel path). No source build tested — native M4 mcpu flag would likely recover 15-25%. Winning config: -ngl 99 --flash-attn on -t 1

Metal · Apple Silicon · MLX vs llama.cpp

Nemotron 3 Nano 30B — Parameter Sweep

Q4_K_M · llama.cpp cuda128-clean · 16 experiments · 2026-05-21

baseline: 123.1 t/s

winner: 123.6 t/s (ctx=65536)

delta: +0.4% / free win

threads=16: 77 t/s, stddev 30, unstable

flash-attn on: 123.6 t/s, no gain

16-experiment sweep across ctx-size, threads, cache-ram, and flash-attn. ctx=65536 (2× baseline) matches baseline throughput while doubling available context — free win, deployed to production. threads=16 caused severe instability (stddev 30, worst 45 t/s) due to Mamba SSM's sequential state dependency conflicting with hyperthread contention. flash-attn has no effect on Mamba hybrid architecture — the attention-free SSM decode path is unaffected. Model is at ceiling for this hardware tier.

Mamba hybrid · MoE · SSM

Qwen3.6-27B Genesis — Production Restoration

vLLM 0.21.0 + Genesis patches · fp8 KV + MTP n=3 · 2026-06-06

prev best (pre-incident): 80.8 t/s

restored warm throughput: 88 t/s

avg across 29 benchmarks: 63.9 t/s

quality (no-think): 89.7% (26/29)

context window: 65K (fp8 KV)

Three blockers cleared: flash_attn 2.8.3 installed (cuda13 headers), nemotron.service port seizure stopped (Restart=no drop-in), TimeoutStartSec=300 added to prevent 90s systemd kill during model load. Config hardened: fp8 KV cache, qwen3_coder tool parser (fixes silent tool-drop bug), SM_120 FlashInfer routing env vars, FlashInfer workspace 413→256 MiB (OOM fix on first MTP inference). Benchmark data: 100% instruction following, 100% tool calling, 90% coding. Math at 67% without thinking — expected; published numbers require <think> chains.

genesis · fp8 KV · MTP n=3 · GPTQ-Marlin

Ornith-1.0-35B — LangChain Tool-Use Evaluation

deepreinforce-ai/Ornith-1.0-35B-GGUF Q4_K_M · llama.cpp cuda120-nographs · 2026-06-25

typewriter 1-tool: 20/20 (100%), beats DeepSeek V3.2

typewriter 26-tool: 20/20 (100%)

multiverse math: 10/18 (56%), multi-hop gap

relational data: 6/8 (75%)

overall: 56/66 (84.8%) · ~3s avg latency

vs DeepSeek V3.2: 57/66 (86.4%) · ~22s avg latency

35B GGUF on dual RTX 5060 Ti outperforms DeepSeek V3.2 on sequential tool composition — the task where GPT-4 famously fails in LangChain's original paper. Perfect typewriter on both variants; DeepSeek loops to the 30-turn cap on 9 of 20 single-tool cases. DeepSeek wins on multiverse math (altered arithmetic rules, must ignore pretrained knowledge) and relational data (multi-hop DB lookups). One-point overall margin hides inverted profiles: Ornith for high-frequency hot-path tool routing at 3s and zero API cost; DeepSeek for hard-path reasoning chains where correctness beats latency. Serving config: :8030, --ctx-size 131072, --cache-type-k q4_0, --flash-attn on, --tensor-split 1,1.
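The overall score is a plain pass count across the four suites, so it can be re-derived from the per-suite figures. The sketch below does that for Ornith only, since per-suite DeepSeek counts are not listed on this card; the dictionary layout is an assumption made for illustration.

```python
# Per-suite results for Ornith-1.0-35B as (passed, total), copied from the card.
ORNITH = {
    "typewriter 1-tool": (20, 20),
    "typewriter 26-tool": (20, 20),
    "multiverse math": (10, 18),
    "relational data": (6, 8),
}

def overall(results):
    """Sum passes and totals across suites and return (passed, total, percent)."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed, total, 100.0 * passed / total

passed, total, pct = overall(ORNITH)
print(f"overall {passed}/{total} ({pct:.1f}%)")
for suite, (p, t) in ORNITH.items():
    print(f"{suite}: {100.0 * p / t:.0f}%")
```

Because the suites differ in size, the overall figure is weighted toward the two 20-case typewriter suites, which is part of why a one-point margin can hide opposite per-suite profiles.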

tool-use eval · head-to-head · routing

Ornith-1.0-35B GGUF — Production Speed Suite

Q4_K_M · llama.cpp cuda120-nographs · layer-split dual GPU · 2026-06-25–26

short math peak: 129.7 tok/s

code / CSV: 126.9 tok/s

agentic JSON: 127.5 tok/s

canonical 512-gen: 87 tok/s (via :8010)

HumanEval: 43.9% (72/164), fence leakage

context floor: 21 t/s @ 8–16K ctx

GGUF Q4_K_M on dual 5060 Ti via llama.cpp layer-split peaks at ~130 t/s on short-context tasks, a new hardware throughput record that exceeds Nemotron 30B (117 t/s). Context scaling shows an 8K–16K trough (21–23 t/s) with partial recovery at 32K (46 t/s), attributed to KV scheduler behavior at this context range. HumanEval 43.9% is artificially low: markdown fence leakage in completions causes harness-level SyntaxError failures where the underlying logic is correct. Rolled to production 2026-06-26, then upgraded to the NVFP4 variant on 2026-06-27.
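The fence-leakage failure mode can be reproduced in a few lines. This is a sketch of one possible post-processing step, not the harness used for the score above; `strip_fences` and `compiles` are hypothetical helpers, and the fence marker is built at runtime so the example itself contains no literal fence.

```python
import re

FENCE = "`" * 3  # three backticks, assembled at runtime
FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def strip_fences(completion):
    """Return the body of the first fenced block, or the text unchanged."""
    match = FENCE_RE.search(completion)
    return match.group(1) if match else completion

def compiles(source):
    """True when the source parses as Python."""
    try:
        compile(source, "<completion>", "exec")
        return True
    except SyntaxError:
        return False

prompt = "def add(a, b):\n"
raw = FENCE + "python\n    return a + b\n" + FENCE + "\n"
cleaned = strip_fences(raw)
print(compiles(prompt + raw), compiles(prompt + cleaned))
```

Re-scoring with a step like this would separate formatting failures from logic failures, which is the distinction the 43.9% figure currently blurs.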

GGUF · llama.cpp · throughput record

Ornith-1.0-35B-AEON-Ultimate NVFP4 — vLLM 0.23 Production Deploy

NVFP4A16 (FP4 experts, BF16 attn) · vLLM 0.23 + SM_120 Marlin patch · 2026-06-27

warm throughput (:8010): ~101 tok/s short-gen

code / JSON peak: 123–124 tok/s

context window: 131K (fp8 KV enabled)

VRAM: ~23 GB (fp8 KV, both GPUs)

LangChain eval: not re-run (GGUF score 84.8%)

x86 port of AEON's DGX Spark serving recipe onto the SM_120 tower. Key blocker: AEON's guide assumes BF16 KV cache (the vision tower forces it on DGX), which capped context at ~32K on the tower. Adding --language-model-only --kv-cache-dtype fp8 restores the full 131K context with a 338K KV token budget. Kernels: Marlin linear (auto) + flashinfer_b12x MoE dispatch + a local SM_120 Marlin compatibility patch. vLLM 0.23 TP=2 on SM_120 is stable with these patches applied. NVFP4A16 weight format: FP4 quantization on experts/MLP, BF16 on attention layers. The LangChain brutal eval re-run is pending; the GGUF score (84.8%) is the current quality baseline.
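A rough sketch of how the serving flags described above might be assembled into a launch command. The model path is hypothetical, 131072 is an assumed exact value for the 131K window, and `--language-model-only` is quoted from this page rather than checked against the vLLM CLI; the local Marlin patch and kernel selection are not represented.

```python
import shlex

def build_vllm_argv(model_path, port=8010):
    """Assemble a vLLM serve command for the TP=2, fp8-KV configuration."""
    return [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", "2",   # TP=2 across both cards
        "--max-model-len", "131072",     # assumed exact value for 131K
        "--kv-cache-dtype", "fp8",       # from the deploy notes
        "--language-model-only",         # from the deploy notes
        "--port", str(port),
    ]

argv = build_vllm_argv("/models/ornith-aeon-nvfp4")  # hypothetical path
print(shlex.join(argv))
```

Keeping the command as a list makes it easy to diff one configuration against another when only a single flag changes between experiments.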

NVFP4 · SM_120 · vLLM 0.23 · 131K ctx

Gemma 4 12B IT — New Model Evaluation

Q8_0 GGUF (12.7 GB) · llama.cpp + vLLM + SGLang · 19 experiments · 2026-06-03

baseline (llama.cpp): 30.9 t/s

best result: 30.9 t/s

delta (19 experiments): 0% (fully flat)

split-mode row: −10%

split-mode tensor: CRASH (b9245 bug)

vLLM 0.21.0: OOM, 114 MiB short

SGLang 0.5.12: arch not recognized

Dense 12B at Q8_0 is pure VRAM bandwidth bound. 19 autoresearch experiments across split modes, threads 8–20, ubatch 512–4096, and KV quant measured flat at 30.9 t/s ±0.2. split-mode tensor crashes on Gemma 4 in llama.cpp b9245; row split is −10% from PCIe sync overhead. vLLM 0.21.0 and SGLang 0.5.12 have no native Gemma4UnifiedForConditionalGeneration implementation — both fall to the Transformers fallback, which carries ~230 MiB overhead that pushes fp8 single-GPU loading over the 15.47 GiB GPU limit by exactly 114 MiB. TP=2 via Transformers fallback also crashes (tensor reshape failure after weight sharding). llama.cpp handles it cleanly at 30.9 t/s. That is the ceiling until native engine support ships.
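The bandwidth-bound claim can be sanity-checked with a simple roofline estimate: if every generated token streams all weights once, throughput cannot exceed memory bandwidth divided by weight size. The 448 GB/s figure is an assumed spec-sheet value for one RTX 5060 Ti and does not come from this page; the sketch also assumes a layer split does not add the two cards' bandwidth together, since layers execute in sequence.

```python
def bandwidth_ceiling_tps(weight_gb, bandwidth_gb_per_s):
    """Upper bound on decode tok/s when each token reads all weights once."""
    return bandwidth_gb_per_s / weight_gb

WEIGHTS_GB = 12.7           # Q8_0 GGUF size from the card
BANDWIDTH_GB_PER_S = 448.0  # assumed per-card GDDR7 bandwidth
MEASURED_TPS = 30.9

ceiling = bandwidth_ceiling_tps(WEIGHTS_GB, BANDWIDTH_GB_PER_S)
print(f"ceiling {ceiling:.1f} t/s, measured {MEASURED_TPS} t/s "
      f"({MEASURED_TPS / ceiling:.0%} of ceiling)")
```

Under these assumptions the measured rate sits close to the ceiling, which would explain why 19 configuration experiments moved nothing.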

new model · bandwidth bound · ecosystem gap

DiffusionGemma 26B A4B IT — Block Diffusion Debut

Google DeepMind · Q4_K_M GGUF (16 GB) · llama.cpp PR #24423 · 2026-06-10

baseline (n=256, dual GPU): 41 tok/s · 248ms/step · 25 steps

kv_cache on (best flag): 48 tok/s · 22 steps (+17%)

n=1024 + kv + alg2: 121 tok/s (canvas scaling)

algorithm sweep (alg0–4): margin-based ≈ confidence-based

single GPU: blocked, Q4KM 300 MB over 16 GB limit

Block-diffusion architecture: iterative canvas denoising over 256-token blocks in parallel — not autoregressive. Requires llama.cpp PR #24423 (not in main; current master returns unknown model architecture: diffusion-gemma). Build target: CUDA 12.8 / sm_120 cuda128-clean. Key flag: --diffusion-kv-cache on is auto-disabled on 2-GPU splits as a conservative default — forcing it on yields +17% and reduces step count from 25 → 22 with no quality degradation observed. The architecture's native strength is longer outputs: per-step cost is nearly constant regardless of canvas size, so n=1024 delivers 2.5× the throughput of n=256. Single-GPU blocked by 300 MB headroom gap (Q4KM needs 16.0 GB, card has 15.7 GB free) — would unlock Flash Attention + KV together for an estimated further 20–30% gain.

block diffusion · new architecture · canvas scaling

Nemotron 3 Nano 30B NVFP4 — TRT-LLM Benchmark

NVFP4 · TensorRT-LLM 1.2.1 · hardware-limited · 2026-05-21

llama.cpp baseline: 123.6 t/s

TRT-LLM result: DID NOT RUN (OOM)

model weights / GPU: 14.47 GB of 15.47 GB

headroom after load: 17 MB (need 20 MB)

NVFP4 via TRT-LLM 1.2.1 is hardware-limited on 2× RTX 5060 Ti 16GB. Standard TP=2 fails because Mamba/SSM layers cannot be tensor-parallelised — they replicate on each GPU rather than split. The correct approach (NVIDIA's official ep/bmm sharding via trtllm-serve --backend _autodeploy) loads 14.47 GB of weights onto GPU 0 (Mamba layers replicated + sharded experts), leaving only 17 MB headroom. TRT-LLM needs 20 MB to complete initialization. 3 MB gap, no configuration fix possible. Requires ≥20 GB VRAM per card.

NVFP4 · TRT-LLM · hardware-limited

Optimization Deltas — Summary

Generation throughput improvement from baseline across all optimization campaigns. Bars normalized to the largest observed delta (+17% DiffusionGemma 26B kv-cache flag). Negative deltas not shown; failed/OOM campaigns listed separately.

Best improvement delta per campaign (%)

Campaign	Best delta
DiffusionGemma 26B (kv-cache flag)	+17%
SuperGemma4-26B (llama.cpp)	+15.2%
PRISM-PRO 27B (llama.cpp MTP)	+13.0%
Qwen 3.6-27B Pass 2 (vLLM)	+8.4%
Genesis restoration (config hardening)	+8% vs prior
Qwen 3.6-27B Pass 1 (vLLM)	+6.7%
AEON NVFP4 (vLLM)	+1.3%
GLM-4.7-Flash (vLLM)	+1.3%
Qwen MoE 35B (llama.cpp)	(no value shown)
Llama 3.3 70B (llama.cpp)	OOM ×10
Gemma 4 12B (19 experiments)	0% (flat)

Full Experiment Log — Pass 2 Detail (Qwen 3.6-27B vLLM)

Complete 13-experiment run showing exact variables, deltas, and outcomes. Representative of methodology applied across all campaigns.

#	Variable	Value	t/s (median)	Delta	Outcome
0	Baseline (Pass 1 canonical, fp16 KV)	—	74.35	±0.00	BASELINE
1	`VLLM_MARLIN_USE_ATOMIC_ADD`	`1`	80.59	+6.25	IMPROVEMENT
2	`VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE`	`1073741824`	73.57	−0.78	NO CHANGE
3	`VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE`	`805306368`	74.92	+0.57	NO CHANGE
4	`OMP_NUM_THREADS`	`2`	80.14	−0.46	NO CHANGE
5	`NCCL_BUFFSIZE`	`8388608`	80.99	+0.40	MARGINAL
6	`NCCL_BUFFSIZE`	`16777216`	81.65	+1.06	MARGINAL
7	`PYTORCH_MAX_SPLIT_MB`	`64`	81.18	+0.58	MARGINAL
8	`PYTORCH_MAX_SPLIT_MB`	`128`	74.44	−6.15	NO CHANGE
9	OMP4 + exclusive buffer (stacked)	combined	80.54	−0.06	NO CHANGE
10	OMP4 + exclusive + GMU 0.88 (triple stack)	combined	79.15	−1.45	NO CHANGE
11	ctx 16K + GMU 0.88	combined	79.11	−1.48	NO CHANGE
12	Marlin atomic + OMP4 + exclusive	combined	80.02	−0.58	NO CHANGE
final	Best config re-verified	—	80.48	+6.13	FINAL
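The Delta column can be recomputed from the medians. Judging by the numbers alone, experiment 2 is measured against the 74.35 baseline while experiments 4 to 12 match the 80.59 atomic-add result to within 0.01; that reading is an inference, not something the log states. The sketch prints both references side by side so each row can be checked.

```python
BASELINE = 74.35    # experiment 0
ATOMIC_ADD = 80.59  # experiment 1, the adopted improvement

# Median t/s per experiment, copied from the table.
MEDIANS = {2: 73.57, 3: 74.92, 4: 80.14, 5: 80.99, 6: 81.65, 7: 81.18,
           8: 74.44, 9: 80.54, 10: 79.15, 11: 79.11, 12: 80.02}

def delta(tps, reference):
    """Difference in t/s against a reference median, rounded to 2 places."""
    return round(tps - reference, 2)

for exp, tps in MEDIANS.items():
    print(f"#{exp}: vs baseline {delta(tps, BASELINE):+.2f}, "
          f"vs atomic-add {delta(tps, ATOMIC_ADD):+.2f}")

gain = ATOMIC_ADD - BASELINE
gain_pct = 100.0 * gain / BASELINE
print(f"atomic-add gain: {gain:+.2f} t/s ({gain_pct:.1f}%)")
```

Differences of 0.01 against the published column are expected, since the table's deltas were presumably computed from unrounded medians.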

Blackwell SM_120 Research Findings

Operational findings specific to NVIDIA Blackwell SM_120 consumer GPU inference. These characterize the platform envelope for first-generation Blackwell consumer hardware — a dataset that is sparse in the current literature as most published research covers datacenter-class (H100, A100) or older consumer generations (Ampere, Ada).

CUDA kernel paths · SM_120 fast path

Fast CUDA kernel paths on SM_120 exist for q4_0 and f16 KV cache only. q8_0, q5_0, and iq4_nl all degrade significantly: they fall back to unoptimized paths with 15–30% throughput loss. This is a first-generation Blackwell constraint; future driver updates may expand coverage. Choosing the wrong KV dtype is one of the most common configuration errors on this platform.

SM_120 · KV dtype selection
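Since the wrong KV dtype is called out as a common mistake, a pre-launch check is cheap insurance. The sketch below scans a llama.cpp argument list for `--cache-type-k` and `--cache-type-v` and reports any value outside the fast set named above; the helper name and the idea of linting the command line are assumptions for illustration.

```python
FAST_KV_DTYPES = {"q4_0", "f16"}  # fast paths reported for SM_120

def slow_kv_flags(argv):
    """Return (flag, dtype) pairs whose KV cache dtype lacks a fast path."""
    slow = []
    for flag in ("--cache-type-k", "--cache-type-v"):
        if flag in argv:
            idx = argv.index(flag)
            if idx + 1 < len(argv) and argv[idx + 1] not in FAST_KV_DTYPES:
                slow.append((flag, argv[idx + 1]))
    return slow

cmd = ["llama-server", "--cache-type-k", "q4_0", "--cache-type-v", "q8_0"]
print(slow_kv_flags(cmd))
```

An absent flag is treated as fine here because llama.cpp defaults the KV cache to f16, which is one of the two fast dtypes.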

NVFP4 quantization · Blackwell-native format

NVFP4 (NVIDIA ModelOpt format) is the Blackwell-native quantization standard. It provides 4× smaller KV footprint vs bf16 and enables 122K+ context on 32 GB VRAM. Benchmark result: 69 t/s on AEON (Qwen 3-27B NVFP4) running cleanly on CUDA 12.8 via vLLM 0.19.2rc1. Optimization headroom is minimal because the baseline is already near-hardware-optimal.

Previous blocker (May 2026): Qwen3-series NVFP4 models (30B-A3B, 32B) were blocked because they require FlashInfer's CUTLASS SM_120 FP4 GEMM kernels, which need CUDA ≥ 12.9. The system ran CUDA 12.8 at the time. CUDA 13.0 appeared installed, but genesis vLLM TP=2 init was crashing (later traced to flash_attn missing, not the CUDA version).

Status update (2026-06-06): System upgraded to CUDA 13.0.3 and genesis is now stable. The CUDA version barrier for NVFP4 is cleared. Qwen3-series NVFP4 models are candidates for re-evaluation on this hardware.

NVFP4 · ModelOpt · 122K context · CUDA 13.0.3

llama.cpp build target · cuda128-clean required on SM_120

On SM_120 (Blackwell first-gen consumer), llama.cpp must be built with the cuda128-clean target, not cuda13. The cuda13 build crashes with ggml_cuda_mul_mat_q: invalid argument on the first generation call after prompt eval, on every model tested (Nemotron Q4_K_M, Qwen3 GGUF variants). This is a CUDA MMQ kernel incompatibility between the cuda13 build's quantized matrix-multiply path and SM_120 hardware, not a model issue. The cuda128-clean build routes around this kernel entirely and runs stably. Confirmed 2026-05-20 during Nemotron format trials. Anyone deploying llama.cpp GGUF inference on Blackwell consumer hardware should verify their build target before assuming the model is the problem.

SM_120 · llama.cpp · build target

GPTQ-Marlin kernel · atomic-add optimization

VLLM_MARLIN_USE_ATOMIC_ADD=1 enables atomic-add reduce in the gptq_marlin kernel for small-n decode on TP=2 configurations. This produced the largest single-variable improvement in the research: +6.25 t/s (8.4%) on Qwen 3.6-27B without any model or architecture change. The improvement is specific to GPTQ-quantized models running TP=2 over PCIe; it is not applicable to NVFP4 or unquantized deployments.

GPTQ-Marlin · TP=2
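A single-variable experiment like this one needs two environments that differ in exactly one setting. The sketch below builds such a pair; `ab_environments` is a hypothetical helper, and how the benchmark process is launched with each environment is left out.

```python
import os

ATOMIC_ADD_VAR = "VLLM_MARLIN_USE_ATOMIC_ADD"

def ab_environments(base=None):
    """Return (control, treatment) environments differing in one variable."""
    control = dict(os.environ if base is None else base)
    control.pop(ATOMIC_ADD_VAR, None)  # make sure the control run has it unset
    treatment = dict(control)
    treatment[ATOMIC_ADD_VAR] = "1"
    return control, treatment

control, treatment = ab_environments()
changed = {key for key in treatment if control.get(key) != treatment[key]}
print(changed)
```

Each dictionary can be passed as the `env` argument of `subprocess.run` when starting the server, so the two runs share everything except the switch under test.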

Tensor parallelism over PCIe · no NVLink constraint

TP=2 over PCIe (x8 Gen5 + x4 Gen4 asymmetric config) delivers usable multi-GPU inference at 107 t/s peak on Qwen 3.6-35B MoE (April 2026 vLLM campaign). The bandwidth constraint is visible: MoE expert routing over PCIe creates a GDN bottleneck that 22 optimization experiments could not overcome. This is a fundamental PCIe-vs-NVLink architectural boundary, not a configuration issue. Dense model TP=2 scales more predictably without the MoE expert dispatch penalty. Note (2026-05-20): system peak is now 117.60 t/s via Nemotron 3 Nano 30B Q4_K_M running on llama.cpp cuda128-clean across both GPUs, which is a different inference stack (GGUF tensor split vs vLLM TP=2) on the same PCIe hardware.

TP=2 · PCIe bottleneck · MoE routing

vLLM version pinning · 0.20.x incompatible

vLLM 0.20.2 fails on SM_120 TP=2 with an NCCL crash at multi-GPU initialization. SM_120 multi-GPU support was not present in 0.20.x as of May 2026. The node operates on 0.19.2rc1 (Genesis patches), which includes SM_120-specific fixes. The kernel 6.17.0-23 upgrade required manual GRUB pinning during the incident. The 0.20 incident resulted in an 11.6% throughput degradation (80.8 → 71.4 t/s) pending a full re-optimization pass. This documents a real operational risk for anyone deploying Blackwell consumer GPUs with multi-GPU vLLM.

SM_120 · vLLM pinning

Context scaling · fp8 KV enables 128K

fp8 KV cache on Qwen 3.6-27B enables a 128K context window with <1% throughput degradation (80.65 t/s at 32K → 80.04 t/s at 128K, −0.8%). This is the most cost-effective capability expansion found in the research: 4× context length at negligible throughput cost. KV footprint scales linearly with context; fp8 halves the cost vs bf16. At 32 GB VRAM, 128K is the practical ceiling without further quantization of the KV cache itself.

fp8 KV · 128K context
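The two scaling statements above (linear in context, halved by fp8) follow from the usual KV cache size formula. The sketch uses made-up layer and head counts purely for illustration; they are not the real Qwen 3.6-27B configuration, so the absolute sizes are not meaningful, only the ratios.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size; the factor 2 covers separate K and V tensors per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# Hypothetical dimensions for illustration only.
DIMS = dict(layers=48, kv_heads=8, head_dim=128)

for name, width in (("bf16", 2), ("fp8", 1)):
    for ctx in (32_768, 131_072):
        gib = kv_cache_bytes(ctx, bytes_per_elem=width, **DIMS) / 2**30
        print(f"{name} @ {ctx} tokens: {gib:.1f} GiB")
```

With any fixed set of dimensions, going from 32K to 128K multiplies the footprint by four and switching bf16 to fp8 divides it by two, so fp8 at 128K costs twice what bf16 did at 32K.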

Apple Silicon / Metal · pre-M5 inference constraints

With full GPU offload (-ngl 99) on M4, t=1 is strictly optimal: extra CPU threads fight over unified memory bandwidth without contributing to GPU compute. Flash attention is supported on M4 Metal (MTLGPUFamilyApple9) and gives +9–10% on both pp and tg. The Metal tensor API is disabled on pre-M5 hardware: llama.cpp logs "tensor API disabled for pre-M5 and pre-A19 devices", which blocks the fast tensor execution path and caps generation at ~21 t/s on 7B dense models. MLX is not a drop-in replacement: despite being Apple-native, it is 3× slower on prefill than llama.cpp Metal for Q4_K_M weights, and its marginal generation advantage does not compensate. Fanless thermal throttle is a hard constraint: sustained benchmarking degrades pp from 209 t/s (cold) to 152 t/s (warm, 20 min), a 27% loss. The first inference session each day is significantly faster than sustained use. KV V-cache quantization is blocked on Metal flash-attn: any V-quantized configuration fails completely, and K-only quantization is marginal (+0 to +3%). Established 2026-05-22 on MacBook Air M4, llama.cpp 9270 (Homebrew), macOS 15.

Metal · Apple Silicon · fanless thermal · pre-M5

70B hardware boundary · 32 GB VRAM limit

Llama 3.3 70B at Q4_0 requires ~34 GB VRAM, 2 GB over the 32 GB TP=2 ceiling when flash attention overhead is included. The model loads and runs at 55.6 t/s with the current split but is unoptimizable: all 10 experiments that adjusted context size or KV dtype triggered OOM. This characterizes the practical 70B boundary for dual RTX 5060 Ti deployment. Sub-Q4 quantization (Q3_K, Q2_K) would fit but at significant quality cost.

70B boundary · VRAM ceiling

vLLM genesis — SM_120 operational constraints · flash_attn, service isolation, timeout

Three non-obvious requirements for stable vLLM genesis operation on SM_120 (Blackwell consumer):

(1) flash_attn must be explicitly installed in the vllm virtual environment; it is not pulled in as a vLLM dependency on CUDA 13.0 builds. Missing it crashes with ModuleNotFoundError: No module named 'flash_attn.ops'. The build requires CUDA_HOME=/usr/local/cuda-13.0 and --no-build-isolation; flash-attn 2.8.3 supports torch 2.11.0+cu130.

(2) Port isolation is mandatory: any co-resident inference service (llama.cpp, another vLLM) must have Restart=no in its systemd unit. A service with Restart=always will re-claim the inference port within seconds of a genesis stop, causing a phantom "address in use" crash loop that looks like a genesis bug.

(3) TimeoutStartSec=300 is required in the systemd unit file. The default (90s) is shorter than the model load time, so systemd kills the process mid-load with no error other than "start operation timed out." Absent this, genesis appears to hang when it is actually loading normally.

SM_120 · genesis · flash_attn · systemd
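Requirements (2) and (3) above are both systemd settings, so they can be captured as drop-in override files. The sketch below only renders the file paths and contents; `genesis.service` is a hypothetical unit name, while `nemotron.service` is the co-resident service named in the Genesis restoration notes on this page.

```python
def drop_in(unit, section, settings):
    """Return (path, contents) for a systemd drop-in that overrides settings."""
    path = f"/etc/systemd/system/{unit}.d/override.conf"
    body = "\n".join(f"{key}={value}" for key, value in settings.items())
    return path, f"[{section}]\n{body}\n"

DROP_INS = [
    # Give the model time to load before systemd gives up on the start job.
    drop_in("genesis.service", "Service", {"TimeoutStartSec": "300"}),
    # Stop the co-resident service from re-claiming the inference port.
    drop_in("nemotron.service", "Service", {"Restart": "no"}),
]

for path, contents in DROP_INS:
    print(f"# {path}\n{contents}")
```

After files like these are written, `systemctl daemon-reload` is needed before the overrides take effect.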

Block-diffusion canvas scaling · DiffusionGemma, 2026-06-10

DiffusionGemma 26B (Google DeepMind) introduced a fundamentally different generation architecture to the benchmark suite. Instead of autoregressive token-by-token generation, the model runs iterative denoising passes over a canvas of tokens in parallel, finalizing tokens whose entropy drops below a confidence threshold and remasking the rest. Each forward pass processes the entire canvas; early stopping triggers when the pass meets the convergence criterion. The per-step cost is nearly constant regardless of canvas size, which means requesting longer outputs yields proportionally more tokens per pass. Measured result: n=256 at 48 tok/s (22 steps), n=1024 at 121 tok/s (34 steps across 2 auto-scaled canvas blocks), or 2.5× throughput for 4× the output, with the same 249ms/step cost. This canvas-parallelism property has no analog in autoregressive inference; throughput figures for block-diffusion models must specify output length to be meaningful.

Dual-GPU default flag issue: --diffusion-kv-cache is auto-disabled on 2-GPU splits in PR #24423 as a conservative default. Forcing --diffusion-kv-cache on is safe on Blackwell TP=2 and delivers +17% throughput with fewer steps (25 → 22), the largest single-flag gain found in this research.

Single-GPU path: Q4_K_M requires 16.0 GB; the RTX 5060 Ti has 15.7 GB free after driver overhead, so a 300 MB gap prevents single-GPU load. Single-GPU would enable Flash Attention + KV cache simultaneously, for an estimated 20–30% further gain. Fitting on one card would require a Q3_K_M quant, which needs the BF16 GGUF (~50 GB) as the quantization source.

block diffusion · canvas scaling · DiffusionGemma
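The canvas-scaling result follows from a simple model: if each denoising step costs roughly the same time, throughput is output length divided by (steps × step time). The sketch plugs in the step counts and the 249 ms step cost reported above; the constant-cost assumption is an approximation, and the function name is made up for illustration.

```python
def block_diffusion_tps(tokens, steps, ms_per_step):
    """Tokens per second when every denoising step costs about the same."""
    return tokens / (steps * ms_per_step / 1000.0)

STEP_MS = 249.0  # per-step cost reported for both output lengths

short_run = block_diffusion_tps(256, 22, STEP_MS)
long_run = block_diffusion_tps(1024, 34, STEP_MS)
print(f"n=256:  {short_run:.1f} tok/s")
print(f"n=1024: {long_run:.1f} tok/s")
print(f"ratio:  {long_run / short_run:.2f}x")
```

The model lands on about 47 and 121 tok/s, close to the measured 48 and 121. The small miss at n=256 suggests the per-step cost is not perfectly constant, but the approximation is good enough to predict how throughput moves with output length.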

Inference engine readiness · new model evaluation methodology

Evaluating a newly released model now requires testing inference engine support before any optimization work. Gemma 4 12B (2026-06-03) established this pattern: llama.cpp b9245 serves the Q8_0 GGUF cleanly at 30.9 t/s. vLLM 0.21.0 and SGLang 0.5.12 both lack native Gemma4UnifiedForConditionalGeneration support; they fall back to a generic Transformers wrapper that (a) carries 230 MiB of overhead that blocks fp8 single-GPU loading by exactly 114 MiB, and (b) does not support tensor parallelism, making TP=2 fail on weight reshape. The fp8 miss is a near-miss: the model loads to 14.89 GiB on a 15.47 GiB GPU before the profile run OOMs. Native engine support typically lands within weeks of a model release, so the evaluation cadence should be: test llama.cpp first (usually the earliest working path), then revisit vLLM/SGLang after the next PyPI release. The HF model artifacts are already on disk and require only a service restart when engine support ships.

new model · engine readiness · Transformers fallback
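The near-miss can be worked through from the four figures quoted above. The sketch assumes the 114 MiB shortfall is measured at the peak of the profile run and that the full 230 MiB of fallback overhead would disappear with a native implementation; neither assumption is confirmed here.

```python
MIB_PER_GIB = 1024

gpu_limit_mib = 15.47 * MIB_PER_GIB  # usable VRAM per card
loaded_mib = 14.89 * MIB_PER_GIB     # resident before the profile run
shortfall_mib = 114                  # reported OOM margin
fallback_overhead_mib = 230          # reported Transformers-fallback overhead

free_after_load = gpu_limit_mib - loaded_mib
profile_peak_needed = free_after_load + shortfall_mib
spare_without_fallback = fallback_overhead_mib - shortfall_mib

print(f"free after load:        {free_after_load:.0f} MiB")
print(f"profile run needs:      {profile_peak_needed:.0f} MiB")
print(f"spare without fallback: {spare_without_fallback} MiB")
```

Under those assumptions a native implementation would fit with roughly 116 MiB to spare, which is why the model is worth re-testing as soon as engine support ships.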

Pre-Tower Reference — CPU-Only Inference

Historical baseline from cha0tikhome (Beelink EQI12, i5-1235U, 32 GB DDR4). Establishes the starting point before the RTX 5060 Ti tower arrival April 17, 2026. CPU inference ran production workloads for the first phase of the research.

10.44

tok/s CPU baseline

Gemma 4 26B Q5_K_M

4–8 GB

swap usage

chronic; co-located agents

8×

GPU speedup (vs CPU)

10.4 → 83+ t/s

17 days

CPU inference sample period

Apr 2–18, 2026

Metric	CPU Era Value	GPU Era Value	Ratio
Gen throughput	10.44 t/s (avg)	83–117 t/s	8–11×
Throughput range	5.02–11.72 t/s	71–117 t/s	6–23×
Time to first token	616–2,224 ms	<200 ms	3–11×
Swap usage	4.6–7.7 GB chronic	0 GB	eliminated
Model size supported	≤26B (RAM-limited)	≤70B (at Q4)	2.7× parameter scale
Context window	8–16K (swap risk)	128K (fp8 KV)	8–16× context

E-core threading was a key CPU-era finding: on the i5-1235U, the E-cores (cpu_mask=0x3F5) are load-bearing for MoE expert dispatch. Restricting to P-cores only dropped throughput from 10.44 to 7.73 t/s. Optimal: 8 threads, all E+P. This contradicts the common assumption that P-cores dominate inference workloads.
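The affinity mask quoted above can be decoded to see which logical CPUs it selects. The sketch makes no claim about which indices are P-cores and which are E-cores on the i5-1235U, since the page does not give that mapping; it only expands the bits and recomputes the throughput penalty.

```python
def cpus_in_mask(mask):
    """Return the logical CPU indices whose bit is set in an affinity mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

cpus = cpus_in_mask(0x3F5)
print(cpus, len(cpus))

# Throughput drop when restricted to P-cores only (10.44 -> 7.73 t/s).
penalty = (10.44 - 7.73) / 10.44
print(f"P-core-only penalty: {penalty:.0%}")
```

The mask expands to eight logical CPUs, matching the eight-thread optimum. On Linux the same list can be applied to the current process with `os.sched_setaffinity(0, cpus)`.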

Apple Silicon Baseline — MacBook Air M4 (Metal)

Reference benchmark for consumer Apple Silicon. MacBook Air M4, 16 GB unified memory, llama.cpp b200 (Homebrew) via Metal backend. Establishes the ARM/Metal data point for comparing inference efficiency across hardware tiers. Model: Qwen2.5-7B-Instruct Q4_K_M (4.36 GiB, llama-bench standard test). Run: 2026-05-22.

207

tok/s prefill (pp512)

MTL,BLAS · Flash Attn on

21.4

tok/s generation (tg128)

t=1 · -ngl 99 (all layers GPU)

3.4×

tower advantage (tg)

21.4 → 73 t/s (RTX 5060 Ti)

Apple M4

GPU family

MTLGPUFamilyApple9 · 16 GB unified

Test	M4 MacBook Air	RTX 5060 Ti (tower)	Ratio
pp512	207.39 ± 3.61 t/s	~900–1,100 t/s	4–5×
tg128	21.37 ± 0.13 t/s	73 t/s (dense Q4)	3.4×
Model	Qwen2.5-7B Q4_K_M	Nemotron 30B Q4_K_M	—
Memory	16 GB unified	16 GB GDDR7 (single GPU)	—
Backend	Metal (MTL,BLAS)	CUDA 13.0.3	—
Tensor ops	disabled (pre-M5)	full CUDA graph	—
Power envelope	~15 W TDP	~165 W TDP	11× difference
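The last row invites a throughput-per-watt comparison. The sketch divides the tg128 figures by the TDP values in the table; TDP is a rating rather than measured draw, and the two columns run different models, so the result is indicative at best.

```python
ROWS = {
    "M4 MacBook Air": {"tg_tps": 21.37, "tdp_w": 15.0},
    "RTX 5060 Ti (tower)": {"tg_tps": 73.0, "tdp_w": 165.0},
}

def tokens_per_watt(row):
    """Generation throughput divided by rated TDP."""
    return row["tg_tps"] / row["tdp_w"]

efficiency = {name: tokens_per_watt(row) for name, row in ROWS.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} tok/s per W")
```

On these figures the M4 delivers about three times the generation throughput per rated watt, while the tower delivers 3.4× the absolute throughput.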

Metal tensor API constraint: llama.cpp logs tensor API disabled for pre-M5 and pre-A19 devices on M4. The Metal tensor execution path (introduced for M5/A19) is not available — Metal Shader Language kernels run the standard path. This is the primary reason M4 generation throughput tops out at ~21 t/s for a 7B dense model, vs the M5 which unlocks the tensor path. Upgrading to an M5 MacBook would push this to ~30+ t/s on the same model. The prefill rate (207 t/s) reflects Metal compute shader efficiency, which is not blocked by this constraint.

Apple Silicon — Mac mini M4, MLX (2026-07-16)

Second Apple Silicon data point, on a different machine and framework than the MacBook Air baseline above: Mac mini M4, 16 GB unified memory, MLX (not llama.cpp). MLX is Apple's own framework and is often reported to beat llama.cpp's Metal backend on M-series unified memory, although the MacBook Air sweep above measured it about 3× slower on prefill. Quant hunt across four candidates on the same 15-scenario domain-expert suite (logistics/traffic/restaurants) used elsewhere on this page, judged directly against the suite's rubrics.

18.66

tok/s — Ornith-1.0-9B-4bit

live default · smallest model tested

3.00

domain-suite quality (/5)

Ornith-9B · beats both 27B candidates below

20.33

tok/s — Bonsai-27B, 1.125 bpw

fastest of the four · lowest quality (2.47)

2 of 3

failures shared identically

both Bonsai builds, independent of bit-width

Model	Recipe	Quality (/5)	Speed	Verdict
genesis (tower, reference)	Qwen3.6-27B GPTQ INT4	3.73	~97 t/s (dedicated GPU)	same base as Bonsai, light quant — clean reference point
Ornith-1.0-9B-4bit	MLX uniform 4-bit, 9B	3.00	18.66 t/s	live default
Ternary-Bonsai-27B	Prism ML ternary, 1.71 bpw	2.93	12.16 t/s	not adopted
Bonsai-27B	Prism ML binary, 1.125 bpw	2.47	20.33 t/s	not adopted despite raw speed

Fluent-and-wrong, not slow-or-dumb, is the risk that mattered here. Both Bonsai builds fail the same two scenarios identically regardless of bit-width — a rail-crossing signal/roundabout trap, and Hours-of-Service clock math where both run out of token budget mid-reasoning with no answer, ever. Genesis, sharing Bonsai's exact base model but quantized via GPTQ INT4 instead of Prism ML's ternary/binary method, clears both cleanly — pointing at the compression method, not the shared parent's training. A third scenario (an unconditional written zero-cross-contact guarantee for a severe allergy, worse in the higher-bit build than the lower-bit one) is weaker evidence: Genesis stumbles there too, so it reads as a hard trap in the suite rather than a Bonsai-specific defect. Neither Bonsai build goes live on speed alone.

Data & Artifacts

All benchmark data is public. Experiment logs, TSV results files, autoresearch scripts, and optimal configuration shell scripts are in the inference-research repository.

Artifact	Location	Contents
Mac mini quant hunt receipt	model-eval/results/domain-suite-mini-quant-hunt-20260716.md	Full scorecard + failure-mode analysis, Ornith-9B vs Bonsai-27B ternary/binary vs genesis reference
inference-research repo	randomchaos7800-hub/inference-research	All autoresearch scripts, TSV results, optimal config scripts, research logs, incident reports
autoresearch scripts	`autoresearch-*.py`	9 LLM-driven optimization loops — vLLM (27B, AWQ, AEON, GLM4), llama.cpp (70B, SuperGemma, MoE, Beelink)
M4 autoresearch log	`autoresearch-qwen25-7b-m4-log.md`	Full 25-experiment sweep log: thread sweep, flash-attn, MLX vs llama.cpp, Qwen3-8B comparison, KV cache sweep, thermal findings
M4 results TSV	`autoresearch-qwen25-7b-m4-results.tsv`	25-row experiment table: config, pp512 t/s, tg128 t/s, notes — Apple M4 MacBook Air 2026-05-22
TSV results	`autoresearch-*-results.tsv`	Per-experiment: variable, value, tg_median, tg_p90, delta, outcome, description
optimal config scripts	`current-best-flags-*.sh`	Production launch commands for all models with all optimal flags applied — includes M4 Metal config
model queue	`model-queue.md`	Current deployment state, known constraints, tested model inventory
vLLM incident report	`vllm-upgrade-incident-2026-05-10.md`	Full post-mortem: 0.19.2rc1 → 0.20.2 failure on SM_120, NCCL crash, kernel pin, recovery procedure
Research paper	papers.html	Preprint: Commodity Hardware and Persistent AI Companions — Zenodo CC BY 4.0
Full stack specs	stack.html	Hardware configuration, agent infrastructure, service topology