boundary.labs

Inference Benchmarks

Complete record of inference optimization research conducted on consumer NVIDIA Blackwell hardware. Systematic autoresearch loops, controlled baselines, logged experiments. All numbers real, reproducible, and sourced from production hardware.

117 tok/s peak (MoE, llama.cpp)
88 tok/s warm (vLLM genesis INT4 + MTP n=3, 2026-06-06)
210+ experiments logged
12 model families tested

All inference experiments run on the cha0tiktower inference node. Single-machine, consumer-tier Blackwell GPU — no NVLink, PCIe interconnect only. Tensor parallelism over PCIe x8+x4 Gen5/Gen4. Full specs at stack.html.

GPU (×2): RTX 5060 Ti (Blackwell SM_120 · 16 GB GDDR7 each)
VRAM (TP=2 combined): 32 GB (GDDR7 · PCIe interconnect)
Runtime: CUDA 13.0.3 (NVFP4-native · Kernel 6.17)
Inference stacks: vLLM + llama.cpp (vLLM 0.19.2rc1 · llama.cpp b5189)
SM_120 constraint context: NVIDIA Blackwell consumer GPUs (SM_120) have limited ecosystem support compared to datacenter-class A100/H100. This research specifically characterizes the SM_120 performance envelope — fast CUDA kernel paths, NVFP4 quantization behavior, and TP=2 over PCIe — to establish what is achievable on first-generation Blackwell consumer hardware.

Live inference configurations serving all production agent workloads on cha0tiktower as of 2026-05-20. All clients route through tower:8010 local proxy. Switch command: proxy-switch nemotron|deepseek-r1|aeon|openrouter.

Configuration | Quantization | Generation | Context | Framework | VRAM | Status
Nemotron 3 Nano 30B (MoE hybrid) | Q4_K_M | 117.34 t/s med / 117.60 peak | 32K | llama.cpp cuda128-clean | 24.9 GB (both GPUs) | active (default)
Qwen3.6-27B GPTQ INT4 + MTP n=3 (Genesis) | AutoRound INT4 | 88 t/s warm / 63.9 avg | 65K | vLLM 0.21.0 + Genesis patches | ~28 GB (fp8 KV, both GPUs) | active (restored 2026-06-06)
Qwen3.6-27B PRISM-PRO-DQ (llama.cpp, MTP n=2) | DQ (GGUF) | 39.30 t/s | 32K | llama.cpp | ~28 GB | research eval only (2026-05-19)
AEON (Qwen 3-27B NVFP4) | NVFP4 (NVIDIA ModelOpt) | ~69 t/s | 122K | vLLM 0.19.2rc1 | ~20 GB | stopped
Llama 3.3 70B | Q4_0 / 60 GPU layers | 55.6 t/s | 65K | llama.cpp | ~34 GB (OOM ceiling) | historical (VRAM limit)
Gemma 4 26B (CPU baseline) | Q4_K_M (CPU-only) | 10.4–11.7 t/s | 32K | llama.cpp | | pre-tower era
2026-05-20 backend switch: Genesis (Qwen3.6-27B vLLM GPTQ) replaced by Nemotron 3 Nano 30B Q4_K_M via llama.cpp. Nemotron is a Mamba/SSM hybrid MoE architecture — 30B total parameters, ~3B active per token. New peak: 117.60 t/s. The cuda128-clean llama.cpp build is required; the cuda13 build crashes with a CUDA MMQ error (ggml_cuda_mul_mat_q: invalid argument) on first generation after the prompt eval on SM_120. This is a Blackwell SM_120 + cuda13 MMQ kernel incompatibility — not a model issue.
2026-06-06 genesis restoration: vLLM genesis brought back to production alongside Nemotron after resolving three blocking issues: (1) flash_attn.ops module missing from vllm-env — installed flash-attn 2.8.3 with CUDA 13.0 headers; (2) nemotron.service Restart=always kept seizing port 8022 after every stop — fixed with a Restart=no systemd drop-in override; (3) TimeoutStartSec absent from service file — model killed at 90s before completing load. Config upgrades applied: fp8 KV cache (halves VRAM footprint, enables 65K context), qwen3_coder tool-call parser (fixes silent drop bug), explicit SM_120 FlashInfer routing env vars, FlashInfer workspace reduced 413→256 MiB to clear OOM on first MTP inference. Stable at 88 t/s warm. Both Nemotron (117 t/s MoE) and Genesis (88 t/s dense) now available as production backends via proxy:8010.
vLLM 0.20.2 upgrade incident (2026-05-10): The upgrade from 0.19.2rc1 was attempted after the DFlash investigation confirmed that DFlash does not apply to Genesis (Genesis uses self-MTP, not an external draft model). vLLM 0.20.2 failed with an NCCL crash on multi-GPU init; SM_120 TP=2 is unsupported in 0.20.x. Rolled back to 0.19.2rc1 via tarball restore. Complication: a kernel upgrade to 6.17.0-23 during the reboot locked out the NVIDIA driver (modules were only built for 6.17.0-22). Fixed by pinning the GRUB default. Post-recovery genesis throughput: 71.4 t/s (about 12% below the pre-upgrade 80.8 t/s). Re-optimization pending.

Speed is necessary but not sufficient. Two quality suites validate the production backends across capability dimensions.

Qwen3.6-27B INT4 Genesis — 29-Test Suite (2026-06-06)

Comprehensive coding/math/instruction/tool/knowledge suite run against vLLM genesis with thinking disabled (enable_thinking=false) for latency. MTP n=3 speculative decoding active. Throughput averaged across 29 varied prompts.

Overall: 89.7% (26/29), 3 benchmark bugs corrected
Instruction following: 100% (5/5 probes, perfect)
Tool calling: 100% (4/4: select, args, no-op)
Throughput: 63.9 tok/s avg warm (short: 68 · medium: 59.7)
Category pass rates
Instruction Following: 5/5 (100%)
Tool Calling: 4/4 (100%)
Knowledge: 4/4 (100%)
Coding (HumanEval-style): 9/10 (90%)
Math / Reasoning: 4/6 (67%)
Math gap is intentional: Published Qwen3.6-27B benchmarks (AIME 2026: 94.1%, GPQA Diamond: 87.8%) are measured with thinking enabled. Running with enable_thinking=false for production latency sacrifices deep reasoning chains. The 2 remaining math failures (GSM8K arithmetic, sum-of-multiples) are model errors in no-think mode — not hardware or quantization artifacts. For coding, tool calling, and instruction following the model is production-ready.

Nemotron 3 Nano 30B — 25-Probe Suite (2026-05-20)

A curated quality suite validates the current production model across five capability dimensions. Each response scored 1–5 by Claude Haiku acting as judge, using per-probe rubrics with expected outputs. Run against proxy at 117 t/s.

Overall score: 4.84 / 5.00 (25 probes · 5 categories)
Reasoning: 5.00 (all 5 probes perfect)
Factuality: 5.00 (myths debunked · arithmetic correct)
Coding / agent tasks: 4.60 (one token-limit truncation)
Category scores (1–5 scale)
Reasoning: 5.00 / 5
Factuality: 5.00 / 5
Instruction Following: 5.00 / 5
Coding: 4.60 / 5
Agent Tasks: 4.60 / 5
Probe | Category | Score | Result
r1–r5 (all) | Reasoning | 5/5 each | Syllogism, bat/ball puzzle, sequences, fallacy detection, pill timing: all perfect.
f1–f5 (all) | Factuality | 5/5 each | Great Wall myth, 347×28 arithmetic, Canberra capital, 10% brain myth, Berlin Wall year: all correct.
i1–i4 | Instruction Following | 5/5 each | Exact list count, raw JSON, two-sentence summary, forbidden-word avoidance: all perfect.
i5 | Instruction Following | 5/5 | Exact structured format (DIFFICULTY / TIME / STEPS). Transient empty response on first query; re-queried correctly.
c1 | Coding | 3/5 | Vowel-counter function correct; docstring quality below rubric threshold.
c2–c5 | Coding | 5/5 each | Bug identification, palindrome with all normalizations, max() one-liner, Fibonacci explanation: all perfect.
a1–a3, a5 | Agent Tasks | 5/5 each | JSON extraction, classification array, professional status update, graceful error: all correct.
a4 | Agent Tasks | 3/5 | 5-step Python project setup: steps 1–4 correct; step 5 truncated at max_tokens limit (512).
Methodology: 25 hand-crafted probes across 5 categories. Each probe includes expected behavior and a 1–5 scoring rubric. Responses scored by claude-haiku-4-5-20251001 acting as judge. The two 3/5 scores are both partial-credit results (correct output, minor defect) — zero probes failed outright. The one transient empty response (i5) resolved correctly on immediate retry. Source: randomchaos7800-hub/model-eval.
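The category and overall scores follow directly from the per-probe judge scores listed in the table. A minimal sketch of that aggregation, assuming the overall figure is the plain mean over all 25 probes (the dictionary layout and function name are illustrative, not taken from the model-eval repository):

```python
from statistics import mean

# Per-probe judge scores as reported in the probe table (1-5 scale).
scores = {
    "reasoning": [5, 5, 5, 5, 5],
    "factuality": [5, 5, 5, 5, 5],
    "instruction_following": [5, 5, 5, 5, 5],
    "coding": [3, 5, 5, 5, 5],       # c1 scored 3/5
    "agent_tasks": [5, 5, 5, 3, 5],  # a4 scored 3/5
}


def summarize(scores_by_category):
    """Return (per-category means, overall mean across every probe)."""
    category_means = {
        name: mean(values) for name, values in scores_by_category.items()
    }
    all_scores = [s for values in scores_by_category.values() for s in values]
    return category_means, mean(all_scores)


category_means, overall = summarize(scores)
print(category_means["coding"], category_means["agent_tasks"], round(overall, 2))
```

With the two 3/5 results in place, coding and agent tasks each average 4.60 and the overall mean is 4.84, matching the reported figures.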

Each campaign runs an automated LLM-driven hypothesis loop — generate config variant, benchmark, evaluate stopping criterion (+5 t/s improvement threshold), iterate. Full logs and raw TSV data available at github.com/randomchaos7800-hub/inference-research.
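The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: `run_benchmark` is a hypothetical stand-in that fabricates deterministic numbers instead of launching a server, the candidate list is fixed rather than LLM-generated, and the +5 t/s figure is treated as an acceptance threshold for keeping a variant.

```python
import random
from statistics import median

IMPROVEMENT_THRESHOLD = 5.0  # t/s, the threshold quoted in the text


def run_benchmark(config, trials=3):
    """Hypothetical stand-in: returns a synthetic median t/s for a config.

    A real loop would start the inference server with `config` and measure
    generation throughput; the numbers here are invented for illustration.
    """
    rng = random.Random(str(sorted(config.items())))  # deterministic per config
    base = 74.0
    if config.get("VLLM_MARLIN_USE_ATOMIC_ADD") == "1":
        base += 6.25
    return median(base + rng.uniform(-0.5, 0.5) for _ in range(trials))


def autoresearch(baseline_config, candidates, max_experiments=10):
    """Greedy loop: keep a variant only if it beats the current best by the threshold."""
    best_config = dict(baseline_config)
    best_tps = run_benchmark(best_config)
    log = [("baseline", best_tps, 0.0, "BASELINE")]
    for variable, value in candidates[:max_experiments]:
        trial = {**best_config, variable: value}
        tps = run_benchmark(trial)
        delta = tps - best_tps
        if delta >= IMPROVEMENT_THRESHOLD:
            best_config, best_tps, outcome = trial, tps, "IMPROVEMENT"
        else:
            outcome = "NO CHANGE"
        log.append((f"{variable}={value}", tps, delta, outcome))
    return best_config, best_tps, log


candidates = [
    ("VLLM_MARLIN_USE_ATOMIC_ADD", "1"),
    ("OMP_NUM_THREADS", "2"),
    ("NCCL_BUFFSIZE", "8388608"),
]
best_config, best_tps, log = autoresearch({}, candidates)
for name, tps, delta, outcome in log:
    print(f"{name:<32} {tps:6.2f} {delta:+6.2f} {outcome}")
```

Each log row carries the same fields as the published TSV tables (variable, median t/s, delta, outcome), so a real run could be written out in that shape.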

Qwen 3.6-27B Dense — vLLM Passes 1+2
GPTQ-Marlin INT4  ·  TP=2  ·  33 experiments
baseline: 74.35–75.63 t/s
best result: 80.59 t/s
delta: +8.4%
key technique: Marlin atomic-add
VLLM_MARLIN_USE_ATOMIC_ADD=1 adds +6.25 t/s — enables atomic-add reduce in gptq_marlin kernel for small-n decode on TP=2. Single env var, zero model change. KV dtype=auto (fp16) eliminated per-layer dequant overhead (+4.95 t/s in Pass 1).
GPTQ-Marlin · TP=2
SuperGemma4-26B — llama.cpp Single GPU
Gemma 4 26B  ·  RTX 5060 Ti ×1  ·  23 experiments
baseline: 61.7 t/s
best result: 71.1 t/s
delta: +15.2%
key technique: CPU layer offload + threading
Reducing CPU layers 10→4, threads 8→4, parallel=2 at E-core affinity (mask 0x3F5) extracted maximum from Alder Lake heterogeneous core topology. SWA cache contention was the primary bottleneck — eliminating CPU-side routing resolved it.
SWA cache · E-core affinity
AEON — Qwen 3-27B NVFP4
NVIDIA ModelOpt NVFP4  ·  vLLM  ·  14 experiments
baseline: 68.86 t/s
best result: 69.75 t/s
delta: +1.3%
key technique: batch=8192 (marginal)
NVFP4 is Blackwell-native quantization (CUDA 12.8+ required) — 4× smaller KV footprint vs bf16. At this quantization level the baseline is already near-optimal on SM_120. 4 of 14 runs hit timeout at extreme batch configs. Model runs cleanly at 122K context.
NVFP4 · SM_120-native
Qwen 3.6-27B — Context Scaling (Pass 3)
32K → 128K context  ·  18 experiments
baseline: 80.65 t/s @ 32K
at 128K context: 80.04 t/s
degradation: −0.8%
context headroom: 4× expansion
fp8 KV cache enables 128K context on 32 GB VRAM with less than 1% throughput degradation. The KV footprint scales linearly with context; fp8 halves the cost vs bf16, making long-context workloads viable without throughput sacrifice.
fp8 KV · 128K context
Qwen 3.6-35B-A3B MoE — llama.cpp Dual GPU
131K context  ·  TP=2  ·  22 experiments
baseline: 100.24 t/s
best result: 100.24 t/s
delta: 0%
finding: GDN ceiling
22 experiments, zero improvements. MoE at 100 t/s on TP=2 has hit the hardware ceiling — expert routing serializes over PCIe, and no config adjustment can eliminate that bottleneck. The GDN (gating-dispatch-normalization) loop is sequential. This is a fundamental PCIe-vs-NVLink constraint finding.
MoE routing · PCIe bottleneck
GLM-4.7-Flash — vLLM Dual GPU
GLM-4.7B Flash variant  ·  TP=2  ·  ~15 experiments
baseline: 95.90 t/s
best result: 97.07 t/s
delta: +1.2%
key technique: q5_0 KV quantization
q5_0 KV outperformed both q4_0 and f16 — an atypical result. At this model scale the KV bandwidth is the binding constraint, and q5_0 hits the optimal precision-vs-compression tradeoff. Confirms that KV dtype tuning is model-architecture dependent, not universally monotonic.
KV quant · Flash attention
Llama 3.3 70B — llama.cpp Dual GPU
Q4_0  ·  TP=2  ·  10 experiments
baseline: 55.6 t/s
best result: 55.6 t/s
experiments failed: 10/10 OOM
hardware limit: VRAM ceiling
70B at Q4_0 needs ~34 GB VRAM once Flash attention overhead is included, about 2 GB over the 32 GB combined ceiling, so it runs only with a partial GPU split (60 GPU layers). Every optimization attempt that touched context size or cache dtype triggered OOM. Baseline 55.6 t/s is usable but unoptimizable on this hardware tier. 70B+ requires more VRAM or quantization below Q4.
VRAM limit · 70B boundary
Gemma4 26B AWQ — vLLM Single GPU
AWQ 4-bit  ·  RTX 5060 Ti ×1  ·  5 experiments
baseline latency: 32.77 ms/tok
best result: 32.77 ms/tok
timeouts: 2/5
outcome: discontinued
AWQ was tested as an alternative quantization path to GPTQ-Marlin on SM_120. 2 of 5 runs hit timeout at extreme batch configs. Remaining runs returned neutral outcomes with no improvement path visible. GPTQ-Marlin was confirmed as the superior quantization approach for Blackwell.
AWQ · vs Marlin
Nemotron 3 Nano 30B — llama.cpp Format Trials
NVIDIA MoE hybrid  ·  30B total / ~3B active  ·  2026-05-20
winner: Q4_K_M, llama.cpp cuda128-clean
median: 117.34 t/s
peak: 117.60 t/s
smoke tests: 4/4 pass
VRAM: 24.9 GB (11.8 + 13.1 GB)
prefill: 1,200–2,500 t/s
Q4_K_M via llama.cpp cuda128-clean is the confirmed path. The cuda13 build fails on SM_120: CUDA MMQ kernel crash (ggml_cuda_mul_mat_q: invalid argument) after the first inference, every time. The vLLM FP8 TP=2 path also failed — wrong flag (--disable-log-requests removed in newer vLLM) and NCCL engine init failure. Q4_K_M at 117 t/s is the new production peak; current default backend via proxy :8010.
Nemotron · llama.cpp · SM_120
Qwen3 Quantization Shootout — 5 Formats
vLLM 0.21.0 + Genesis patches  ·  TP=2  ·  2026-05-16
genesis baseline: 73.35 t/s (27B INT4 + MTP n=3)
Qwen3-14B FP8: 43.54 t/s (clean load)
Qwen3-30B-A3B GPTQ: 41.33 t/s (cpu-offload req'd)
Qwen3-32B GPTQ dense: 5.25 t/s (PCIe ceiling)
NVFP4 (×2 models): was blocked (CUDA 13.0.3 now installed)
The dense 32B vs MoE 30B comparison is the key finding. Both models are 16 GB on disk; both require --cpu-offload-gb 1.0 to initialize (Marlin repacking peak creates ~200 MiB initialization overage). But throughput differs by 8×: 5.25 vs 41.33 t/s. Dense 32B: all 32B weights accessed per token, PCIe bus is saturated. MoE 30B-A3B: only ~3B active params per token, PCIe pressure is ~10× lower. NVFP4 was blocked by FlashInfer's CUTLASS SM_120 FP4 kernels requiring CUDA ≥ 12.9; this system ran 12.8 at the time. System upgraded to CUDA 13.0.3 on 2026-06-06 — CUDA version barrier cleared; Qwen3 NVFP4 on SM_120 is now a candidate for re-evaluation.
FP8 · GPTQ-Marlin · MoE vs Dense · NVFP4 blocked
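The dense-versus-MoE gap can be put in rough numbers by counting the weight bytes touched per decoded token. The sketch below is a back-of-envelope estimate only: it assumes a flat 4 bits per weight, ignores quantization scales, KV cache, and activations, and does not model which of those bytes actually cross PCIe.

```python
def weight_bytes_per_token(active_params_billion, bits_per_weight=4.0):
    """Approximate bytes of weights read to decode one token."""
    return active_params_billion * 1e9 * bits_per_weight / 8


dense_32b = weight_bytes_per_token(32.0)  # dense: every weight is touched
moe_a3b = weight_bytes_per_token(3.0)     # MoE 30B-A3B: ~3B active parameters
ratio = dense_32b / moe_a3b

print(f"dense 32B: {dense_32b / 1e9:.1f} GB/token")
print(f"MoE A3B:   {moe_a3b / 1e9:.1f} GB/token")
print(f"ratio:     {ratio:.1f}x")
```

The estimate lands near the ~10× pressure difference quoted above. The measured throughput gap (5.25 vs 41.33 t/s) is closer to 8×, which is plausible given everything this estimate leaves out.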
PRISM-PRO 27B — MTP Speculation Sweep
Qwen3.6-27B-PRISM-PRO-DQ  ·  llama.cpp  ·  15 experiments  ·  2026-05-19
baseline (MTP n=1): 34.78 t/s
winner (MTP n=2): 39.30 t/s (+13.0%)
no MTP (sanity): 26.25 t/s (−24.5%)
threads=4: 20.94 t/s (−39.8%)
flash-attn: FAIL (server timeout 120s)
MTP n=2 is the ceiling for this model: n=3 gives only 35.13 t/s (+0.35 vs baseline), likely because the third speculative token is rejected too often to pay for itself. The no-MTP sanity check confirms MTP contributes ~8.5 t/s to the baseline score. Flash-attn failed to load — server did not come up within 120s, consistent with the SWA hybrid memory architecture (SSM layers conflict with standard flash-attn kernel path). Wall-clock throughput (37.82 t/s at MTP n=2) is close to token-gen throughput because this model's SSM decode path is not fill-dominated.
MTP · speculative decoding · SSM hybrid
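The diminishing return from deeper speculation can be illustrated with the standard expected-length formula for speculative decoding. This is a simplified model, not a measurement from this campaign: it assumes every draft token is accepted independently with the same probability, and the 0.6 acceptance rate is a hypothetical value chosen only to show the shape of the curve.

```python
def expected_tokens_per_step(accept_prob, n_draft):
    """Expected tokens emitted per verification step with n_draft draft tokens.

    Assumes each draft token is accepted independently with probability
    accept_prob and that the first rejection discards the rest of the draft.
    The verifier always contributes one token, so the minimum is 1.0.
    """
    if accept_prob >= 1.0:
        return n_draft + 1.0
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


ACCEPT_PROB = 0.6  # hypothetical acceptance rate, not measured here

previous = 1.0
for n in (1, 2, 3):
    tokens = expected_tokens_per_step(ACCEPT_PROB, n)
    print(f"n={n}: {tokens:.3f} tokens/step (gain {tokens - previous:+.3f})")
    previous = tokens
```

Each extra draft token adds less than the one before it, while its compute cost stays roughly the same. That is consistent with n=2 winning and n=3 giving almost nothing back in the sweep above.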
Qwen2.5-7B-Instruct Q4_K_M — MacBook Air M4
Metal backend (MTL,BLAS)  ·  llama.cpp 9270  ·  25 experiments  ·  2026-05-22
baseline (cold): 209.79 pp / 20.91 tg
best tg (t=1, warm): 21.10 t/s
MLX pp throughput: 70.44 t/s, rejected (3× slower)
thermal cost (20 min): −27% pp (209 → 152 t/s)
t=1 is monotonically optimal with full GPU offload (-ngl 99): GPU handles all matrix math, extra CPU threads only add unified memory bus contention. Flash-attn on: +9.2% pp, +10.2% tg. MLX rejected despite being Apple-native — pp 3× slower than llama.cpp, marginal tg gain not worth it. Qwen3-8B (MLX) rejected: 80% slower pp, 12% slower tg. Metal flash-attn V-cache quantization hard-blocked (V quant incompatible with Metal flash-attn kernel path). No source build tested — native M4 mcpu flag would likely recover 15-25%. Winning config: -ngl 99 --flash-attn on -t 1
Metal · Apple Silicon · MLX vs llama.cpp
Nemotron 3 Nano 30B — Parameter Sweep
Q4_K_M  ·  llama.cpp cuda128-clean  ·  16 experiments  ·  2026-05-21
baseline: 123.1 t/s
winner: 123.6 t/s (ctx=65536)
delta: +0.4% / free win
threads=16: 77 t/s, stddev 30 (unstable)
flash-attn on: 123.6 t/s (no gain)
16-experiment sweep across ctx-size, threads, cache-ram, and flash-attn. ctx=65536 (2× baseline) matches baseline throughput while doubling available context — free win, deployed to production. threads=16 caused severe instability (stddev 30, worst 45 t/s) due to Mamba SSM's sequential state dependency conflicting with hyperthread contention. flash-attn has no effect on Mamba hybrid architecture — the attention-free SSM decode path is unaffected. Model is at ceiling for this hardware tier.
Mamba hybrid · MoE · SSM
Qwen3.6-27B Genesis — Production Restoration
vLLM 0.21.0 + Genesis patches  ·  fp8 KV + MTP n=3  ·  2026-06-06
prev best (pre-incident): 80.8 t/s
restored warm throughput: 88 t/s
avg across 29 benchmarks: 63.9 t/s
quality (no-think): 89.7% (26/29)
context window: 65K (fp8 KV)
Three blockers cleared: flash_attn 2.8.3 installed (cuda13 headers), nemotron.service port seizure stopped (Restart=no drop-in), TimeoutStartSec=300 added to prevent 90s systemd kill during model load. Config hardened: fp8 KV cache, qwen3_coder tool parser (fixes silent tool-drop bug), SM_120 FlashInfer routing env vars, FlashInfer workspace 413→256 MiB (OOM fix on first MTP inference). Benchmark data: 100% instruction following, 100% tool calling, 90% coding. Math at 67% without thinking — expected; published numbers require <think> chains.
genesis · fp8 KV · MTP n=3 · GPTQ-Marlin
Gemma 4 12B IT — New Model Evaluation
Q8_0 GGUF (12.7 GB)  ·  llama.cpp + vLLM + SGLang  ·  19 experiments  ·  2026-06-03
baseline (llama.cpp): 30.9 t/s
best result: 30.9 t/s
delta (19 experiments): 0% (fully flat)
split-mode row: −10%
split-mode tensor: CRASH (b9245 bug)
vLLM 0.21.0: OOM (114 MiB short)
SGLang 0.5.12: arch not recognized
Dense 12B at Q8_0 is pure VRAM bandwidth bound. 19 autoresearch experiments across split modes, threads 8–20, ubatch 512–4096, and KV quant measured flat at 30.9 t/s ±0.2. split-mode tensor crashes on Gemma 4 in llama.cpp b9245; row split is −10% from PCIe sync overhead. vLLM 0.21.0 and SGLang 0.5.12 have no native Gemma4UnifiedForConditionalGeneration implementation — both fall to the Transformers fallback, which carries ~230 MiB overhead that pushes fp8 single-GPU loading over the 15.47 GiB GPU limit by exactly 114 MiB. TP=2 via Transformers fallback also crashes (tensor reshape failure after weight sharding). llama.cpp handles it cleanly at 30.9 t/s. That is the ceiling until native engine support ships.
new model · bandwidth bound · ecosystem gap
DiffusionGemma 26B A4B IT — Block Diffusion Debut
Google DeepMind  ·  Q4_K_M GGUF (16 GB)  ·  llama.cpp PR #24423  ·  2026-06-10
baseline (n=256, dual GPU): 41 tok/s · 248 ms/step · 25 steps
kv_cache on (best flag): 48 tok/s · 22 steps (+17%)
n=1024 + kv + alg2: 121 tok/s (canvas scaling)
algorithm sweep (alg0–4): margin-based ≈ confidence-based
single GPU: blocked (Q4KM 300 MB over 16 GB limit)
Block-diffusion architecture: iterative canvas denoising over 256-token blocks in parallel — not autoregressive. Requires llama.cpp PR #24423 (not in main; current master returns unknown model architecture: diffusion-gemma). Build target: CUDA 12.8 / sm_120 cuda128-clean. Key flag: --diffusion-kv-cache on is auto-disabled on 2-GPU splits as a conservative default — forcing it on yields +17% and reduces step count from 25 → 22 with no quality degradation observed. The architecture's native strength is longer outputs: per-step cost is nearly constant regardless of canvas size, so n=1024 delivers 2.5× the throughput of n=256. Single-GPU blocked by 300 MB headroom gap (Q4KM needs 16.0 GB, card has 15.7 GB free) — would unlock Flash Attention + KV together for an estimated further 20–30% gain.
block diffusion · new architecture · canvas scaling
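Because per-step cost is roughly constant, block-diffusion throughput reduces to canvas tokens divided by total step time. The sketch below reproduces two of the reported figures from their step counts and step times; the 34-step, 249 ms/step values for n=1024 are the ones quoted in the findings section of this page, and the function name is illustrative.

```python
def diffusion_throughput(canvas_tokens, steps, ms_per_step):
    """Tokens per second when `steps` denoising passes produce `canvas_tokens`."""
    total_seconds = steps * ms_per_step / 1000.0
    return canvas_tokens / total_seconds


baseline = diffusion_throughput(256, steps=25, ms_per_step=248)
long_canvas = diffusion_throughput(1024, steps=34, ms_per_step=249)

print(f"n=256,  25 steps: {baseline:.1f} tok/s")
print(f"n=1024, 34 steps: {long_canvas:.1f} tok/s")
```

Quadrupling the canvas adds only 9 steps, which is why throughput climbs with output length and why a tok/s figure for this architecture needs the output length stated alongside it.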
Nemotron 3 Nano 30B NVFP4 — TRT-LLM Benchmark
NVFP4  ·  TensorRT-LLM 1.2.1  ·  hardware-limited  ·  2026-05-21
llama.cpp baseline: 123.6 t/s
TRT-LLM result: DID NOT RUN (OOM)
model weights / GPU: 14.47 GB of 15.47 GB
headroom after load: 17 MB (need 20 MB)
NVFP4 via TRT-LLM 1.2.1 is hardware-limited on 2× RTX 5060 Ti 16GB. Standard TP=2 fails because Mamba/SSM layers cannot be tensor-parallelised — they replicate on each GPU rather than split. The correct approach (NVIDIA's official ep/bmm sharding via trtllm-serve --backend _autodeploy) loads 14.47 GB of weights onto GPU 0 (Mamba layers replicated + sharded experts), leaving only 17 MB headroom. TRT-LLM needs 20 MB to complete initialization. 3 MB gap, no configuration fix possible. Requires ≥20 GB VRAM per card.
NVFP4 · TRT-LLM · hardware-limited

Generation throughput improvement from baseline across all optimization campaigns. Bars normalized to the largest observed delta (+17% DiffusionGemma 26B kv-cache flag). Negative deltas not shown; failed/OOM campaigns listed separately.

Best improvement delta per campaign (%)
DiffusionGemma 26B (kv-cache flag): +17%
SuperGemma4-26B (llama.cpp): +15.2%
PRISM-PRO 27B (llama.cpp MTP): +13.0%
Qwen 3.6-27B Pass 2 (vLLM): +8.4%
Genesis restoration (config hardening): +8% vs prior
Qwen 3.6-27B Pass 1 (vLLM): +6.7%
AEON NVFP4 (vLLM): +1.3%
GLM-4.7-Flash (vLLM): +1.2%
Qwen MoE 35B (llama.cpp): 0%
Llama 3.3 70B (llama.cpp): OOM ×10
Gemma 4 12B (19 experiments): 0% (flat)

Full Experiment Log — Pass 2 Detail (Qwen 3.6-27B vLLM)

Complete 13-experiment run showing exact variables, deltas, and outcomes. Representative of methodology applied across all campaigns.

# | Variable | Value | t/s (median) | Delta | Outcome
0 | Baseline (Pass 1 canonical, fp16 KV) | | 74.35 | ±0.00 | BASELINE
1 | VLLM_MARLIN_USE_ATOMIC_ADD | 1 | 80.59 | +6.25 | IMPROVEMENT
2 | VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE | 1073741824 | 73.57 | −0.78 | NO CHANGE
3 | VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE | 805306368 | 74.92 | +0.57 | NO CHANGE
4 | OMP_NUM_THREADS | 2 | 80.14 | −0.46 | NO CHANGE
5 | NCCL_BUFFSIZE | 8388608 | 80.99 | +0.40 | MARGINAL
6 | NCCL_BUFFSIZE | 16777216 | 81.65 | +1.06 | MARGINAL
7 | PYTORCH_MAX_SPLIT_MB | 64 | 81.18 | +0.58 | MARGINAL
8 | PYTORCH_MAX_SPLIT_MB | 128 | 74.44 | −6.15 | NO CHANGE
9 | OMP4 + exclusive buffer (stacked) | combined | 80.54 | −0.06 | NO CHANGE
10 | OMP4 + exclusive + GMU 0.88 (triple stack) | combined | 79.15 | −1.45 | NO CHANGE
11 | ctx 16K + GMU 0.88 | combined | 79.11 | −1.48 | NO CHANGE
12 | Marlin atomic + OMP4 + exclusive | combined | 80.02 | −0.58 | NO CHANGE
final | Best config re-verified | | 80.48 | +6.13 | FINAL
Delta reference: the logged deltas for experiments 1–3 and the final row appear to be measured against the 74.35 t/s baseline, while experiments 4–12 appear to be measured against the 80.59 t/s atomic-add result from experiment 1.

Operational findings specific to NVIDIA Blackwell SM_120 consumer GPU inference. These characterize the platform envelope for first-generation Blackwell consumer hardware — a dataset that is sparse in the current literature as most published research covers datacenter-class (H100, A100) or older consumer generations (Ampere, Ada).

CUDA kernel paths: SM_120 fast path
Fast CUDA kernel paths on SM_120 exist for q4_0 and f16 KV cache only. q8_0, q5_0, and iq4_nl generally fall back to unoptimized paths with 15–30% throughput loss. The one measured exception is GLM-4.7-Flash, where q5_0 KV edged out both q4_0 and f16, so the KV dtype is worth re-checking per model. This is a first-generation Blackwell constraint; future driver updates may expand coverage. Choosing the wrong KV dtype is one of the most common configuration errors on this platform.
SM_120 · KV dtype selection
NVFP4 quantization: Blackwell-native format
NVFP4 (NVIDIA ModelOpt format) is the Blackwell-native quantization standard. It provides 4× smaller KV footprint vs bf16 and enables 122K+ context on 32 GB VRAM. Benchmark result: 69 t/s on AEON (Qwen 3-27B NVFP4) running cleanly on CUDA 12.8 via vLLM 0.19.2rc1. Optimization headroom is minimal because the baseline is already near-hardware-optimal.
Previous blocker (May 2026): Qwen3-series NVFP4 models (30B-A3B, 32B) were blocked because they require FlashInfer's CUTLASS SM_120 FP4 GEMM kernels, which need CUDA ≥ 12.9, and the system ran CUDA 12.8 at the time. CUDA 13.0 appeared installed but genesis vLLM TP=2 init was crashing (later traced to flash_attn missing, not CUDA version).
Status update (2026-06-06): System upgraded to CUDA 13.0.3 and genesis is now stable. The CUDA version barrier for NVFP4 is cleared. Qwen3-series NVFP4 models are candidates for re-evaluation on this hardware.
NVFP4 · ModelOpt · 122K context · CUDA 13.0.3
llama.cpp build target: cuda128-clean required on SM_120
On SM_120 (Blackwell first-gen consumer), llama.cpp must be built with the cuda128-clean target, not cuda13. The cuda13 build crashes with ggml_cuda_mul_mat_q: invalid argument on the first generation call after prompt eval, on every model tested (Nemotron Q4_K_M, Qwen3 GGUF variants). This is a CUDA MMQ kernel incompatibility between the cuda13 build's quantized matrix-multiply path and SM_120 hardware, not a model issue. The cuda128-clean build routes around this kernel entirely and runs stably. Confirmed 2026-05-20 during Nemotron format trials. Anyone deploying llama.cpp GGUF inference on Blackwell consumer hardware should verify their build target before assuming the model is the problem.
SM_120 · llama.cpp · build target
GPTQ-Marlin kernel: atomic-add optimization
VLLM_MARLIN_USE_ATOMIC_ADD=1 enables atomic-add reduce in the gptq_marlin kernel for small-n decode on TP=2 configurations. This produced the largest single-variable improvement in the research: +6.25 t/s (8.4%) on Qwen 3.6-27B without any model or architecture change. The improvement is specific to GPTQ-quantized models running TP=2 over PCIe; it is not applicable to NVFP4 or unquantized deployments.
GPTQ-Marlin · TP=2
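Since the change is a single environment variable, it can be applied at launch time without touching the model. A minimal sketch of a launcher that sets it: the model path is a hypothetical placeholder, and the production launch scripts in the repository carry more flags than are shown here.

```python
import os
import shlex


def build_vllm_launch(model_path, tensor_parallel_size=2):
    """Return (env, cmd) for a vLLM server with Marlin atomic-add enabled."""
    env = dict(os.environ)
    env["VLLM_MARLIN_USE_ATOMIC_ADD"] = "1"  # the single-variable change
    cmd = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    return env, cmd


# Hypothetical model path, used only for illustration.
env, cmd = build_vllm_launch("/models/qwen3.6-27b-gptq-int4")
print("VLLM_MARLIN_USE_ATOMIC_ADD=" + env["VLLM_MARLIN_USE_ATOMIC_ADD"], shlex.join(cmd))
```

Passing the pair to `subprocess.run(cmd, env=env)` would start the server. The sketch only builds and prints the command so it can be inspected first.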
Tensor parallelism over PCIe: no NVLink constraint
TP=2 over PCIe (x8 Gen5 + x4 Gen4 asymmetric config) delivers usable multi-GPU inference at 107 t/s peak on Qwen 3.6-35B MoE (April 2026 vLLM campaign). The bandwidth constraint is visible: MoE expert routing over PCIe creates a GDN bottleneck that 22 optimization experiments could not overcome. This is a fundamental PCIe-vs-NVLink architectural boundary, not a configuration issue. Dense model TP=2 scales more predictably without the MoE expert dispatch penalty. Note (2026-05-20): system peak is now 117.60 t/s via Nemotron 3 Nano 30B Q4_K_M running on llama.cpp cuda128-clean across both GPUs, which is a different inference stack (GGUF tensor split vs vLLM TP=2) on the same PCIe hardware.
TP=2 · PCIe bottleneck · MoE routing
vLLM version pinning: 0.20.x incompatible
vLLM 0.20.2 fails on SM_120 TP=2 with an NCCL crash at multi-GPU initialization. SM_120 multi-GPU support was not present in 0.20.x as of May 2026. Operating on 0.19.2rc1 (Genesis patches), which includes SM_120-specific fixes. The kernel 6.17.0-23 upgrade required manual GRUB pinning during the incident. The 0.20 incident resulted in about 12% throughput degradation (80.8 → 71.4 t/s) pending a full re-optimization pass. This documents a real operational risk for anyone deploying Blackwell consumer GPUs with multi-GPU vLLM.
SM_120 · vLLM pinning
Context scaling: fp8 KV enables 128K
fp8 KV cache on Qwen 3.6-27B enables a 128K context window with <1% throughput degradation (80.65 t/s at 32K → 80.04 t/s at 128K, −0.8%). This is the most cost-effective capability expansion found in the research: 4× context length at negligible throughput cost. KV footprint scales linearly with context; fp8 halves the cost vs bf16. At 32 GB VRAM, 128K is the practical ceiling without further quantization of the KV cache itself.
fp8 KV · 128K context
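The two scaling claims (linear in context, halved by fp8) fall out of the usual KV cache size formula. The sketch below uses placeholder architecture values: the layer count, KV head count, and head dimension are hypothetical and are not Qwen 3.6-27B's actual configuration, so only the ratios are meaningful.

```python
def kv_cache_bytes(context_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_element):
    """Bytes needed to hold keys and values for one sequence."""
    # The factor of 2 covers the separate key and value tensors.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens


# Hypothetical architecture values, chosen only for illustration.
ARCH = dict(num_layers=48, num_kv_heads=4, head_dim=128)
GIB = 1024 ** 3

for context in (32 * 1024, 128 * 1024):
    bf16 = kv_cache_bytes(context, bytes_per_element=2, **ARCH)
    fp8 = kv_cache_bytes(context, bytes_per_element=1, **ARCH)
    print(f"{context // 1024:>4}K context: bf16 {bf16 / GIB:5.2f} GiB, "
          f"fp8 {fp8 / GIB:5.2f} GiB")
```

Whatever the real dimensions are, the 128K row costs 4× the 32K row and the fp8 column is half the bf16 column, which is the headroom that makes the longer context fit.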
Apple Silicon / Metal: pre-M5 inference constraints
With full GPU offload (-ngl 99) on M4, t=1 is strictly optimal: extra CPU threads fight over unified memory bandwidth without contributing to GPU compute. Flash attention is supported on M4 Metal (MTLGPUFamilyApple9) and gives +9–10% on both pp and tg. Metal tensor API is disabled on pre-M5 hardware: llama.cpp logs tensor API disabled for pre-M5 and pre-A19 devices, which blocks the fast tensor execution path and caps generation at ~21 t/s on 7B dense models. MLX is not a drop-in replacement: despite being Apple-native, it is 3× slower on prefill than llama.cpp Metal for Q4_K_M weights, and its marginal generation advantage does not compensate. Fanless thermal throttle is a hard constraint: sustained benchmarking degrades pp from 209 t/s (cold) to 152 t/s (warm, 20 min), a 27% loss. First inference session each day is significantly faster than sustained use. KV V-cache quantization is blocked on Metal flash-attn: any V-quantized configuration fails completely; K-only quantization is marginal (+0 to +3%). Established 2026-05-22 on MacBook Air M4, llama.cpp 9270 (Homebrew), macOS 15.
Metal · Apple Silicon · fanless thermal · pre-M5
70B hardware boundary: 32 GB VRAM limit
Llama 3.3 70B at Q4_0 requires ~34 GB VRAM, 2 GB over the 32 GB TP=2 ceiling when Flash attention overhead is included. The model loads and runs at 55.6 t/s with the current split but is unoptimizable: all 10 experiments that adjusted context size or KV dtype triggered OOM. This characterizes the practical 70B boundary for dual RTX 5060 Ti deployment. Sub-Q4 quantization (Q3_K, Q2_K) would fit but at significant quality cost.
70B boundary · VRAM ceiling
vLLM genesis, SM_120 operational constraints: flash_attn, service isolation, timeout
Three non-obvious requirements for stable vLLM genesis operation on SM_120 (Blackwell consumer):
(1) flash_attn must be explicitly installed in the vllm virtual environment; it is not pulled in as a vLLM dependency on CUDA 13.0 builds. Missing it crashes with ModuleNotFoundError: No module named 'flash_attn.ops'. The build requires CUDA_HOME=/usr/local/cuda-13.0 and --no-build-isolation; flash-attn 2.8.3 supports torch 2.11.0+cu130.
(2) Port isolation is mandatory: any co-resident inference service (llama.cpp, another vLLM) must have Restart=no in its systemd unit. A service with Restart=always will re-claim the inference port within seconds of a genesis stop, causing a phantom "address in use" crash loop that looks like a genesis bug.
(3) TimeoutStartSec=300 is required in the systemd unit file. The default (90s) is shorter than the model load time, so systemd kills the process mid-load with no error other than "start operation timed out." Absent this, genesis appears to hang when it is actually loading normally.
SM_120 · genesis · flash_attn · systemd
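Requirements (2) and (3) are both systemd settings and can be expressed as drop-in overrides rather than edits to the unit files. The sketch below only renders the drop-in text. nemotron.service is the unit named in the restoration notes, while the genesis unit name is not given in this document, so the second override is shown without a target path.

```python
def render_dropin(settings):
    """Render a systemd drop-in that overrides [Service] settings."""
    lines = ["[Service]"] + [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


# Stops the co-resident service from re-claiming the inference port.
nemotron_override = render_dropin({"Restart": "no"})

# Gives the model time to load before systemd gives up on the start job.
genesis_override = render_dropin({"TimeoutStartSec": "300"})

print(nemotron_override)
print(genesis_override)
```

Drop-ins conventionally live under /etc/systemd/system/<unit>.d/ (for example nemotron.service.d/override.conf) and take effect after systemctl daemon-reload.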
Block-diffusion canvas scaling: DiffusionGemma, 2026-06-10
DiffusionGemma 26B (Google DeepMind) introduced a fundamentally different generation architecture to the benchmark suite. Instead of autoregressive token-by-token generation, the model runs iterative denoising passes over a canvas of tokens in parallel, finalizing tokens whose entropy drops below a confidence threshold and remasking the rest. Each forward pass processes the entire canvas; early stopping triggers when the pass meets the convergence criterion. The per-step cost is nearly constant regardless of canvas size, which means requesting longer outputs yields proportionally more tokens per pass. Measured result: n=256 at 48 tok/s (22 steps), n=1024 at 121 tok/s (34 steps across 2 auto-scaled canvas blocks), which is 2.5× throughput for 4× the output, with the same 249ms/step cost. This canvas-parallelism property has no analog in autoregressive inference; throughput figures for block-diffusion models must specify output length to be meaningful.
Dual-GPU default flag issue: --diffusion-kv-cache is auto-disabled on 2-GPU splits in PR #24423 as a conservative default. Forcing --diffusion-kv-cache on is safe on Blackwell TP=2 and delivers +17% throughput with fewer steps (25 → 22), the largest single-flag gain found in this research.
Single-GPU path: Q4_K_M requires 16.0 GB; RTX 5060 Ti has 15.7 GB free after driver overhead. The 300 MB gap prevents single-GPU load. Single-GPU would enable Flash Attention + KV cache simultaneously, for an estimated 20–30% further gain. Requires Q3_K_M, which needs the BF16 GGUF (~50 GB) as quantization source.
block diffusion · canvas scaling · DiffusionGemma
Inference engine readiness: new model evaluation methodology
Evaluating a newly released model now requires testing inference engine support before any optimization work. Gemma 4 12B (2026-06-03) established this pattern: llama.cpp b9245 serves the Q8_0 GGUF cleanly at 30.9 t/s. vLLM 0.21.0 and SGLang 0.5.12 both lack native Gemma4UnifiedForConditionalGeneration support. They fall back to a generic Transformers wrapper that (a) carries 230 MiB of overhead that blocks fp8 single-GPU loading by exactly 114 MiB, and (b) does not support tensor parallelism, making TP=2 fail on weight reshape. The fp8 miss is a near-miss: the model loads to 14.89 GiB on a 15.47 GiB GPU before the profile run OOMs. Native engine support typically lands within weeks of a model release; the evaluation cadence should be: test llama.cpp first (always works), then revisit vLLM/SGLang after the next PyPI release. The HF model artifacts are already on disk and require only a service restart when engine support ships.
new model · engine readiness · Transformers fallback

Historical baseline from cha0tikhome (Beelink EQI12, i5-1235U, 32 GB DDR4). Establishes the starting point before the RTX 5060 Ti tower arrival April 17, 2026. CPU inference ran production workloads for the first phase of the research.

CPU baseline: 10.44 tok/s (Gemma 4 26B Q5_K_M)
Swap usage: 4–8 GB (chronic; co-located agents)
GPU speedup (vs CPU): 10.4 → 83+ t/s
CPU inference sample period: 17 days (Apr 2–18, 2026)
Metric | CPU Era Value | GPU Era Value | Ratio
Gen throughput | 10.44 t/s (avg) | 83–117 t/s | 8–11×
Throughput range | 5.02–11.72 t/s | 71–117 t/s | 7–23×
Time to first token | 616–2,224 ms | <200 ms | 3–11×
Swap usage | 4.6–7.7 GB chronic | 0 GB | eliminated
Model size supported | ≤26B (RAM-limited) | ≤70B (at Q4) | 2.7× parameter scale
Context window | 8–16K (swap risk) | 128K (fp8 KV) | 8–16× context
E-core threading was a key CPU-era finding: on the i5-1235U, the E-cores (cpu_mask=0x3F5) are load-bearing for MoE expert dispatch. Restricting to P-cores only dropped throughput from 10.44 to 7.73 t/s. Optimal: 8 threads, all E+P. This contradicts the common assumption that P-cores dominate inference workloads.
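The affinity mask quoted above is easier to reason about as a list of logical CPU indices. A small sketch that decodes it follows; which indices correspond to P-core threads and which to E-cores depends on how the OS enumerates the i5-1235U, so that mapping is left as an assumption and not encoded here.

```python
def cpus_in_mask(mask):
    """Return the logical CPU indices whose bits are set in an affinity mask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


cpus = cpus_in_mask(0x3F5)
print(cpus, len(cpus))
```

The mask selects eight logical CPUs (0, 2, and 4 through 9), in line with the 8-thread optimum reported. On Linux the same set could be applied with `os.sched_setaffinity(0, cpus)` before launching the benchmark.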

Reference benchmark for consumer Apple Silicon. MacBook Air M4, 16 GB unified memory, llama.cpp 9270 (Homebrew) via Metal backend. Establishes the ARM/Metal data point for comparing inference efficiency across hardware tiers. Model: Qwen2.5-7B-Instruct Q4_K_M (4.36 GiB, llama-bench standard test). Run: 2026-05-22.

Prefill (pp512): 207 tok/s (MTL,BLAS · Flash Attn on)
Generation (tg128): 21.4 tok/s (t=1 · -ngl 99, all layers GPU)
Tower advantage (tg): 3.4× (21.4 → 73 t/s, RTX 5060 Ti)
GPU family: Apple M4 (MTLGPUFamilyApple9 · 16 GB unified)
Test | M4 MacBook Air | RTX 5060 Ti (tower) | Ratio
pp512 | 207.39 ± 3.61 t/s | ~900–1,100 t/s | 4–5×
tg128 | 21.37 ± 0.13 t/s | 73 t/s (dense Q4) | 3.4×
Model | Qwen2.5-7B Q4_K_M | Nemotron 30B Q4_K_M |
Memory | 16 GB unified | 16 GB GDDR7 (single GPU) |
Backend | Metal (MTL,BLAS) | CUDA 13.0.3 |
Tensor ops | disabled (pre-M5) | full CUDA graph |
Power envelope | ~15 W TDP | ~165 W TDP | 11× difference
Metal tensor API constraint: llama.cpp logs tensor API disabled for pre-M5 and pre-A19 devices on M4. The Metal tensor execution path (introduced for M5/A19) is not available, so Metal Shader Language kernels run the standard path. This is the primary reason M4 generation throughput tops out at ~21 t/s for a 7B dense model, whereas the M5 unlocks the tensor path. An M5 MacBook would be expected to reach ~30+ t/s on the same model, though that was not measured here. The prefill rate (207 t/s) reflects Metal compute shader efficiency, which is not blocked by this constraint.

All benchmark data is public. Experiment logs, TSV results files, autoresearch scripts, and optimal configuration shell scripts are in the inference-research repository.

Artifact | Location | Contents
inference-research repo | randomchaos7800-hub/inference-research | All autoresearch scripts, TSV results, optimal config scripts, research logs, incident reports
autoresearch scripts | autoresearch-*.py | 9 LLM-driven optimization loops: vLLM (27B, AWQ, AEON, GLM4), llama.cpp (70B, SuperGemma, MoE, Beelink)
M4 autoresearch log | autoresearch-qwen25-7b-m4-log.md | Full 25-experiment sweep log: thread sweep, flash-attn, MLX vs llama.cpp, Qwen3-8B comparison, KV cache sweep, thermal findings
M4 results TSV | autoresearch-qwen25-7b-m4-results.tsv | 25-row experiment table: config, pp512 t/s, tg128 t/s, notes (Apple M4 MacBook Air, 2026-05-22)
TSV results | autoresearch-*-results.tsv | Per-experiment: variable, value, tg_median, tg_p90, delta, outcome, description
optimal config scripts | current-best-flags-*.sh | Production launch commands for all models with all optimal flags applied, including the M4 Metal config
model queue | model-queue.md | Current deployment state, known constraints, tested model inventory
vLLM incident report | vllm-upgrade-incident-2026-05-10.md | Full post-mortem: 0.19.2rc1 → 0.20.2 failure on SM_120, NCCL crash, kernel pin, recovery procedure
Research paper | papers.html | Preprint: Commodity Hardware and Persistent AI Companions (Zenodo, CC BY 4.0)
Full stack specs | stack.html | Hardware configuration, agent infrastructure, service topology