Complete record of inference optimization research conducted on consumer NVIDIA Blackwell hardware. Systematic autoresearch loops, controlled baselines, logged experiments. All numbers real, reproducible, and sourced from production hardware.
All inference experiments run on the cha0tiktower inference node. Single-machine, consumer-tier Blackwell GPU — no NVLink, PCIe interconnect only. Tensor parallelism over PCIe x8+x4 Gen5/Gen4. Full specs at stack.html.
Live inference configurations serving all production agent workloads on cha0tiktower as of 2026-05-20. All clients route through tower:8010 local proxy. Switch command: proxy-switch nemotron|deepseek-r1|aeon|openrouter.
| Configuration | Quantization | Generation throughput | Context | Framework | VRAM | Status |
|---|---|---|---|---|---|---|
| Nemotron 3 Nano 30B (MoE hybrid) | Q4_K_M | 117.34 t/s med / 117.60 peak | 32K | llama.cpp cuda128-clean | 24.9 GB (both GPUs) | active — default |
| Qwen3.6-27B GPTQ INT4 + MTP n=3 (Genesis) | AutoRound INT4 | 88 t/s warm / 63.9 avg | 65K | vLLM 0.21.0 + Genesis patches | ~28 GB (fp8 KV, both GPUs) | active — restored 2026-06-06 |
| Qwen3.6-27B PRISM-PRO-DQ (llama.cpp, MTP n=2) | DQ (GGUF) | 39.30 t/s | 32K | llama.cpp | ~28 GB | research eval only — 2026-05-19 |
| AEON (Qwen 3-27B NVFP4) | NVFP4 (NVIDIA ModelOpt) | ~69 t/s | 122K | vLLM 0.19.2rc1 | ~20 GB | stopped |
| Llama 3.3 70B | Q4_0 / 60 GPU layers | 55.6 t/s | 65K | llama.cpp | ~34 GB (OOM ceiling) | historical — VRAM limit |
| Gemma 4 26B (CPU baseline) | Q4_K_M (CPU-only) | 10.4–11.7 t/s | 32K | llama.cpp | — | pre-tower era |
ggml_cuda_mul_mat_q: invalid argument) on first generation after the prompt eval on SM_120. This is a Blackwell SM_120 + cuda13 MMQ kernel incompatibility, not a model issue.

flash_attn.ops module missing from vllm-env: installed flash-attn 2.8.3 with CUDA 13.0 headers; (2) nemotron.service Restart=always kept seizing port 8022 after every stop, fixed with a Restart=no systemd drop-in override; (3) TimeoutStartSec absent from service file, so the model was killed at 90s before completing load. Config upgrades applied: fp8 KV cache (halves the KV-cache VRAM footprint, enables 65K context), qwen3_coder tool-call parser (fixes silent drop bug), explicit SM_120 FlashInfer routing env vars, FlashInfer workspace reduced 413→256 MiB to clear OOM on first MTP inference. Stable at 88 t/s warm. Both Nemotron (117 t/s MoE) and Genesis (88 t/s dense) are now available as production backends via proxy:8010.

Speed is necessary but not sufficient. Two quality suites validate the production backends across capability dimensions.
Comprehensive coding/math/instruction/tool/knowledge suite run against vLLM genesis with thinking disabled (enable_thinking=false) for latency. MTP n=3 speculative decoding active. Throughput averaged across 29 varied prompts.
enable_thinking=false for production latency sacrifices deep reasoning chains. The 2 remaining math failures (GSM8K arithmetic, sum-of-multiples) are model errors in no-think mode, not hardware or quantization artifacts. For coding, tool calling, and instruction following the model is production-ready.

A curated quality suite validates the current production model across five capability dimensions. Each response is scored 1–5 by Claude Haiku acting as judge, using per-probe rubrics with expected outputs. Run against proxy at 117 t/s.
| Probe | Category | Score | Result |
|---|---|---|---|
| r1–r5 (all) | Reasoning | 5/5 each | Syllogism, bat/ball puzzle, sequences, fallacy detection, pill timing — all perfect. |
| f1–f5 (all) | Factuality | 5/5 each | Great Wall myth, 347×28 arithmetic, Canberra capital, 10% brain myth, Berlin Wall year — all correct. |
| i1–i4 | Instruction Following | 5/5 each | Exact list count, raw JSON, two-sentence summary, forbidden-word avoidance — all perfect. |
| i5 | Instruction Following | 5/5 | Exact structured format (DIFFICULTY / TIME / STEPS). Transient empty response on first query; re-queried correctly. |
| c1 | Coding | 3/5 | Vowel-counter function correct; docstring quality below rubric threshold. |
| c2–c5 | Coding | 5/5 each | Bug identification, palindrome with all normalizations, max() one-liner, Fibonacci explanation — all perfect. |
| a1–a3, a5 | Agent Tasks | 5/5 each | JSON extraction, classification array, professional status update, graceful error — all correct. |
| a4 | Agent Tasks | 3/5 | 5-step Python project setup: steps 1–4 correct; step 5 truncated at max_tokens limit (512). |
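The per-probe scores in the table above can be tallied into a single figure. The following is a minimal sketch that transcribes the 25 scores from the table; the `summarize` helper and the category keys are illustrative names, not part of the model-eval repository.

```python
# Per-probe scores transcribed from the table above (25 probes, 1-5 scale).
scores = {
    "reasoning": [5, 5, 5, 5, 5],
    "factuality": [5, 5, 5, 5, 5],
    "instruction": [5, 5, 5, 5, 5],
    "coding": [3, 5, 5, 5, 5],   # c1 scored 3/5
    "agent": [5, 5, 5, 3, 5],    # a4 scored 3/5
}


def summarize(scores, max_score=5):
    """Return (total, maximum, percent) across all categories."""
    total = sum(sum(v) for v in scores.values())
    maximum = max_score * sum(len(v) for v in scores.values())
    return total, maximum, 100.0 * total / maximum


total, maximum, pct = summarize(scores)
print(f"{total}/{maximum} = {pct:.1f}%")  # prints "121/125 = 96.8%"
```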
claude-haiku-4-5-20251001 acting as judge. The two 3/5 scores are both partial-credit results (correct output, minor defect); zero probes failed outright. The one transient empty response (i5) resolved correctly on immediate retry. Source: randomchaos7800-hub/model-eval.

Each campaign runs an automated LLM-driven hypothesis loop: generate a config variant, benchmark, evaluate the stopping criterion (+5 t/s improvement threshold), iterate. Full logs and raw TSV data are available at github.com/randomchaos7800-hub/inference-research.
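The campaign loop described above might be sketched roughly as follows. This is not the repository's autoresearch code: `run_campaign`, the `benchmark` callable and the synthetic stand-in are hypothetical, and the way the +5 t/s threshold is applied (a variant is adopted only when it beats the current best median by at least the threshold) is one reading of the description, stated here as an assumption.

```python
from statistics import median
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]
LogEntry = Tuple[Config, float, float]  # (variant, median t/s, delta vs current best)


def run_campaign(
    baseline: Config,
    variants: List[Config],
    benchmark: Callable[[Config], List[float]],
    threshold_tps: float = 5.0,
) -> Tuple[Config, float, List[LogEntry]]:
    """Benchmark each variant on top of the current best config.

    Assumption: a variant is adopted only if its median beats the
    current best by at least threshold_tps.
    """
    best_cfg = dict(baseline)
    best_tps = median(benchmark(best_cfg))
    log: List[LogEntry] = []
    for variant in variants:
        candidate = {**best_cfg, **variant}
        tps = median(benchmark(candidate))
        delta = tps - best_tps
        log.append((variant, tps, delta))
        if delta >= threshold_tps:
            best_cfg, best_tps = candidate, tps
    return best_cfg, best_tps, log


def synthetic_benchmark(cfg: Config) -> List[float]:
    """Stand-in for a real benchmark run; the numbers are made up."""
    base = 74.0
    if cfg.get("X") == "1":
        base += 6.0
    if cfg.get("Y") == "2":
        base += 1.0
    return [base, base + 0.2, base - 0.2]


best_cfg, best_tps, log = run_campaign({}, [{"X": "1"}, {"Y": "2"}], synthetic_benchmark)
print(best_cfg, best_tps)
```

In a real run the `benchmark` callable would launch the inference server with the candidate settings and return per-run generation throughput samples.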
ggml_cuda_mul_mat_q: invalid argument) after the first inference, every time. The vLLM FP8 TP=2 path also failed: wrong flag (--disable-log-requests removed in newer vLLM) and NCCL engine init failure. Q4_K_M at 117 t/s is the new production peak; current default backend via proxy :8010.

--cpu-offload-gb 1.0 to initialize (Marlin repacking peak creates ~200 MiB initialization overage). But throughput differs by 8×: 5.25 vs 41.33 t/s. Dense 32B: all 32B weights accessed per token, PCIe bus is saturated. MoE 30B-A3B: only ~3B active params per token, PCIe pressure is ~10× lower. NVFP4 was blocked by FlashInfer's CUTLASS SM_120 FP4 kernels requiring CUDA ≥ 12.9; this system ran 12.8 at the time. System upgraded to CUDA 13.0.3 on 2026-06-06, which clears the CUDA version barrier; Qwen3 NVFP4 on SM_120 is now a candidate for re-evaluation.

-ngl 99 --flash-attn on -t 1

unknown model architecture: diffusion-gemma). Build target: CUDA 12.8 / sm_120 cuda128-clean. Key flag: --diffusion-kv-cache on is auto-disabled on 2-GPU splits as a conservative default; forcing it on yields +17% and reduces step count from 25 → 22 with no quality degradation observed. The architecture's native strength is longer outputs: per-step cost is nearly constant regardless of canvas size, so n=1024 delivers 2.5× the throughput of n=256. Single-GPU is blocked by a 300 MB headroom gap (Q4_K_M needs 16.0 GB, card has 15.7 GB free); closing it would unlock Flash Attention + KV together for an estimated further 20–30% gain.

Generation throughput improvement from baseline across all optimization campaigns. Bars normalized to the largest observed delta (+17% DiffusionGemma 26B kv-cache flag). Negative deltas not shown; failed/OOM campaigns listed separately.
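The dense-versus-MoE comparison earlier in this section quotes an "8×" throughput gap and a "~10×" difference in PCIe pressure. A short worked check of both ratios, using only the figures quoted there (the variable names are illustrative):

```python
# Figures quoted in the dense-vs-MoE finding above.
dense_tps, moe_tps = 5.25, 41.33          # generation throughput, t/s
dense_active_b, moe_active_b = 32.0, 3.0  # approx. params touched per token, billions

throughput_ratio = moe_tps / dense_tps
active_param_ratio = dense_active_b / moe_active_b

print(f"throughput ratio:   {throughput_ratio:.2f}x")
print(f"active-param ratio: {active_param_ratio:.1f}x")
```

The two ratios come out close to each other (roughly 7.9× and 10.7×), which is consistent with the source's explanation that per-token weight traffic, not compute, dominates in this setup.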
Complete 13-experiment run showing exact variables, deltas, and outcomes. Representative of methodology applied across all campaigns.
| # | Variable | Value | t/s (median) | Delta | Outcome |
|---|---|---|---|---|---|
| 0 | Baseline (Pass 1 canonical, fp16 KV) | — | 74.35 | ±0.00 | BASELINE |
| 1 | VLLM_MARLIN_USE_ATOMIC_ADD | 1 | 80.59 | +6.25 | IMPROVEMENT |
| 2 | VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE | 1073741824 | 73.57 | −0.78 | NO CHANGE |
| 3 | VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE | 805306368 | 74.92 | −0.57 | NO CHANGE |
| 4 | OMP_NUM_THREADS | 2 | 80.14 | −0.46 | NO CHANGE |
| 5 | NCCL_BUFFSIZE | 8388608 | 80.99 | +0.40 | MARGINAL |
| 6 | NCCL_BUFFSIZE | 16777216 | 81.65 | +1.06 | MARGINAL |
| 7 | PYTORCH_MAX_SPLIT_MB | 64 | 81.18 | +0.58 | MARGINAL |
| 8 | PYTORCH_MAX_SPLIT_MB | 128 | 74.44 | −6.15 | NO CHANGE |
| 9 | OMP4 + exclusive buffer (stacked) | combined | 80.54 | −0.06 | NO CHANGE |
| 10 | OMP4 + exclusive + GMU 0.88 (triple stack) | combined | 79.15 | −1.45 | NO CHANGE |
| 11 | ctx 16K + GMU 0.88 | combined | 79.11 | −1.48 | NO CHANGE |
| 12 | Marlin atomic + OMP4 + exclusive | combined | 80.02 | −0.58 | NO CHANGE |
| final | Best config re-verified | — | 80.48 | +6.13 | FINAL |
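The headline gain in the table above can be recomputed from the two medians. The table does not state which reference each intermediate row's delta is measured against, so this sketch only checks the final-versus-baseline figure; the `delta` helper is an illustrative name.

```python
from typing import Tuple


def delta(tps: float, reference: float) -> Tuple[float, float]:
    """Return (absolute t/s change, percent change) against a reference median."""
    diff = tps - reference
    return diff, 100.0 * diff / reference


baseline = 74.35  # row 0, t/s median
final = 80.48     # "final" row, t/s median

abs_gain, pct_gain = delta(final, baseline)
print(f"{abs_gain:+.2f} t/s ({pct_gain:+.1f}%)")  # prints "+6.13 t/s (+8.2%)"
```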
Operational findings specific to NVIDIA Blackwell SM_120 consumer GPU inference. These characterize the platform envelope for first-generation Blackwell consumer hardware — a dataset that is sparse in the current literature as most published research covers datacenter-class (H100, A100) or older consumer generations (Ampere, Ada).
q4_0 and f16 KV cache only. q8_0, q5_0, and iq4_nl all degrade significantly — they fall back to unoptimized paths with 15–30% throughput loss. This is a first-generation Blackwell constraint; future driver updates may expand coverage. Choosing the wrong KV dtype is one of the most common configuration errors on this platform.
SM_120, KV dtype selection
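A pre-launch check along these lines could catch a slow KV cache type before a server starts. This is a sketch only: the two sets simply restate the finding above and are not exhaustive, and `check_kv_type` is a hypothetical helper, not part of llama.cpp.

```python
# KV cache types the finding above reports as fast on SM_120.
FAST_KV_TYPES = {"q4_0", "f16"}
# Types the finding reports as falling back to slower paths.
SLOW_KV_TYPES = {"q8_0", "q5_0", "iq4_nl"}


def check_kv_type(cache_type: str) -> str:
    """Classify a KV cache type as 'fast', 'slow', or 'untested'."""
    if cache_type in FAST_KV_TYPES:
        return "fast"
    if cache_type in SLOW_KV_TYPES:
        return "slow"
    return "untested"


for kv in ("q4_0", "q8_0", "q5_1"):
    print(kv, check_kv_type(kv))
```

Types outside both sets are reported as untested because the finding makes no claim about them.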
llama.cpp must be built with the cuda128-clean target — not cuda13. The cuda13 build crashes with ggml_cuda_mul_mat_q: invalid argument on the first generation call after prompt eval, on every model tested (Nemotron Q4_K_M, Qwen3 GGUF variants). This is a CUDA MMQ kernel incompatibility between the cuda13 build's quantized matrix-multiply path and SM_120 hardware — not a model issue. The cuda128-clean build routes around this kernel entirely and runs stably. Confirmed 2026-05-20 during Nemotron format trials. Anyone deploying llama.cpp GGUF inference on Blackwell consumer hardware should verify their build target before assuming the model is the problem.
SM_120, llama.cpp, build target
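Because the crash above looks like a model problem at first glance, it can help to scan server logs for the exact error string before debugging further. A minimal sketch; the function name is hypothetical and the sample text is synthetic, not a real log excerpt.

```python
# Error string reported in the finding above.
MMQ_CRASH_SIGNATURE = "ggml_cuda_mul_mat_q: invalid argument"


def looks_like_mmq_build_crash(log_text: str) -> bool:
    """True if the log contains the MMQ crash string from the cuda13 build."""
    return MMQ_CRASH_SIGNATURE in log_text


# Synthetic sample for illustration only.
sample = "... ggml_cuda_mul_mat_q: invalid argument ..."
print(looks_like_mmq_build_crash(sample))
```

A match suggests checking the build target first, as the finding recommends, before suspecting the model file.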
VLLM_MARLIN_USE_ATOMIC_ADD=1 enables atomic-add reduce in the gptq_marlin kernel for small-n decode on TP=2 configurations. This produced the largest single-variable improvement in the research: +6.25 t/s (8.4%) on Qwen 3.6-27B without any model or architecture change. The improvement is specific to GPTQ-quantized models running TP=2 over PCIe — it is not applicable to NVFP4 or unquantized deployments.
GPTQ-Marlin, TP=2
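One way to apply the flag above is to set it in the environment handed to the server process. The helper below is a hypothetical sketch that only builds the environment; the launch command itself is left out because the source does not give it.

```python
import os
from typing import Dict, Optional


def marlin_atomic_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the Marlin atomic-add flag set."""
    env = dict(os.environ if base is None else base)
    env["VLLM_MARLIN_USE_ATOMIC_ADD"] = "1"
    return env


env = marlin_atomic_env({"PATH": "/usr/bin"})
print(env["VLLM_MARLIN_USE_ATOMIC_ADD"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen` when starting the server, so the flag stays scoped to that process.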
-ngl 99) on M4, t=1 is strictly optimal: extra CPU threads fight over unified memory bandwidth without contributing to GPU compute. Flash attention is supported on M4 Metal (MTLGPUFamilyApple9) and gives +9–10% on both pp and tg. Metal tensor API is disabled on pre-M5 hardware: llama.cpp logs tensor API disabled for pre-M5 and pre-A19 devices, which blocks the fast tensor execution path and caps generation at ~21 t/s on 7B dense models. MLX is not a drop-in replacement: despite being Apple-native, it is 3× slower on prefill than llama.cpp Metal for Q4_K_M weights, and its marginal generation advantage does not compensate. Fanless thermal throttle is a hard constraint: sustained benchmarking degrades pp from 209 t/s (cold) to 152 t/s (warm, 20 min), a 27% loss. The first inference session each day is significantly faster than sustained use. KV V-cache quantization is blocked on Metal flash-attn: any V-quantized configuration fails completely; K-only quantization is marginal (+0 to +3%). Established 2026-05-22 on MacBook Air M4, llama.cpp 9270 (Homebrew), macOS 15.
Metal, Apple Silicon, fanless thermal, pre-M5
ModuleNotFoundError: No module named 'flash_attn.ops'. Build requires CUDA_HOME=/usr/local/cuda-13.0 and --no-build-isolation; flash-attn 2.8.3 supports torch 2.11.0+cu130. (2) Port isolation is mandatory: any co-resident inference service (llama.cpp, another vLLM) must have Restart=no in its systemd unit. A service with Restart=always will re-claim the inference port within seconds of a genesis stop — causing a phantom "address in use" crash loop that looks like a genesis bug. (3) TimeoutStartSec=300 is required in the systemd unit file. The default (90s) is shorter than the model load time — systemd kills the process mid-load with no error other than "start operation timed out." Absent this, genesis appears to hang when it is actually loading normally.
SM_120, genesis, flash_attn, systemd
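The two systemd settings named above (Restart=no for co-resident services, TimeoutStartSec=300 for the slow-loading one) are typically applied as drop-in files. The sketch below renders the drop-in text and its conventional path; `nemotron.service` comes from this page, while `genesis.service` is an assumed unit name, and both helper functions are illustrative.

```python
from pathlib import PurePosixPath


def dropin_path(unit: str, name: str = "override.conf") -> PurePosixPath:
    """Conventional systemd drop-in location for a system unit."""
    return PurePosixPath("/etc/systemd/system") / f"{unit}.d" / name


def dropin_body(**service_settings: str) -> str:
    """Render a [Service] drop-in from key=value settings."""
    lines = ["[Service]"] + [f"{key}={value}" for key, value in service_settings.items()]
    return "\n".join(lines) + "\n"


# Co-resident service: stop it from re-claiming the port after a manual stop.
print(dropin_path("nemotron.service"))
print(dropin_body(Restart="no"))

# Slow-loading service: allow up to 300 s for startup (unit name is assumed).
print(dropin_path("genesis.service"))
print(dropin_body(TimeoutStartSec="300"))
```

After writing a drop-in, `systemctl daemon-reload` is needed before the change takes effect.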
--diffusion-kv-cache is auto-disabled on 2-GPU splits in PR #24423 as a conservative default. Forcing --diffusion-kv-cache on is safe on Blackwell TP=2 and delivers +17% throughput with fewer steps (25 → 22) — the largest single-flag gain found in this research. Single-GPU path: Q4_K_M requires 16.0 GB; RTX 5060 Ti has 15.7 GB free after driver overhead. 300 MB gap prevents single-GPU load. Single-GPU would enable Flash Attention + KV cache simultaneously — estimated 20–30% further gain. Requires Q3_K_M, which needs the BF16 GGUF (~50 GB) as quantization source.
block diffusion, canvas scaling, DiffusionGemma
Gemma4UnifiedForConditionalGeneration support — they fall back to a generic Transformers wrapper that (a) carries 230 MiB of overhead that blocks fp8 single-GPU loading by exactly 114 MiB, and (b) does not support tensor parallelism, making TP=2 fail on weight reshape. The fp8 miss is a near-miss: the model loads to 14.89 GiB on a 15.47 GiB GPU before the profile run OOMs. Native engine support lands within weeks of a model release; the evaluation cadence should be: test llama.cpp first (always works), then revisit vLLM/SGLang after the next PyPI release. The HF model artifacts are already on disk and require only a service restart when engine support ships.
new model, engine readiness, Transformers fallback
Historical baseline from cha0tikhome (Beelink EQI12, i5-1235U, 32 GB DDR4). Establishes the starting point before the RTX 5060 Ti tower arrived on April 17, 2026. CPU inference ran production workloads for the first phase of the research.
| Metric | CPU Era Value | GPU Era Value | Ratio |
|---|---|---|---|
| Gen throughput | 10.44 t/s (avg) | 83–117 t/s | 8–11× |
| Throughput range | 5.02–11.72 t/s | 71–117 t/s | 7–23× |
| Time to first token | 616–2,224 ms | <200 ms | 3–11× |
| Swap usage | 4.6–7.7 GB chronic | 0 GB | eliminated |
| Model size supported | ≤26B (RAM-limited) | ≤70B (at Q4) | 2.7× parameter scale |
| Context window | 8–16K (swap risk) | 128K (fp8 KV) | 8–16× context |
Reference benchmark for consumer Apple Silicon. MacBook Air M4, 16 GB unified memory, llama.cpp b200 (Homebrew) via Metal backend. Establishes the ARM/Metal data point for comparing inference efficiency across hardware tiers. Model: Qwen2.5-7B-Instruct Q4_K_M (4.36 GiB, llama-bench standard test). Run: 2026-05-22.
| Test | M4 MacBook Air | RTX 5060 Ti (tower) | Ratio |
|---|---|---|---|
| pp512 | 207.39 ± 3.61 t/s | ~900–1,100 t/s | 4–5× |
| tg128 | 21.37 ± 0.13 t/s | 73 t/s (dense Q4) | 3.4× |
| Model | Qwen2.5-7B Q4_K_M | Nemotron 30B Q4_K_M | — |
| Memory | 16 GB unified | 16 GB GDDR7 (single GPU) | — |
| Backend | Metal (MTL,BLAS) | CUDA 13.0.3 | — |
| Tensor ops | disabled (pre-M5) | full CUDA graph | — |
| Power envelope | ~15 W TDP | ~165 W TDP | 11× difference |
tensor API disabled for pre-M5 and pre-A19 devices on M4. The Metal tensor execution path (introduced for M5/A19) is not available, so Metal Shading Language kernels run the standard path. This is the primary reason M4 generation throughput tops out at ~21 t/s for a 7B dense model, vs the M5 which unlocks the tensor path. Upgrading to an M5 MacBook would push this to ~30+ t/s on the same model. The prefill rate (207 t/s) reflects Metal compute shader efficiency, which is not blocked by this constraint.

All benchmark data is public. Experiment logs, TSV results files, autoresearch scripts, and optimal configuration shell scripts are in the inference-research repository.
| Artifact | Location | Contents |
|---|---|---|
| inference-research repo | randomchaos7800-hub/inference-research | All autoresearch scripts, TSV results, optimal config scripts, research logs, incident reports |
| autoresearch scripts | autoresearch-*.py | 9 LLM-driven optimization loops: vLLM (27B, AWQ, AEON, GLM4), llama.cpp (70B, SuperGemma, MoE, Beelink) |
| M4 autoresearch log | autoresearch-qwen25-7b-m4-log.md | Full 25-experiment sweep log: thread sweep, flash-attn, MLX vs llama.cpp, Qwen3-8B comparison, KV cache sweep, thermal findings |
| M4 results TSV | autoresearch-qwen25-7b-m4-results.tsv | 25-row experiment table: config, pp512 t/s, tg128 t/s, notes (Apple M4 MacBook Air, 2026-05-22) |
| TSV results | autoresearch-*-results.tsv | Per-experiment: variable, value, tg_median, tg_p90, delta, outcome, description |
| optimal config scripts | current-best-flags-*.sh | Production launch commands for all models with all optimal flags applied; includes M4 Metal config |
| model queue | model-queue.md | Current deployment state, known constraints, tested model inventory |
| vLLM incident report | vllm-upgrade-incident-2026-05-10.md | Full post-mortem: 0.19.2rc1 → 0.20.2 failure on SM_120, NCCL crash, kernel pin, recovery procedure |
| Research paper | papers.html | Preprint: Commodity Hardware and Persistent AI Companions — Zenodo CC BY 4.0 |
| Full stack specs | stack.html | Hardware configuration, agent infrastructure, service topology |
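The per-experiment TSV files listed in the table above can be read with the standard library. This sketch assumes each file starts with a header row naming the columns given in the table (variable, value, tg_median, tg_p90, delta, outcome, description); the actual files may differ. The sample reuses two medians from the experiment table on this page and leaves fields the page does not report empty.

```python
import csv
import io
from typing import Dict, List

COLUMNS = ["variable", "value", "tg_median", "tg_p90", "delta", "outcome", "description"]


def load_results(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated results; assumes a header row with the columns above."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def best_row(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the row with the highest median generation throughput."""
    return max(rows, key=lambda row: float(row["tg_median"]))


# Illustrative sample; tg_p90 and description are left empty on purpose.
sample = (
    "\t".join(COLUMNS) + "\n"
    "baseline\t\t74.35\t\t0.00\tBASELINE\t\n"
    "VLLM_MARLIN_USE_ATOMIC_ADD\t1\t80.59\t\t6.25\tIMPROVEMENT\t\n"
)

rows = load_results(sample)
best = best_row(rows)
print(best["variable"], best["tg_median"])
```

For a file on disk, reading its contents with `Path(...).read_text()` and passing the text to `load_results` would be enough.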