Hardware specs, inference configuration, optimization history, and agent infrastructure. Everything documented. Raw numbers only.
Compact two-tier research cluster with a clean split between orchestration and inference. Direct-wired, low-latency, and optimized for local-first operation under real workload conditions.
All inference clients route through a single internal gateway. Backends are hot-swappable without reconfiguring consuming services. The gateway exposes an OpenAI-compatible interface with model aliasing and auth.
| Model | Quant | Gen | Context | Server | Status | Notes |
|---|---|---|---|---|---|---|
| Nemotron 3 Nano 30B (MoE hybrid) | Q4_K_M | 117.34 t/s med / 117.60 peak | 32K | llama.cpp cuda128-clean | active — default | Mamba/SSM hybrid MoE. ~3B active params/token. Requires cuda128-clean build; cuda13 crashes with MMQ CUDA error on SM_120. |
| Qwen3.6-27B GPTQ INT4 + MTP n=3 (Genesis) | AutoRound INT4, fp8 KV | 73.35 t/s | 32K | vLLM 0.21.0 | stopped — replaced 2026-05-20 | Previous default backend. GPTQ-Marlin + VLLM_MARLIN_USE_ATOMIC_ADD=1. MTP n=3 speculative decoding. |
| AEON (Qwen3 NVFP4) | NVFP4 (ModelOpt) | ~69 t/s | 122K | vLLM 0.9+ | stopped | Blackwell-native quantization. MTP speculative decoding n=3. |
| Qwen3.6-27B (SSM hybrid) | Q4_K_M | ~22 t/s | 65K | llama.cpp | stopped | Mamba/SSM hybrid. Fast prefill (960 t/s), slow gen (SSM bottleneck). |
| Gemma 4 26B (CPU baseline) | Q4_K_M | ~11.7 t/s | 32K | llama.cpp | historical | CPU-only orchestration-tier baseline from the pre-GPU phase. Consistent swap usage 4–8 GB. |
130+ automated experiments across six model architectures using a scripted autoresearch loop. Stopping criterion: +5 tok/s improvement per iteration. Full logs at randomchaos7800-hub.
| Configuration | Gen Speed | Prompt Speed | Delta vs Baseline | Notes |
|---|---|---|---|---|
| Dual GPU, all-on-GPU, f16 KV | 107 t/s | 2,436 t/s | +50% vs single GPU | Final config. TP=2, PCIe x8+x4. |
| Single GPU, all-on-GPU, f16 KV | 71 t/s | — | Pre-dual baseline | After CPU offload fix. Clean VRAM fit. |
| Full GPU offload (Exp 2 breakthrough) | 70 t/s | 222 t/s | +118% gen, +200% prompt | vs expert-on-CPU config. |
| Expert tensors on CPU (MoE routing) | 32 t/s | 74 t/s | −55% | CPU bottleneck on MoE expert routing. |
| Metric | Value |
|---|---|
| Sample period | April 2–18, 2026 |
| Model | Gemma 4 26B Q4_K_M |
| Avg gen throughput | 10.49 tok/s |
| Range | 5.02 – 11.72 tok/s |
| Avg time to first token | 616 – 2,224 ms |
| Avg swap used | 4.6 – 7.7 GiB |
| Speedup vs GPU tier | 6.6× (10.5 → 74 tok/s) |
Core agent processes run as hardened background services. Auto-restart, post-boot recovery, and local-first inference routing are all part of the runtime design. Slack Socket Mode remains the primary interactive path for the agent stack.
| Process | Framework | Interface | Model tier | Status |
|---|---|---|---|---|
| Agent Harness | Python / custom | Slack Socket Mode | Nemotron 3 Nano 30B (local) | live |
| Mike | RelayV3 (Python) | Discord + Telegram + Slack + IRC | Internal inference gateway | live |
| Harness Personas | Internal harness roles | Not separately exposed | Internal gateway / task-dependent | internal |
| Hermes | Python / hermes-gateway.service | Gateway agent — local inference via tower:8010 proxy | Local (genesis INT4 / Nemotron MoE) | live |
| Chronicle | Python / cron | 2am daily cron | Claude API | live |
All benchmark data and logs referenced in the research paper are available publicly.
| Artifact | Location | Contents |
|---|---|---|
| Live benchmark data | boundarylabs.org/benchmarks.html | Inference metrics, MESA scores, LongMemEval results |
| Agent infrastructure | randomchaos7800-hub | Agent harness, benchmark tooling, optimization logs |
| Research paper | papers.html | Full preprint text, tables, references |
| Weekly build logs | dinovitale.com | Weekly recap posts with build decisions and operational notes |