Boundary Labs  /  Research Infrastructure

The Stack

Hardware specs, inference configuration, optimization history, and agent infrastructure. Everything documented. Raw numbers only.

Compact two-tier research cluster with a clean split between orchestration and inference. Direct-wired, low-latency, and optimized for local-first operation under real workload conditions.

Primary Inference Node

System: CyberPowerPC GXi3400BSTV17
CPU: Intel Core Ultra 7 265F
CPU config: 20c/20t · 5.3 GHz boost
GPU ×2: RTX 5060 Ti 16 GB GDDR7
Architecture: Blackwell SM_120
VRAM total: 32 GB GDDR7 (TP=2)
PCIe config: x8 Gen5 + x4 Gen4
RAM: 32 GB DDR5 (192 GB max)
Storage: 2 TB PCIe 4 NVMe
CUDA: 13.0.3
Inference: vLLM + llama.cpp
Gateway: Single internal inference gateway
OS: Ubuntu 24.04

Orchestration Node

System: Beelink EQI12
CPU: Intel i5-1235U (12th Gen)
CPU config: 10c/12t · 4.4 GHz boost
RAM: 32 GB DDR4
Storage: 1 TB NVMe
OS: Ubuntu 24.04
Uptime: Always-on
Active services: harness, Mike, Hermes, autoresearch, chronicle
Web: nginx · Cloudflare tunnel
Network: Direct-wired to inference tier
Remote access: Private mesh access
Backup: 476 GB NVMe · 3am daily · 30-day retention

Dev Workstation

System: MacBook Air M4
RAM: 32 GB unified
Role: Primary dev, Claude Code CLI
Remote access: Private mesh access

Network  — Topology

Cluster link: Direct Ethernet
External: Private mesh access
Inference access: Single internal gateway for all clients
Web routing: Cloudflare tunnel → nginx
SSH: Private-only, not exposed publicly

All inference clients route through a single internal gateway. Backends are hot-swappable without reconfiguring consuming services. The gateway exposes an OpenAI-compatible interface with model aliasing and auth.
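The aliasing behaviour described above can be sketched as a small lookup table. This is a minimal illustration, not the gateway's actual implementation: the alias name, backend URLs, ports, model strings, and the `resolve`, `swap_backend`, and `rewrite_request` helpers are all hypothetical, and auth is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    base_url: str  # where the serving process listens (hypothetical)
    model: str     # model name the backend itself expects (hypothetical)


# Hypothetical alias table: clients only ever see the alias on the left.
ALIASES: dict[str, Backend] = {
    "default": Backend("http://inference.internal:8001/v1", "nemotron-3-nano-30b"),
}


def resolve(alias: str) -> Backend:
    """Map a client-facing alias to the backend currently serving it."""
    try:
        return ALIASES[alias]
    except KeyError:
        raise ValueError(f"unknown model alias: {alias!r}") from None


def swap_backend(alias: str, backend: Backend) -> None:
    """Point an alias at a different backend; clients keep using the alias."""
    ALIASES[alias] = backend


def rewrite_request(body: dict) -> tuple[str, dict]:
    """Return (upstream URL, request body with the alias replaced)."""
    backend = resolve(body["model"])
    upstream = f"{backend.base_url}/chat/completions"
    return upstream, {**body, "model": backend.model}
```

Because consumers send only the alias, replacing a backend is a single `swap_backend` call and no client configuration changes.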

117 tok/s peak (dual GPU, MoE): Nemotron 3 Nano 30B, 2026-05-20
2,436 tok/s prompt (dual GPU): prefill throughput
32 GB VRAM available (TP=2): dual RTX 5060 Ti
0 swap used (GPU tier): vs 4–8 GB on CPU tier
| Model | Quant | Gen | Context | Server | Status | Notes |
|---|---|---|---|---|---|---|
| Nemotron 3 Nano 30B (MoE hybrid) | Q4_K_M | 117.34 t/s med / 117.60 peak | 32K | llama.cpp cuda128-clean | active (default) | Mamba/SSM hybrid MoE. ~3B active params/token. Requires cuda128-clean build; cuda13 crashes with MMQ CUDA error on SM_120. |
| Qwen3.6-27B GPTQ INT4 + MTP n=3 (Genesis) | AutoRound INT4, fp8 KV | 73.35 t/s | 32K | vLLM 0.21.0 | stopped (replaced 2026-05-20) | Previous default backend. GPTQ-Marlin + VLLM_MARLIN_USE_ATOMIC_ADD=1. MTP n=3 speculative decoding. |
| AEON (Qwen3 NVFP4) | NVFP4 (ModelOpt) | ~69 t/s | 122K | vLLM 0.9+ | stopped | Blackwell-native quantization. MTP speculative decoding n=3. |
| Qwen3.6-27B (SSM hybrid) | Q4_K_M | ~22 t/s | 65K | llama.cpp | stopped | Mamba/SSM hybrid. Fast prefill (960 t/s), slow gen (SSM bottleneck). |
| Gemma 4 26B (CPU baseline) | Q4_K_M | ~11.7 t/s | 32K | llama.cpp | historical | CPU-only orchestration-tier baseline from the pre-GPU phase. Consistent swap usage 4–8 GB. |
Blackwell SM_120 fast CUDA kernel paths cover q4_0 and f16 KV only; q8_0, q5_0, and iq4_nl all degrade significantly. NVFP4 (ModelOpt) is Blackwell-native quantization with a 4× smaller KV footprint vs bf16, and it requires CUDA 12.8+. This system runs CUDA 13.0.3, so Qwen3-series NVFP4 (FlashInfer CUTLASS SM_120 FP4 kernels), previously blocked on CUDA 12.8, should now be unblocked. llama.cpp users: build target matters. The cuda13 build crashes with a CUDA MMQ error on SM_120; the cuda128-clean build is required for llama.cpp GGUF inference.
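These constraints lend themselves to a pre-launch check. A minimal sketch, assuming a hypothetical `check_sm120_config` helper: the fast and degraded sets are copied from the notes above, and the build labels are the names used on this page rather than official llama.cpp identifiers.

```python
FAST_KV_TYPES = {"q4_0", "f16"}                  # fast kernel paths on SM_120 per the notes
DEGRADED_KV_TYPES = {"q8_0", "q5_0", "iq4_nl"}   # reported to degrade significantly
REQUIRED_BUILD = "cuda128-clean"                 # build label used on this page


def check_sm120_config(build: str, kv_type: str) -> list[str]:
    """Return human-readable warnings for a llama.cpp launch on SM_120."""
    warnings = []
    if build != REQUIRED_BUILD:
        warnings.append(
            f"build {build!r}: expected {REQUIRED_BUILD!r} "
            "(cuda13 crashed with an MMQ error on SM_120)"
        )
    if kv_type in DEGRADED_KV_TYPES:
        warnings.append(f"KV type {kv_type!r}: degraded kernel path on SM_120")
    elif kv_type not in FAST_KV_TYPES:
        warnings.append(f"KV type {kv_type!r}: not covered by the notes above")
    return warnings
```

An empty list means the combination matches what was observed to work here; it says nothing about other GPUs or builds.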

130+ automated experiments across six model architectures using a scripted autoresearch loop. Stopping criterion: +5 tok/s improvement per iteration. Full logs at randomchaos7800-hub.
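The loop can be sketched as follows. This is one reading of the stopping criterion, namely that iteration ends at the first experiment that fails to gain at least 5 tok/s over the best result so far; that interpretation, the `autoresearch` function, and the `benchmark` callable are all assumptions for illustration, not the actual tooling.

```python
from typing import Callable, Iterable, Optional

MIN_GAIN_TOK_S = 5.0  # threshold taken from the stated stopping criterion


def autoresearch(
    configs: Iterable[dict],
    benchmark: Callable[[dict], float],
    baseline_tok_s: float,
    min_gain: float = MIN_GAIN_TOK_S,
) -> tuple[Optional[dict], float, list[tuple[dict, float]]]:
    """Try configs in order; stop at the first one that gains less than min_gain.

    Returns (best config or None, best tok/s, log of every (config, tok/s) run).
    """
    best_cfg, best = None, baseline_tok_s
    log: list[tuple[dict, float]] = []
    for cfg in configs:
        tok_s = benchmark(cfg)        # hypothetical: runs one experiment
        log.append((cfg, tok_s))
        if tok_s - best < min_gain:   # improvement too small: stop iterating
            break
        best_cfg, best = cfg, tok_s
    return best_cfg, best, log
```

Keeping the full log, including the run that triggered the stop, means a rejected configuration is still recorded alongside the accepted ones.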

Key Configurations — Qwen3.6-35B-A3B MoE (dual GPU, April 2026 campaign)

| Configuration | Gen Speed | Prompt Speed | Delta vs Baseline | Notes |
|---|---|---|---|---|
| Dual GPU, all-on-GPU, f16 KV | 107 t/s | 2,436 t/s | +50% vs single GPU | Final config. TP=2, PCIe x8+x4. |
| Single GPU, all-on-GPU, f16 KV | 71 t/s | | Pre-dual baseline | After CPU offload fix. Clean VRAM fit. |
| Full GPU offload (Exp 2 breakthrough) | 70 t/s | 222 t/s | +118% gen, +200% prompt | vs expert-on-CPU config. |
| Expert tensors on CPU (MoE routing) | 32 t/s | 74 t/s | −55% | CPU bottleneck on MoE expert routing. |
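The deltas in the table can be recomputed from the raw speeds. A short sketch: the pairing of rows is an assumption read off the Notes column (full offload against expert-on-CPU, dual against single GPU), and the −55% row is assumed to be measured against the full-offload figure.

```python
def pct_delta(new: float, old: float) -> float:
    """Percent change from old to new."""
    return (new - old) / old * 100.0


# (label, new tok/s, old tok/s) triples taken from the table above.
rows = [
    ("full GPU offload, gen", 70, 32),
    ("full GPU offload, prompt", 222, 74),
    ("dual GPU vs single GPU, gen", 107, 71),
    ("experts on CPU vs full offload, gen", 32, 70),
]

for label, new, old in rows:
    print(f"{label}: {pct_delta(new, old):+.1f}%")
```

The results land within about a point of the published figures, which suggests the table values are rounded.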

CPU Inference Baseline — Orchestration Tier (17-day log)

| Metric | Value |
|---|---|
| Sample period | April 2–18, 2026 |
| Model | Gemma 4 26B Q4_K_M |
| Avg gen throughput | 10.49 tok/s |
| Range | 5.02 – 11.72 tok/s |
| Avg time to first token | 616 – 2,224 ms |
| Avg swap used | 4.6 – 7.7 GiB |
| Speedup vs GPU tier | 6.6× (10.5 → 74 tok/s) |
Swap usage is the operationally significant CPU-tier finding. In the earlier multi-agent phase, co-locating several always-on processes and local inference on a 32 GB machine consistently drove swap. The architectural fix — separate inference and orchestration nodes — eliminates this entirely. Not a tuning problem; an architectural one.
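How the swap figures were collected is not stated here. On Linux, one straightforward way is to read `SwapTotal` and `SwapFree` from `/proc/meminfo`; the sketch below assumes that approach, and the `swap_used_gib` helper is hypothetical.

```python
def swap_used_gib(meminfo_text: str) -> float:
    """Compute swap in use (GiB) from the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # sizes are reported in kB (KiB)
    used_kib = fields["SwapTotal"] - fields["SwapFree"]
    return used_kib / (1024 ** 2)
```

On the node itself this would be called as `swap_used_gib(open("/proc/meminfo").read())`, sampled on whatever interval the log uses.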

Autoresearch Results by Category

Optimization delta from experiment baseline (MoE, dual GPU)
full GPU offload (MoE experts): +118%
dual GPU (TP=2): +50%
f16 KV cache: +12%
q8_0 KV cache: −18%
expert tensors on CPU: −55%

Core agent processes run as hardened background services. Auto-restart, post-boot recovery, and local-first inference routing are all part of the runtime design. Slack Socket Mode remains the primary interactive path for the agent stack.

| Process | Framework | Interface | Model tier | Status |
|---|---|---|---|---|
| Agent Harness | Python / custom | Slack Socket Mode | Nemotron 3 Nano 30B (local) | live |
| Mike | RelayV3 (Python) | Discord + Telegram + Slack + IRC | Internal inference gateway | live |
| Harness Personas | Internal harness roles | Not separately exposed | Internal gateway / task-dependent | internal |
| Hermes | Python / hermes-gateway.service | Gateway agent; local inference via tower:8010 proxy | Local (genesis INT4 / Nemotron MoE) | live |
| Chronicle | Python / cron | 2am daily cron | Claude API | live |

Service Hardening Details

Crash recovery and boot recovery are actively hardened, but the exact unit settings evolve with the live system. The durable design claim is local-first operation: the core agent runtime stays on local infrastructure, while OpenRouter and Claude API remain optional supporting paths.
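Local-first routing can be sketched as an ordered try list. This assumes the optional paths act as fallbacks when the local backend fails, which is an interpretation rather than something stated above; `complete_local_first` and the `CompletionFn` type are hypothetical names.

```python
from typing import Callable, Sequence

CompletionFn = Callable[[str], str]  # takes a prompt, returns a completion


def complete_local_first(
    prompt: str,
    local: CompletionFn,
    fallbacks: Sequence[CompletionFn] = (),
) -> str:
    """Try the local backend first; use optional remote paths only on failure."""
    errors = []
    for backend in (local, *fallbacks):
        try:
            return backend(prompt)
        except Exception as exc:  # broad on purpose: any backend failure falls through
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} backends failed: {errors!r}")
```

With an empty `fallbacks` sequence the function is strictly local, so the remote paths stay optional by construction.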

All benchmark data and logs referenced in the research paper are available publicly.

| Artifact | Location | Contents |
|---|---|---|
| Live benchmark data | boundarylabs.org/benchmarks.html | Inference metrics, MESA scores, LongMemEval results |
| Agent infrastructure | randomchaos7800-hub | Agent harness, benchmark tooling, optimization logs |
| Research paper | papers.html | Full preprint text, tables, references |
| Weekly build logs | dinovitale.com | Weekly recap posts with build decisions and operational notes |