Boundary Labs / Research Infrastructure

The Stack

Hardware specs, inference configuration, optimization history, and agent infrastructure. Everything documented. Raw numbers only.

Research Cluster

Two-machine cluster connected by direct-wire Ethernet on an isolated /30 subnet (10.10.10.0/30). Sub-millisecond latency between nodes. cha0tikhome handles all orchestration and agent processes. cha0tiktower is inference-only. Both connected to Tailscale mesh for remote access.

cha0tiktower — Primary Inference Node

SystemCyberPowerPC GXi3400BSTV17

CPUIntel Core Ultra 7 265F

CPU config20c/20t · 5.3GHz boost

GPU ×2RTX 5060 Ti 16GB GDDR7

ArchitectureBlackwell SM_120

VRAM total32 GB GDDR7 (TP=2)

PCIe configx8 Gen5 + x4 Gen4

RAM32 GB DDR5 (192 GB max)

Storage2 TB PCIe 4 NVMe

CUDA12.8

InferencevLLM + llama.cpp

Proxylocal-proxy :8010

OSUbuntu 24.04

cha0tikhome — Orchestration Node

SystemBeelink EQI12

CPUIntel i5-1235U (12th Gen)

CPU config12c · 4.4GHz boost

RAM32 GB DDR4

Storage1 TB NVMe

OSUbuntu 24.04

UptimeAlways-on

AgentsFrank, Kato, CJ, Mike, Dave, Morty, Sabrina, Hermes

Webnginx :8080 · cloudflared tunnel

NetworkDirect wire to tower

VPNTailscale 100.94.10.36

Backup476 GB NVMe · 3am daily · 30-day retention

cha0tikmac — Dev Workstation

SystemMacBook Air M4

RAM32 GB unified

RolePrimary dev, Claude Code CLI

VPNTailscale 100.83.149.28

Network — Topology

Cluster linkDirect Ethernet 10.10.10.0/30

ExternalTailscale mesh (all nodes)

Inference proxytower:8010 (all clients)

Web routingcloudflared → nginx :8080

SSHTailscale-only (not exposed)

Inference Configuration

All inference clients route through a single local proxy on tower:8010. Backends are hot-swappable via config without reconfiguring any consuming agent. OpenAI-compatible API with model aliasing and auth. Switch command: proxy-switch genesis|aeon.

107

tok/s peak (dual GPU, MoE)

Qwen3.6-35B-A3B, 2026-04-20

2,436

tok/s prompt (dual GPU)

prefill throughput

32 GB

VRAM available (TP=2)

dual RTX 5060 Ti

swap used (GPU tier)

vs 4–8 GB on CPU tier

Model	Quant	Gen	Context	Server	Status	Notes
Qwen3.6-35B-A3B (MoE)	UD-Q4_K_M	~107 t/s	65K	llama.cpp	active	Genesis. MoE: only 3B active params per token. f16 KV. Default backend.
Qwen3 27B (dense)	GPTQ-Marlin INT4, fp8 KV	~85 t/s	160K	vLLM 0.9+	active	Long-context workloads. fp8 KV halves VRAM vs bf16.
AEON (Qwen3 NVFP4)	NVFP4 (ModelOpt)	~69 t/s	122K	vLLM 0.9+	stopped	Blackwell-native quantization. MTP speculative decoding n=3.
Qwen3.6-27B (SSM hybrid)	Q4_K_M	~22 t/s	65K	llama.cpp	stopped	Mamba/SSM hybrid. Fast prefill (960 t/s), slow gen (SSM bottleneck).
Gemma 4 26B (CPU baseline)	Q4_K_M	~11.7 t/s	32K	llama.cpp	historical	CPU-only on cha0tikhome. Pre-tower reference. Consistent swap usage 4–8 GB.

Blackwell SM_120 fast CUDA kernel paths: q4_0 and f16 KV only. q8_0, q5_0, iq4_nl all degrade significantly. NVFP4 (ModelOpt) is Blackwell-native quantization — 4× smaller KV footprint vs bf16, requires CUDA 12.8+.

Inference Optimization — Autoresearch Log

39 automated experiments across two model architectures using a scripted autoresearch loop. Stopping criterion: +5 tok/s improvement per iteration. Full logs at randomchaos7800-hub.

Key Configurations — Qwen3.6-35B-A3B MoE (dual GPU)

Configuration	Gen Speed	Prompt Speed	Delta vs Baseline	Notes
Dual GPU, all-on-GPU, f16 KV	107 t/s	2,436 t/s	+50% vs single GPU	Final config. TP=2, PCIe x8+x4.
Single GPU, all-on-GPU, f16 KV	71 t/s	—	Pre-dual baseline	After CPU offload fix. Clean VRAM fit.
Full GPU offload (Exp 2 breakthrough)	70 t/s	222 t/s	+118% gen, +200% prompt	vs expert-on-CPU config.
Expert tensors on CPU (MoE routing)	32 t/s	74 t/s	−55%	CPU bottleneck on MoE expert routing.

CPU Inference Baseline — cha0tikhome (17-day log)

Metric	Value
Sample period	April 2–18, 2026
Model	Gemma 4 26B Q4_K_M
Avg gen throughput	10.49 tok/s
Range	5.02 – 11.72 tok/s
Avg time to first token	616 – 2,224 ms
Avg swap used	4.6 – 7.7 GiB
Speedup vs GPU tier	6.6× (10.5 → 74 tok/s)

Swap usage is the operationally significant CPU-tier finding. Co-locating five always-on agents and local inference on a 32 GB machine consistently drives swap. The architectural fix — separate inference and orchestration nodes — eliminates this entirely. Not a tuning problem; an architectural one.

Autoresearch Results by Category

Optimization delta from experiment baseline (MoE, dual GPU)

full GPU offload (MoE experts)

+118%

dual GPU (TP=2)

+50%

f16 KV cache

+12%

q8_0 KV cache

−18%

expert tensors on CPU

−55%

Agent Infrastructure

All agents run as systemd user services. Restart=always, StartLimitBurst=3, StartLimitIntervalSec=60. Recovery under 5 seconds on any crash. All inference via tower:8010. Slack Socket Mode (bi-directional) for interactive agents.

Agent	Framework	Interface	Model tier	Status
Frank (harness)	Python / custom	Slack Socket Mode :8890	Qwen3.6 MoE (local)	live
Mike	RelayV3 (Python)	Discord + Telegram + Slack + IRC	Local proxy :8010	live
Kato	TypeScript / Frank persona	Slack Socket Mode	Sonnet 4.6 / local	live
Dave (CFO)	TypeScript / Frank persona	Slack Socket Mode / #finance	Local proxy :8010	live
Hermes	Python / hermes-gateway.service	OpenRouter-backed gateway	OpenRouter	live
CJ Craig	TypeScript / Frank persona	Slack Socket Mode	Sonnet 4.6 / local	live
Morty	TypeScript / Frank persona	Slack Socket Mode	Haiku 4.5 (Agent SDK)	live
Sabrina	Python / bot.py	Telegram + Slack (REST)	Local proxy :8010	live
Chronicle	Python / cron	2am daily cron	Claude API	live

Service Hardening Details

All agents tested with kill -9 → verified restart under 5 seconds. Mattermost decommissioned 2026-03-01. All agents migrated to Slack Socket Mode. Zero external API dependencies for core agent function at Tier 2 hardware — OpenRouter and Claude API remain available as fallbacks, configured to alert loudly when triggered.

Raw Data & Artifacts

All benchmark data and logs referenced in the research paper are available publicly.

Artifact	Location	Contents
Live benchmark data	dinovitale.com/benchmarks.html	Inference metrics, MESA scores, LongMemEval results
Agent infrastructure	randomchaos7800-hub	Frank harness, benchmark tooling, optimization logs
Research paper	papers.html	Full preprint text, tables, references
Weekly build logs	dinovitale.com	Weekly recap posts with build decisions and operational notes