boundary.labs
Boundary Labs  /  Research Infrastructure

The Stack

Hardware specs, inference configuration, optimization history, and agent infrastructure. Everything documented. Raw numbers only.

Two-machine cluster connected by direct-wire Ethernet on an isolated /30 subnet (10.10.10.0/30). Sub-millisecond latency between nodes. cha0tikhome handles all orchestration and agent processes. cha0tiktower is inference-only. Both connected to Tailscale mesh for remote access.

cha0tiktower  — Primary Inference Node

System: CyberPowerPC GXi3400BSTV17
CPU: Intel Core Ultra 7 265F
CPU config: 20c/20t · 5.3 GHz boost
GPU: 2× RTX 5060 Ti 16 GB GDDR7
Architecture: Blackwell SM_120
VRAM total: 32 GB GDDR7 (TP=2)
PCIe config: x8 Gen5 + x4 Gen4
RAM: 32 GB DDR5 (192 GB max)
Storage: 2 TB PCIe 4 NVMe
CUDA: 12.8
Inference: vLLM + llama.cpp
Proxy: local-proxy :8010
OS: Ubuntu 24.04

cha0tikhome  — Orchestration Node

System: Beelink EQI12
CPU: Intel Core i5-1235U (12th Gen)
CPU config: 10c/12t · 4.4 GHz boost
RAM: 32 GB DDR4
Storage: 1 TB NVMe
OS: Ubuntu 24.04
Uptime: Always-on
Agents: Frank, Kato, CJ, Mike, Dave, Morty, Sabrina, Hermes
Web: nginx :8080 · cloudflared tunnel
Network: Direct wire to tower
VPN: Tailscale 100.94.10.36
Backup: 476 GB NVMe · 3am daily · 30-day retention

cha0tikmac  — Dev Workstation

System: MacBook Air M4
RAM: 32 GB unified
Role: Primary dev, Claude Code CLI
VPN: Tailscale 100.83.149.28

Network  — Topology

Cluster link: Direct Ethernet 10.10.10.0/30
External: Tailscale mesh (all nodes)
Inference proxy: tower:8010 (all clients)
Web routing: cloudflared → nginx :8080
SSH: Tailscale-only (not exposed)
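The direct /30 link could be expressed as a netplan fragment like this sketch. The interface name and which host takes which address are assumptions; a /30 leaves exactly two usable hosts, 10.10.10.1 and 10.10.10.2:

```yaml
# Hypothetical netplan for the isolated cluster link (one node shown).
network:
  version: 2
  ethernets:
    enp2s0:                        # interface name is an assumption
      addresses: [10.10.10.1/30]   # peer node takes 10.10.10.2
      # No gateway: the link is isolated, node-to-node traffic only.
```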

All inference clients route through a single local proxy on tower:8010. Backends are hot-swappable via config without reconfiguring any consuming agent. OpenAI-compatible API with model aliasing and auth. Switch command: proxy-switch genesis|aeon.
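Since the proxy is OpenAI-compatible, any client speaks the standard chat-completions shape. A minimal sketch using only the standard library; the model alias "genesis" and the placeholder API key are assumptions, not the proxy's actual auth scheme:

```python
# Sketch of a client for the tower:8010 proxy (OpenAI-compatible API).
import json
import urllib.request

PROXY_URL = "http://tower:8010/v1/chat/completions"

def build_request(prompt: str, model: str = "genesis") -> dict:
    """Build an OpenAI-style chat-completion payload; the proxy
    resolves the model alias to whichever backend is active."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def ask(prompt: str, api_key: str = "local") -> str:
    req = urllib.request.Request(
        PROXY_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the backend swap happens behind the proxy, `proxy-switch genesis|aeon` changes the serving model without touching any client code.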

107 tok/s peak gen (dual GPU, MoE) · Qwen3.6-35B-A3B, 2026-04-20
2,436 tok/s prompt (dual GPU) · prefill throughput
32 GB VRAM available (TP=2) · dual RTX 5060 Ti
0 swap used (GPU tier) · vs 4–8 GB on CPU tier
Model · Quant · Gen · Context · Server · Status · Notes
Qwen3.6-35B-A3B (MoE) · UD-Q4_K_M · ~107 t/s · 65K · llama.cpp · active · Genesis. MoE: only 3B active params per token. f16 KV. Default backend.
Qwen3 27B (dense) · GPTQ-Marlin INT4, fp8 KV · ~85 t/s · 160K · vLLM 0.9+ · active · Long-context workloads. fp8 KV halves VRAM vs bf16.
AEON (Qwen3 NVFP4) · NVFP4 (ModelOpt) · ~69 t/s · 122K · vLLM 0.9+ · stopped · Blackwell-native quantization. MTP speculative decoding n=3.
Qwen3.6-27B (SSM hybrid) · Q4_K_M · ~22 t/s · 65K · llama.cpp · stopped · Mamba/SSM hybrid. Fast prefill (960 t/s), slow gen (SSM bottleneck).
Gemma 4 26B (CPU baseline) · Q4_K_M · ~11.7 t/s · 32K · llama.cpp · historical · CPU-only on cha0tikhome. Pre-tower reference. Consistent swap usage 4–8 GB.
Blackwell SM_120 fast CUDA kernel paths: q4_0 and f16 KV only. q8_0, q5_0, iq4_nl all degrade significantly. NVFP4 (ModelOpt) is Blackwell-native quantization — 4× smaller KV footprint vs bf16, requires CUDA 12.8+.
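For the llama.cpp-served models, staying on the fast SM_120 path means pinning the KV cache to f16. A hypothetical invocation reflecting that finding; the model path, context size, and tensor split are assumptions, not the production flags:

```shell
llama-server \
  -m qwen3.6-35b-a3b-UD-Q4_K_M.gguf \
  -ngl 99 \
  --tensor-split 1,1 \
  -c 65536 \
  -ctk f16 -ctv f16
# -ctk/-ctv f16 keeps the KV cache on the fast Blackwell kernel path;
# q8_0, q5_0, or iq4_nl KV types fall off it and degrade throughput.
```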

39 automated experiments across two model architectures using a scripted autoresearch loop. Stopping criterion: +5 tok/s improvement per iteration. Full logs at randomchaos7800-hub.
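The stopping criterion can be sketched as a loop that halts once an iteration gains less than 5 tok/s over the best configuration so far; `measure` stands in for a real benchmark run, and the function name is illustrative, not the actual script:

```python
# Sketch of the autoresearch stopping rule: iterate candidate configs,
# stop when an iteration improves throughput by less than `min_gain`.
from typing import Callable, Iterable, Optional, Tuple

def autoresearch(configs: Iterable[dict],
                 measure: Callable[[dict], float],
                 min_gain: float = 5.0) -> Tuple[Optional[dict], float]:
    """Return the best config and its tok/s, stopping at the first
    iteration that fails to improve by at least `min_gain` tok/s."""
    best_cfg, best_tps = None, 0.0
    for cfg in configs:
        tps = measure(cfg)
        if tps < best_tps + min_gain:
            break                    # improvement below threshold: stop
        best_cfg, best_tps = cfg, tps
    return best_cfg, best_tps
```

In practice each `measure` call is a scripted benchmark run against the live server, so the loop doubles as the experiment log.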

Key Configurations — Qwen3.6-35B-A3B MoE (dual GPU)

Configuration · Gen Speed · Prompt Speed · Delta vs Baseline · Notes
Dual GPU, all-on-GPU, f16 KV · 107 t/s · 2,436 t/s · +50% vs single GPU · Final config. TP=2, PCIe x8+x4.
Single GPU, all-on-GPU, f16 KV · 71 t/s · n/a · Pre-dual baseline · After CPU offload fix. Clean VRAM fit.
Full GPU offload (Exp 2 breakthrough) · 70 t/s · 222 t/s · +118% gen, +200% prompt · vs expert-on-CPU config.
Expert tensors on CPU (MoE routing) · 32 t/s · 74 t/s · −55% · CPU bottleneck on MoE expert routing.

CPU Inference Baseline — cha0tikhome (17-day log)

Sample period: April 2–18, 2026
Model: Gemma 4 26B Q4_K_M
Avg gen throughput: 10.49 tok/s
Range: 5.02 – 11.72 tok/s
Avg time to first token: 616 – 2,224 ms
Avg swap used: 4.6 – 7.7 GiB
Speedup vs GPU tier: 6.6× (10.5 → 74 tok/s)
Swap usage is the operationally significant CPU-tier finding. Co-locating five always-on agents and local inference on a 32 GB machine consistently drives swap. The architectural fix — separate inference and orchestration nodes — eliminates this entirely. Not a tuning problem; an architectural one.

Autoresearch Results by Category

Optimization delta from experiment baseline (MoE, dual GPU)
full GPU offload (MoE experts)
+118%
dual GPU (TP=2)
+50%
f16 KV cache
+12%
q8_0 KV cache
−18%
expert tensors on CPU
−55%

All agents run as systemd user services. Restart=always, StartLimitBurst=3, StartLimitIntervalSec=60. Recovery under 5 seconds on any crash. All inference via tower:8010. Slack Socket Mode (bi-directional) for interactive agents.
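The unit settings above correspond to a user service along these lines; the unit name and ExecStart path are assumptions for illustration:

```ini
# ~/.config/systemd/user/frank.service (hypothetical path and command)
[Unit]
Description=Frank agent harness
StartLimitBurst=3
StartLimitIntervalSec=60

[Service]
ExecStart=/usr/bin/python3 %h/agents/frank/main.py
Restart=always
RestartSec=2

[Install]
WantedBy=default.target
```

Enabled with `systemctl --user enable --now frank.service`; Restart=always plus a short RestartSec keeps crash recovery under the 5-second target, while the StartLimit settings stop a crash loop after three failures in a minute.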

Agent · Framework · Interface · Model tier · Status
Frank (harness) · Python / custom · Slack Socket Mode :8890 · Qwen3.6 MoE (local) · live
Mike · RelayV3 (Python) · Discord + Telegram + Slack + IRC · Local proxy :8010 · live
Kato · TypeScript / Frank persona · Slack Socket Mode · Sonnet 4.6 / local · live
Dave (CFO) · TypeScript / Frank persona · Slack Socket Mode / #finance · Local proxy :8010 · live
Hermes · Python / hermes-gateway.service · OpenRouter-backed gateway · OpenRouter · live
CJ Craig · TypeScript / Frank persona · Slack Socket Mode · Sonnet 4.6 / local · live
Morty · TypeScript / Frank persona · Slack Socket Mode · Haiku 4.5 (Agent SDK) · live
Sabrina · Python / bot.py · Telegram + Slack (REST) · Local proxy :8010 · live
Chronicle · Python / cron · 2am daily cron · Claude API · live

Service Hardening Details

All agents tested with kill -9; restart verified under 5 seconds. Mattermost decommissioned 2026-03-01; all agents migrated to Slack Socket Mode. Zero external API dependencies for core agent function on Tier 2 hardware. OpenRouter and Claude API remain available as fallbacks, configured to alert loudly when triggered.

All benchmark data and logs referenced in the research paper are available publicly.

Artifact · Location · Contents
Live benchmark data · dinovitale.com/benchmarks.html · Inference metrics, MESA scores, LongMemEval results
Agent infrastructure · randomchaos7800-hub · Frank harness, benchmark tooling, optimization logs
Research paper · papers.html · Full preprint text, tables, references
Weekly build logs · dinovitale.com · Weekly recap posts with build decisions and operational notes