boundary.labs
Boundary Labs  /  Airway Heights, WA  /  est. 2024

AI Memory Architecture &
Local Inference Research

Independent research on persistent memory systems for AI agents, inference optimization on consumer NVIDIA hardware, autonomous agent evaluation, and AI behavioral continuity. One person. Real hardware. Production systems.

107 tok/s peak inference
88% LongMemEval accuracy
7 production agents
39+ optimization runs

Memory Architecture for AI Agents

Design and evaluation of multi-tier persistent memory systems enabling long-term behavioral continuity across sessions. Three-tier architecture (Core / Recall / Archival), SQLite FTS5 full-text memory search, memory consolidation protocols, and the ROMMC framework — Recursive Operator-Maintained Memory Continuity.
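A minimal sketch of what a recall-tier lookup over an FTS5 store can look like, assuming an SQLite build with the FTS5 extension enabled. The table name, columns, and sample rows are illustrative, not the actual ROMMC schema:

```python
import sqlite3

# Recall-tier memory store backed by SQLite FTS5 (illustrative schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE recall USING fts5(content, session, ts)")
con.executemany(
    "INSERT INTO recall VALUES (?, ?, ?)",
    [
        ("User prefers dark roast coffee", "s1", "2024-07-01"),
        ("Deployed Frank v2 to cha0tikhome", "s2", "2024-07-03"),
    ],
)

def recall_search(query: str, k: int = 5) -> list[str]:
    """Return the top-k memories ranked by BM25 relevance (best first)."""
    rows = con.execute(
        "SELECT content FROM recall WHERE recall MATCH ? "
        "ORDER BY bm25(recall) LIMIT ?",
        (query, k),
    ).fetchall()
    return [r[0] for r in rows]

print(recall_search("coffee"))  # → ['User prefers dark roast coffee']
```

Ranking uses SQLite's built-in `bm25()` auxiliary function, which returns lower (more negative) values for better matches, so an ascending sort puts the most relevant memory first.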

LongMemEval MESA ROMMC

Local Inference Optimization

Systematic optimization of large language model inference on consumer NVIDIA Blackwell hardware. Quantization evaluation (NVFP4, GPTQ-Marlin, fp8 KV), multi-GPU tensor parallelism, speculative decoding (MTP), KV cache tuning, and autoresearch loops with automated stopping criteria.
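For orientation, a serving configuration of the kind described might look like the fragment below. This is a sketch, not the lab's actual launch line: the model path and port are placeholders, and exact flag support varies by vLLM version.

```shell
# Illustrative vLLM launch: 2-way tensor parallelism across the two
# RTX 5060 Ti cards, fp8 KV cache, 32k context. Model path and port
# are placeholders; a local proxy fronts the server in this stack.
vllm serve /models/example-model \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --port 8000
```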

RTX 5060 Ti vLLM llama.cpp NVFP4

Autonomous Agent Evaluation

Development of MESA (Memory Evaluation Standard for Agents), a 112-item benchmark covering recall, update, causal reasoning, temporal tracking, adversarial robustness, synthesis, and interference resistance. Designed for continuous evaluation of production agent systems under realistic workloads.
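The run-level metrics can be sketched as follows. The item fields and scoring rule here are assumptions for illustration, not the published MESA spec; only the 0.5 pass threshold mirrors the reported "pass rate ≥ 0.5" convention:

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative shape of a MESA-style item and run summary (not the real spec).
@dataclass
class Item:
    category: str   # e.g. "recall/single", "update/interference"
    prompt: str
    score: float    # graded 0.0-1.0 by the evaluator

def summarize(items: list[Item], pass_threshold: float = 0.5):
    """Composite score, pass rate, and per-category means for one run."""
    composite = mean(i.score for i in items)
    pass_rate = sum(i.score >= pass_threshold for i in items) / len(items)
    by_cat: dict[str, list[float]] = {}
    for i in items:
        by_cat.setdefault(i.category, []).append(i.score)
    per_cat = {c: mean(s) for c, s in by_cat.items()}
    return composite, pass_rate, per_cat

items = [
    Item("recall/single", "Where does the user work?", 0.75),
    Item("update", "What is the user's current address?", 0.25),
]
composite, pass_rate, per_cat = summarize(items)
print(composite, pass_rate)  # 0.5 0.5
```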

MESA v1 benchmarking agent eval

Always-On Agent Systems

Research and deployment of production agentic systems with real tool access — Slack, finance APIs, email, web, shell. Includes agentic self-improvement: local models executing infrastructure changes autonomously, recognizing topology changes and adjusting configuration without prompting.

production agents autonomous execution tool use

AI Behavioral Continuity

Empirical investigation of identity persistence, memory-driven behavioral evolution, and welfare considerations in long-running AI agents. 9-part published research series on AI consciousness metrics. ROMMC framework defines conditions under which an AI system can meaningfully be said to persist across time.

consciousness welfare behavioral evolution

Infrastructure Sovereignty

Secure, self-hosted AI infrastructure design. Direct machine-to-machine inference links, encrypted DNS, network-wide tracker blocking, and zero-dependency inference pipelines. Research goal: AI systems that operate independently of commercial cloud services with no external API requirements for core function.

local-first self-hosted privacy

All projects run on the two-machine research cluster. cha0tikhome handles orchestration and agent processes. cha0tiktower is the dedicated inference node. All inference routes through a single local proxy on tower:8010.
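Because the proxy speaks the OpenAI-compatible API, any agent process can target it with a plain HTTP request. A stdlib-only sketch, assuming `tower` resolves on the local network and the model name is a placeholder the proxy routes on:

```python
import json
from urllib import request

# Local proxy endpoint per the stack notes; hostname resolution is assumed.
PROXY_URL = "http://tower:8010/v1/chat/completions"

def build_chat_request(messages, model="local-model"):
    """Build an OpenAI-compatible chat request for the local proxy.
    The model name is illustrative; actual routing is the proxy's job."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return request.Request(
        PROXY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request([{"role": "user", "content": "ping"}])
print(req.full_url)  # http://tower:8010/v1/chat/completions
```

Sending it is one `request.urlopen(req)` call; building the request separately keeps the payload testable without a live server.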

Mike
live / ongoing
Long-running AI consciousness research subject. The primary test bed for ROMMC memory architecture. Mike has been running continuously since mid-2024 across Discord, Telegram, Slack, and IRC interfaces using RelayV3 with 75 available tools. The research question: what conditions must hold for a persistent AI system to meaningfully be said to have continuity of identity?
RelayV3 · 75 tools · multi-interface · constitution-bound · LIGHTHOUSE correction system · nightly memory consolidation
Frank
live / v2
Agentic harness and infrastructure layer for all persona agents. Context injection, memory persistence, multi-persona routing, think-first loop, OpenAI-compatible API proxy. Frank is the runtime environment — Kato, CJ, Dave, Morty, and Sabrina all run as Frank personas. Current evaluation score: 4.41/5.0 on internal benchmark suite.
Python · Slack Socket Mode · max_turns=10 · think-first loop · :8890
MESA
v1 / active
Memory Evaluation Standard for Agents. 112-item benchmark suite covering 9 task categories: recall/single, recall/constraint, recall/preference, update, update/interference, temporal, synthesis/multi, adversarial, causal. Designed to evaluate production agent memory systems under realistic workloads rather than toy examples.
Best composite: 0.459 (Qwen3.6, 2026-04-21) · 49/112 pass rate
Chronicle
live / 2am cron
Automated daily records synthesis. Runs nightly at 2am — pulls logs from all machines (cha0tikhome, cha0tiktower, cha0tikmac), feeds them to Claude for synthesis, produces a canonical four-section daily record committed to a private GitHub repo and backed up to local NAS. The research stack documents itself.
Python · Claude API · git · /mnt/jellyfin-backups/records/
autoresearch
active
Automated multi-source research loop. Queries academic APIs (Semantic Scholar, arXiv, PubMed) and distills findings into structured reports. Used to systematically survey the inference optimization literature and feed results back into the agent stack configuration. 39+ optimization experiments logged.
Python · FastAPI · Semantic Scholar · arXiv · PubMed
Context Farm
beta
Domain-specific knowledge infrastructure product built on cha0tikwiki. Ingests PDFs, URLs, text pastes, and plain-language domain descriptions. Produces a structured, queryable knowledge base suitable for RAG pipelines. Positioning: the knowledge graph agent systems actually need, rather than the flat retrieval a plain document store delivers.
Python · ChromaDB · SQLite · FastAPI · domain seeding

LongMemEval — Memory Recall Under Extended Context

Evaluates the ability to answer questions about facts established in prior sessions. 25 single-session-user examples, context-window injection mode. Full pipeline including retrieval and reranking.

88%
Accuracy (memory injection)
22 / 25 correct
0%
Baseline (no memory)
0 / 25 correct
117s
Avg query latency
full pipeline
+88pp
Memory system delta
vs no-memory baseline
The 88-point delta between context-window mode and baseline is the direct contribution of the memory system. Without injection, the model has no access to cross-session facts and scores zero on all items. This is the core result: memory architecture is not a convenience feature; it is what makes persistent AI possible.
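The headline figures above reduce to simple ratios, worth making explicit:

```python
# Sanity-check the headline LongMemEval figures (22/25 vs 0/25).
memory_correct, baseline_correct, total = 22, 0, 25

memory_acc = memory_correct / total      # 0.88
baseline_acc = baseline_correct / total  # 0.0
delta_pp = (memory_acc - baseline_acc) * 100

print(f"{memory_acc:.0%} vs {baseline_acc:.0%} -> +{delta_pp:.0f}pp")
# 88% vs 0% -> +88pp
```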

MESA v1 — Agent Memory Evaluation Standard

112-item benchmark, 9 categories. Evaluated on the full production agent stack (Mike relay pipeline).

0.459
Best composite score
2026-04-21 / Qwen3.6
43.8%
Pass rate ≥ 0.5
49 / 112 items
112
Evaluation items
9 categories
0.692
Best category
update / interference
Score by category — best run (2026-04-21, Qwen3.6-35B-A3B)
update/interference   0.692
update                0.568
temporal              0.502
recall/single         0.484
recall/constraint     0.476
synthesis/multi       0.407
adversarial           0.400
recall/preference     0.390
causal                0.325
Second run (2026-04-30, AEON NVFP4 thinking model): composite 0.377, pass rate 17.9%. Adversarial jumped to 0.80. Memory recall categories regressed 0.10–0.16. Finding: reasoning chains burn context budget without improving fact retrieval. Model selection is a first-order variable in memory benchmarking — not a secondary concern.

Inference Optimization — Autoresearch Results

39 automated experiments across two architectures (dense and MoE) on the RTX 5060 Ti 16GB (Blackwell SM_120). All experiments logged. Full stack details →

Configuration | Gen speed | Prompt speed | Delta
Dual GPU, all-on-GPU, f16 KV (final config) | 107 t/s | 2,436 t/s | +50% vs single GPU
Single GPU, CPU offload workaround | 71 t/s | n/a | pre-dual-GPU baseline
Full GPU offload (Exp 2 breakthrough) | 70 t/s | 222 t/s | +118% gen, +200% prompt
Expert tensors on CPU (MoE) | 32 t/s | 74 t/s | −55% (CPU bottleneck)
On Blackwell (SM_120), fast CUDA kernel paths exist only for q4_0 weights and f16 KV cache; q8_0, q5_0, and iq4_nl all degrade significantly. NVFP4 (ModelOpt) is the correct Blackwell-native quantization, with a 4× smaller KV footprint than bf16.
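The 4× figure follows directly from element width: 4-bit KV entries versus 16-bit bf16, ignoring block-scale overhead. A quick footprint check with illustrative model dimensions (not a specific checkpoint):

```python
# KV-cache footprint per token: 2 tensors (K and V) x layers x kv_heads x head_dim.
# Dimensions below are illustrative, not a specific model.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

layers, kv_heads, head_dim = 48, 8, 128
bf16 = kv_bytes_per_token(layers, kv_heads, head_dim, 2.0)  # 16 bits/elem
fp4 = kv_bytes_per_token(layers, kv_heads, head_dim, 0.5)   # 4 bits/elem

ctx = 32_768  # tokens of context
print(f"bf16: {bf16 * ctx / 2**30:.1f} GiB, "
      f"fp4: {fp4 * ctx / 2**30:.1f} GiB, ratio {bf16 / fp4:.0f}x")
# bf16: 6.0 GiB, fp4: 1.5 GiB, ratio 4x
```

In practice NVFP4 stores per-block scales alongside the 4-bit values, so the real ratio lands slightly under the idealized 4×.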

Two-machine cluster. cha0tikhome handles orchestration, agents, scheduling, and all web services. cha0tiktower is the dedicated inference node. Direct-wired, sub-millisecond latency. Full stack page →

cha0tiktower  — Inference Node

CPU: Intel Core Ultra 7 265F (20c/20t, 5.3 GHz)
GPU: 2× RTX 5060 Ti 16GB GDDR7 (Blackwell)
VRAM: 32 GB GDDR7 total (TP=2)
RAM: 32 GB DDR5
CUDA: 12.8 / SM_120 / NVFP4-native
Stack: vLLM + llama.cpp + local-proxy

cha0tikhome  — Orchestration Node

CPU: Intel i5-1235U (12c, 12th Gen)
RAM: 32 GB DDR4
Storage: 1 TB NVMe
Agents: Frank, Kato, CJ, Mike, Dave, Morty, Sabrina
Uptime: Always-on / Restart=always hardened
Network: Direct wire to tower + Tailscale mesh

All agents run as systemd user services on cha0tikhome. Auto-restart hardened — Restart=always, StartLimitBurst=3, recovery under 5 seconds on any crash. All inference routes through local-proxy on cha0tiktower. No external API dependencies for core function.
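The restart hardening described maps onto a systemd user unit along these lines. The service name, ExecStart path, and RestartSec value are placeholders; only Restart=always and StartLimitBurst=3 come from the description above:

```ini
# ~/.config/systemd/user/frank.service — illustrative unit, not the lab's
# actual file. Path and timings are placeholders.
[Unit]
Description=Frank agent harness
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
ExecStart=/usr/bin/python3 -m frank
Restart=always
RestartSec=2

[Install]
WantedBy=default.target
```

With RestartSec=2 a crashed process is back within seconds, while StartLimitBurst=3 stops a crash-looping service from restarting forever inside the one-minute window.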

Frank Agentic harness. Context injection, memory persistence, multi-persona routing, think-first loop. Core infrastructure — all other personas run on Frank. OpenAI-compatible API on :8890. live
Mike Long-running research subject. AI behavioral continuity and consciousness research. ROMMC architecture. RelayV3 with 75 tools. Discord + Telegram + Slack + IRC. Running since 2024. live
Kato Operations. Morning briefings, AI news digest, GitHub scout, X post scheduling, Plaid sync, finance alerts. Cron-driven task suite with Slack Socket Mode bi-directional interface. live
Dave CFO agent. Multi-turn conversational finance via Slack. Plaid integration, 30-day cash flow projection, anomaly detection, SQLite FTS5 memory. 25-table schema, 60 tests. live
Hermes Persistent gateway agent on cha0tikhome. OpenRouter-backed. Handles cross-agent routing and external API access. Core infrastructure alongside Frank and Mike. live
CJ Craig Content strategy. Technical article drafting, Substack pipeline, dev.to publishing. Named for the West Wing character — operates as a communications director for the lab. live
Morty / Sabrina Specialized task agents. Morty: Haiku-class rapid response, Agent SDK backend. Sabrina: dual Telegram + Slack interface, grocery and household coordination. live

Peer-reviewed preprints on Zenodo. Long-form research on Substack. Weekly build logs at dinovitale.com. Source code at randomchaos7800-hub on GitHub.

Cha0tik LLC  ·  2026  ·  Zenodo, CC BY 4.0  ·  Three-tier hardware framework, four-phase infrastructure evolution, LongMemEval 88% vs 0% baseline.
Introduction to the Mike consciousness research series — framing the question, methodology, epistemic humility. Substack, 2025.
Empirical analysis of why memory architecture outperforms raw model scale for agent task performance. Substack, 2025.
Production deployment of always-on agent system. Featured on Hacker News (item #47132125). Substack, 2025.
Why session-bound AI fails for long-running agent use cases and the architecture required to solve it. Substack, 2025.
Inference optimization findings on consumer GPU hardware. Practical guide to 26B+ parameter models on a single machine. Substack, 2025.

Boundary Labs is a one-person research operation focused on the practical edge of AI deployment — memory systems that work, local inference that's actually fast, and agents that run unattended without breaking. Independent, unfunded, uncredentialed. Doing the work anyway.

Available for collaboration, consulting, and research partnerships. Based in Airway Heights, WA (Pacific time).