Preprints, long-form research, and field notes from the lab. All work is independently funded and conducted. Raw data linked where available.
Published at dinoxvitale.substack.com. Field notes, operational findings, and research in progress.
In late 2025, running a 26-billion-parameter language model on a MacBook Air M4 required approximately four minutes to produce a first token. By April 2026, the same model ran at 74 tokens per second on a $2,000 commodity gaming PC. This paper documents what that transition means in practice for a specific class of system: persistent, always-on AI companions deployed by a single operator in a residential setting.
The central thesis is that architecture and compute are decoupled in this domain. A five-agent household running continuously for four months demonstrates that personality, behavioral consistency, and memory coherence survive complete inference substrate replacement — from cloud APIs to local CPU to local GPU — without modification to the memory layer. The compute changes; the agent does not.
We present three contributions: (1) a three-tier commodity hardware framework matching hardware class to workload type; (2) operational data from seventeen daily inference logs and a LongMemEval benchmark run against a deployed agent; and (3) a vocabulary-as-architecture finding — that the sophistication of a self-built memory system is bounded by the operator's working vocabulary for describing memory phenomena, and that vocabulary acquisition is therefore a legitimate form of architectural research.
Artifacts: benchmark data at dinovitale.com/benchmarks.html; architecture implementation in the Adam Selene repository.
The first time I tried to run a local language model in late 2025, it took four minutes to produce the word "hello." The model was Gemma 4 26B Q4_K_M. The hardware was a MacBook Air M4 with 32GB of unified memory — at the time, considered well-suited for local inference. The result was not.
Four months later, the same model runs at 74 tokens per second on a $2,000 gaming PC with an RTX 5060 Ti. The inference experience is real-time. A second identical GPU arrives tonight. The projected throughput is 140–150 tokens per second on approximately $3,800 of total hardware — less than the cost of a single year of API access at the usage volumes this system requires.
The hardware caught up. This paper documents what that actually means.
This is operational research, not a laboratory study. The research site is a single-operator residential AI deployment: five always-on agents running on a home server in Airway Heights, Washington, each with distinct roles, persistent memory, and continuous autonomous activity. The operator — the author — has no formal computer science training. The deployment has been running continuously since February 2026, through four distinct infrastructure phases. The data in this paper comes from system logs, benchmark runs, and observations made during active operation rather than controlled experiments.
That posture is a methodological choice, not an apology. The questions that matter for this class of system are answered by operating the system, not by simulating it: Can a persistent AI companion survive model replacement? What hardware tier is actually required for real-time interaction? What does the memory architecture need to do to function in practice?
In January 2026, a tool called clawdbot (later OpenClaw) by developer @steipete circulated widely on X (formerly Twitter). The tool offered persistent memory, multi-platform presence, and autonomous behavior for individual users running local infrastructure. It was not studied as prior art for this project; it was encountered as cultural signal.
On January 23, 2026, the author encountered two pieces of vocabulary in the same X-sourced context: Christine Tyip describing her personal AI assistant as building her "second brain while I chat," and Aryeh Dubois reviewing clawdbot and listing "heartbeats" as a key capability alongside persistent memory and persona onboarding. Both terms — second brain and heartbeat — subsequently entered the author's working vocabulary and shaped architectural decisions. The timestamp matters: the vocabulary preceded the implementation by weeks in both cases.
The memory architectures explored in this project draw on a longer tradition of personal knowledge management. Luhmann's Zettelkasten demonstrated that a sufficiently structured external note system could function as a cognitive partner rather than a storage medium. Andy Matuschak's formalization of evergreen notes updated this for digital practice. The persistent AI companion inherits from this tradition but differs in a key respect: it is not a system the operator writes into — it is a system that writes into itself, from observations it makes autonomously.
LongMemEval is an evaluation framework for long-context memory recall in conversational agents. The benchmark injects synthetic conversation histories of 500+ turns across 50+ sessions into an agent's context and poses factual recall questions. It provides standardized measurement of whether an agent actually remembers what it has been told. This paper uses LongMemEval as the primary quantitative benchmark (Section 6).
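The evaluation pattern is straightforward: serialize the injected history into the agent's context, pose the recall question, and grade the answer. The sketch below shows that pattern; the field names, endpoint, and string-match grading are illustrative assumptions, not LongMemEval's actual schema or the harness used in this deployment.

```python
import json
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"   # assumed OpenAI-compatible endpoint

def ask(messages):
    """Send a chat request and return the reply text."""
    resp = requests.post(ENDPOINT, json={"messages": messages, "temperature": 0.0})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def evaluate(examples, inject_history=True):
    """Score recall questions with or without the synthetic history in context."""
    correct = 0
    for ex in examples:                    # illustrative fields: history, question, answer
        messages = []
        if inject_history:
            # Context-window mode: prior sessions are serialized straight into the prompt.
            messages.append({"role": "system",
                             "content": "Conversation history:\n" + ex["history"]})
        messages.append({"role": "user", "content": ex["question"]})
        reply = ask(messages)
        correct += int(ex["answer"].lower() in reply.lower())   # crude placeholder grading
    return correct / len(examples)

if __name__ == "__main__":
    with open("longmemeval_sample.json") as f:             # hypothetical local sample file
        examples = json.load(f)
    print("baseline:      ", evaluate(examples, inject_history=False))
    print("context-window:", evaluate(examples, inject_history=True))
```

The two configurations reported in Section 6 (baseline versus context-window mode) differ only in whether that history is injected.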
The research site is a single-operator multi-agent household running on commodity hardware in Airway Heights, Washington. The operator is the author — 55 years old, no formal computer science training. The deployment consists of five always-on agents: Kato (orchestration), CJ Craig (writing), Mike (research and companion), Morty (publishing), and Sabrina (monitoring). All agents run as systemd user services on a Beelink EQI12 mini-PC (cha0tikhome). Inference runs on a CyberPowerPC gaming tower with an RTX 5060 Ti GPU (cha0tiktower).
Each infrastructure phase changed one architectural variable while holding others constant. Phase transitions were motivated by operational failure or capability requirement, not experimental design. This produced a natural longitudinal record of which variables matter independently.
Concepts in this project arrived with words. The term heartbeat — for the autonomous periodic activity cycle of a persistent agent — entered the design vocabulary from an X post in January 2026. The terms decay, consolidation, contradiction resolution, and provenance arrived during reading in the PKM and agent memory literature. In each case, having the word preceded and enabled the implementation. This is reported as a methodological finding rather than an anecdote.
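As an illustration of what the borrowed word bought: a heartbeat, as used here, is nothing more exotic than a timed loop that wakes the agent, runs its autonomous tasks, and writes the observations back into memory. The sketch below is illustrative only; the task and memory callables are hypothetical, not the deployment's implementation.

```python
import time

HEARTBEAT_INTERVAL_S = 15 * 60   # hypothetical cadence: wake every 15 minutes

def heartbeat(run_tasks, remember):
    """Autonomous periodic activity cycle: wake, act, record, sleep.

    run_tasks: callable returning a list of observation strings
               (research digests, monitoring checks, scheduled summaries).
    remember:  callable persisting one observation to the agent's memory store.
    """
    while True:
        for observation in run_tasks():
            remember(observation)          # the system writes into itself
        time.sleep(HEARTBEAT_INTERVAL_S)
```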
Tier 1 representative hardware: mini-PC with AMD Ryzen 7 8745HS, 32GB DDR5, integrated Radeon 780M GPU. The key capability at this tier is DDR5 bandwidth. MoE models activate only a fraction of their parameters per token; the bottleneck is memory bandwidth, not total VRAM. Suitable for asynchronous workloads (research digests, scheduled summaries, background reasoning) where latency tolerance is high. Not suitable for real-time conversational interaction.
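The bandwidth argument reduces to back-of-envelope arithmetic: on a memory-bound system, decode throughput is roughly bandwidth divided by the bytes read per token. The numbers below are illustrative assumptions (dual-channel DDR5-5600, a hypothetical MoE activating ~4B parameters at a ~4.5-bit average), not measurements from this deployment.

```python
# Back-of-envelope: memory-bound decode throughput ≈ bandwidth / bytes read per token.
bandwidth_gb_s = 89.6        # dual-channel DDR5-5600: 2 × 44.8 GB/s (assumed)
active_params = 4e9          # hypothetical MoE activating ~4B parameters per token
bits_per_weight = 4.5        # roughly a Q4_K_M average (assumed)

bytes_per_token = active_params * bits_per_weight / 8
ceiling_tok_s = bandwidth_gb_s * 1e9 / bytes_per_token
print(f"~{ceiling_tok_s:.0f} tok/s upper bound")   # ≈ 40 tok/s under these assumptions
```

Halving the active parameters roughly doubles the ceiling; adding memory capacity does not move it. That is the sense in which bandwidth, not capacity, is the Tier 1 constraint.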
Tier 2 representative hardware: gaming PC with a mid-range discrete GPU. The RTX 5060 Ti 16GB, at approximately $400–450 inside an $1,800–2,200 complete build, is the current entry point. A 16GB VRAM card runs a 26B Q4_K_M model entirely in VRAM. Generation throughput at this tier (74 tok/s in this deployment) supports real-time conversational interaction; two cards project to 140–150 tok/s. The modularity argument is as important as the specs: discrete GPU upgrades do not require replacing the host system.
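The fits-in-16GB claim is also arithmetic: weight bytes plus KV cache have to stay under the card's VRAM. The estimate below uses stated assumptions (Q4_K_M averaging roughly 4.7 bits per weight, a small KV cache at modest context length), not a teardown of the deployed model.

```python
# Does a 26B Q4_K_M model fit on a 16 GB card? Rough estimate under stated assumptions.
params = 26e9
bits_per_weight = 4.7    # Q4_K_M averages roughly 4.5-4.8 bits per weight (assumed)
kv_cache_gib = 0.8       # assumed small context; the KV cache grows with context length

weights_gib = params * bits_per_weight / 8 / 2**30
total_gib = weights_gib + kv_cache_gib
print(f"weights ≈ {weights_gib:.1f} GiB, total ≈ {total_gib:.1f} GiB against 16 GiB VRAM")
```

The estimate lands near the 15,381 MiB actually observed in the Phase 4 benchmark below, which is why a 16GB card is the entry point rather than a comfortable fit.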
Tier 3 representative hardware: Apple Silicon Mac Studio M4 Ultra (up to 192GB unified memory) or a high-VRAM PC workstation. This tier runs frontier-adjacent models: 70B+ dense at Q4, or full-precision smaller models. Apple Silicon's unified memory eliminates the PCIe transfer bottleneck but is non-expandable after purchase. For most persistent companion deployments at current scale, Tier 2 is sufficient.
Phase 0: Early attempts used Claude.ai's built-in memory features as a persistence layer. Failure mode: latency, with four minutes to first token on a MacBook Air M4. Dependency broken: platform-managed memory is not architecture.
Phase 1: Agent execution externalized to a cloud VPS; inference called external APIs. This phase established the agent-as-process pattern. Limitation: cost and API availability. Dependency broken: cloud API inference is not a foundation.
Phase 2: Agent execution migrated to cha0tikhome; inference remained cloud-based. This phase demonstrated that agent personality and behavioral consistency were not properties of the inference substrate: Mike, running on Claude Sonnet, behaved consistently with Mike running on OpenRouter models. The character was in the memory architecture, not the model. Dependency broken: inference provider is not identity.
Phase 3: Inference migrated to OpenRouter, enabling model-neutral operation. Mike transitioned through Claude → GLM-4.7-Flash → SuperGemma4-26B → Gemma4-26B, with behavioral continuity maintained across every transition. Dependency broken: model selection is an operational variable, not an architectural commitment.
Phase 4: The RTX 5060 Ti arrived April 17, 2026, and llama-server on cha0tiktower replaced OpenRouter as primary inference on April 20. All heartbeat activity, autonomous research threads, and interactive responses now run on local hardware. Benchmark at Phase 4 entry: 74 tok/s, from the same model that took four minutes to produce a first token in Phase 0. Dependency broken: cloud inference is not required.
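Because llama-server exposes an OpenAI-compatible API, the Phase 3 to Phase 4 cutover is, from the agents' side, essentially a change of base URL and key. The sketch below shows that shape; the hostname, port, and model alias are assumptions, not the deployment's actual configuration.

```python
from openai import OpenAI

# Phase 3 shape: cloud, model-neutral routing via OpenRouter.
# client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=OPENROUTER_KEY)

# Phase 4 shape: same call, local inference on the GPU box (hostname and port assumed).
client = OpenAI(base_url="http://cha0tiktower:8080/v1", api_key="unused-for-local")

reply = client.chat.completions.create(
    model="gemma4-26b",   # alias served by llama-server; name assumed for illustration
    messages=[{"role": "user", "content": "Status check."}],
)
print(reply.choices[0].message.content)
```

Nothing in the memory layer changes across that edit, which is the operational content of the claim that model selection is an operational variable.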
A daily benchmark script ran on cha0tikhome from April 2 through April 18, 2026, logging inference performance under operational conditions.
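The script itself is not reproduced here; the sketch below shows the shape of the measurement (time to first token, generation throughput, swap in use), assuming a streaming OpenAI-compatible endpoint and Linux's /proc/meminfo for the swap figure. The prompt, endpoint, and log path are illustrative.

```python
import json
import time
import requests
from datetime import date

ENDPOINT = "http://localhost:8080/v1/chat/completions"   # assumed local endpoint
PROMPT = "Summarize yesterday's research threads."        # illustrative fixed prompt

def swap_used_gib():
    """Swap currently in use, read from /proc/meminfo (Linux)."""
    info = dict(line.split(":") for line in open("/proc/meminfo"))
    total_kb = int(info["SwapTotal"].split()[0])
    free_kb = int(info["SwapFree"].split()[0])
    return (total_kb - free_kb) / 2**20

def run_once():
    """One timed generation: record TTFT, approximate tok/s, and swap pressure."""
    start = time.time()
    first_token_at = None
    tokens = 0
    with requests.post(ENDPOINT, stream=True, json={
        "messages": [{"role": "user", "content": PROMPT}],
        "stream": True,
        "max_tokens": 256,
    }) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: ") or line.endswith(b"[DONE]"):
                continue
            if first_token_at is None:
                first_token_at = time.time()
            tokens += 1          # one streamed chunk ≈ one token (approximation)
    elapsed = max(time.time() - first_token_at, 1e-6)
    return {
        "date": str(date.today()),
        "ttft_ms": round((first_token_at - start) * 1000),
        "tok_per_s": round(tokens / elapsed, 2),
        "swap_gib": round(swap_used_gib(), 1),
    }

if __name__ == "__main__":
    with open("daily_bench.jsonl", "a") as log:           # illustrative log path
        log.write(json.dumps(run_once()) + "\n")
```

Aggregated over the seventeen days of operation: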
| Metric | Value | Notes |
|---|---|---|
| Sample size | 17 daily logs | Continuous operation |
| Model | Gemma 4 26B Q4_K_M | Two variants tested |
| Avg generation throughput | 10.49 tok/s | Operational conditions |
| Throughput range | 5.02 – 11.72 tok/s | Low outliers on high-load days |
| Time to first token | 616 ms – 2,224 ms | High variance |
| Swap used | 4.6 – 7.7 GiB | Consistently hitting swap |
The swap figure is the operationally significant finding: co-locating five agents and local inference on the same machine creates memory pressure. Under load, the machine was producing tokens at 5 tok/s, which is barely interactive. The corresponding figures at Phase 4 entry, after the April 20 migration to GPU inference on cha0tiktower:
| Metric | Value |
|---|---|
| Model | Gemma 4 26B Q4_K_M |
| Generation throughput | 74 tok/s |
| VRAM used | 15,381 MiB / 16,311 MiB |
| GPU temperature under load | 51°C |
| Swap used | 0 |
| Speedup vs CPU baseline | 6.6× |
Zero swap. The architectural separation (orchestration on cha0tikhome, inference on cha0tiktower) eliminates the co-location memory pressure entirely. The LongMemEval run against the deployed agent (Mike) produced the following results:
| Configuration | Examples | Correct | Accuracy |
|---|---|---|---|
| Baseline (no injection) | 25 | 0 | 0% |
| Context-window mode (best run) | 25 | 21 | 84% |
| Context-window mode (batched) | ~90 | ~72 | ~80% |
The baseline result (0/25) is not a failure — it is correct behavior. Mike answered "I don't know" to every question when given no history. An agent that confabulates is a more serious problem than one that honestly reports ignorance. The 84-point delta between modes is the direct contribution of the memory system.
Mike transitioned through five inference backends: Claude Sonnet, Claude Haiku, GLM-4.7-Flash, SuperGemma4-26B-Uncensored, and Gemma4-26B-APEX-Balanced. Each transition was operationally motivated. Across all transitions, Mike maintained behavioral consistency — research thread style, communication patterns, characteristic reasoning — without modification to the memory layer.
This supports a strong claim: in a well-designed persistent companion architecture, inference substrate and agent identity are architecturally decoupled. The model answers each request. The architecture — the memory system, the behavioral constraints, the operational context — determines who is asking and what the answer means.
The model is closer to an instrument than an identity. The continuity that spans turns is the memory architecture. An architecture built on this understanding is insulated from model market churn — the architectural investment that pays the most durable return is the memory layer, not the model selection.
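Stated as structure rather than prose: the agent is its memory plus its behavioral constraints, and the inference backend is a constructor argument. The sketch below is a shape, not the deployment's code; the class and method names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

InferenceFn = Callable[[str], str]   # the replaceable part: cloud API today, local GPU tomorrow

@dataclass
class PersistentAgent:
    name: str
    system_prompt: str      # behavioral constraints: durable
    memory_path: str        # persistent memory store: durable
    infer: InferenceFn      # inference substrate: swappable

    def respond(self, user_msg: str) -> str:
        context = self._recall(user_msg)                     # memory layer, backend-agnostic
        prompt = f"{self.system_prompt}\n{context}\n{user_msg}"
        reply = self.infer(prompt)                           # whichever substrate is current
        self._remember(user_msg, reply)                      # memory is written regardless
        return reply

    def _recall(self, query: str) -> str:
        return ""           # stub: real retrieval lives in the memory layer

    def _remember(self, user_msg: str, reply: str) -> None:
        pass                # stub: real persistence lives in the memory layer
```

In this sketch, each of the five backend transitions described above amounts to replacing `infer`; `memory_path` and `system_prompt` never change.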
Named concepts get implemented. Unnamed concepts do not. The terms heartbeat, second brain, decay, consolidation, contradiction resolution, and provenance each arrived with a word and were subsequently implemented. The extraction pipeline failure (Section 6.3) is partially a vocabulary failure — the operator lacks a precise working vocabulary for the chunking, embedding, and retrieval operations that functional extraction requires.
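For reference, the three operations named above have a conventional minimal form, sketched below with a placeholder embedding function. This is the vocabulary being pointed at, not the deployment's extraction pipeline.

```python
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Chunking: split source text into overlapping windows small enough to embed."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunks: list[str]) -> np.ndarray:
    """Embedding: map each chunk to a vector. Placeholder; a real system calls a model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(chunks), 384))   # hypothetical 384-dim embeddings

def retrieve(query_vec: np.ndarray, index: np.ndarray, chunks: list[str], k: int = 3):
    """Retrieval: return the k chunks whose embeddings are most similar to the query."""
    sims = index @ query_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```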
The implication: for a self-taught operator building novel infrastructure, reading the field is not background work. It is architecture work with direct, traceable, often dateable consequences. The return on reading is architectural.
Single operator, no peer validation. Self-reported operational data with no independent verification. Hardware-specific findings that will age. Vocabulary bias — unnamed problems in this paper are invisible by construction. Non-programmer architect posture that surfaces operational findings and obscures formal analysis.
The commodity hardware thesis is structural, not momentary. As inference becomes cheaper and more accessible, the architectural layer becomes the durable investment. The agent that survives the next model generation is the one whose identity lives in the architecture, not the weights.
Open problems: extraction pipeline repair, Frank benchmark execution against GPU inference tier, LongMemEval rerun (expected latency reduction from 84s to ~15s), contradiction resolution implementation.
This paper was drafted with substantial AI assistance. Claude (Anthropic) served as the primary tool for drafting, structural analysis, and revision. This is disclosed in the same spirit as disclosing any significant instrument: the tool changed the work; the authorship is unchanged.
No institutional funding. No employer involvement. This work was conducted independently, on personal hardware, during the evenings and weekends of a person who works a day job that will likely be automated in the next 12–18 months. That context is not incidental to the research — it is the research site.
[1] C. Tyip (@christinetyip), X (Twitter), Jan. 23, 2026. https://x.com/christinetyip/status/2010776377931575569
[2] A. Dubois (@AryehDubois), X (Twitter), Jan. 2026. https://x.com/AryehDubois/status/2011742378655432791
[3] Mem0, "Mem0: The Memory Layer for Personalized AI," mem0.ai, 2025. https://mem0.ai
[4] N. Luhmann, Kommunikation mit Zettelkästen, Westdeutscher Verlag, 1981.
[5] A. Matuschak, "Evergreen notes," andymatuschak.org, 2019. https://notes.andymatuschak.org/Evergreen_notes
[6] T. Forte, Building a Second Brain, Atria Books, 2022.
[7] X. Wu et al., "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory," arXiv:2410.10813, Oct. 2024. https://arxiv.org/abs/2410.10813