Preprints, long-form research, and field notes from the lab. All work is independently funded and conducted. Raw data linked where available.
Published at dinoxvitale.substack.com. Field notes, operational findings, and research in progress.
In late 2025, running a 26-billion-parameter language model on a MacBook Air M4 required approximately four minutes to produce a first token. By April 2026, the same model ran at 74 tokens per second on a $2,000 commodity gaming PC. This paper documents what that transition means in practice for a specific class of system: persistent, always-on AI companions deployed by a single operator in a residential setting.
The central thesis is that architecture and compute are decoupled in this domain. A five-agent household running continuously for four months demonstrates that personality, behavioral consistency, and memory coherence survive complete inference substrate replacement — from cloud APIs to local CPU to local GPU — without modification to the memory layer. The compute changes; the agent does not.
We present three contributions: (1) a three-tier commodity hardware framework matching hardware class to workload type; (2) operational data from seventeen daily inference logs and a LongMemEval benchmark run against a deployed agent; and (3) a vocabulary-as-architecture finding — that the sophistication of a self-built memory system is bounded by the operator's working vocabulary for describing memory phenomena, and that vocabulary acquisition is therefore a legitimate form of architectural research.
Artifacts: benchmark data at dinovitale.com/benchmarks.html; architecture implementation in the Adam Selene repository.
The first time I tried to run a local language model in late 2025, it took four minutes to produce the word "hello." The model was Gemma 4 26B Q4_K_M. The hardware was a MacBook Air M4 with 32GB of unified memory — at the time, considered well-suited for local inference. The result was not.
Four months later, the same model runs at 74 tokens per second on a $2,000 gaming PC with an RTX 5060 Ti. The inference experience is real-time. A second identical GPU arrives tonight. The projected throughput is 140–150 tokens per second on approximately $3,800 of total hardware — less than the cost of a single year of API access at the usage volumes this system requires.
The hardware caught up. This paper documents what that actually means.
This is operational research, not a laboratory study. The research site is a single-operator residential AI deployment: five always-on agents running on a home server in Airway Heights, Washington, each with distinct roles, persistent memory, and continuous autonomous activity. The operator — the author — has no formal computer science training. The deployment has been running continuously since February 2026, through four distinct infrastructure phases. The data in this paper comes from system logs, benchmark runs, and observations made during active operation rather than controlled experiments.
That posture is a methodological choice, not an apology. The questions that matter for this class of system are answered by operating the system, not by simulating it: Can a persistent AI companion survive model replacement? What hardware tier is actually required for real-time interaction? What does the memory architecture need to do to function in practice?
In January 2026, a tool called clawdbot (later OpenClaw) by developer @steipete circulated widely on X (formerly Twitter). The tool offered persistent memory, multi-platform presence, and autonomous behavior for individual users running local infrastructure. It was not studied as prior art for this project; it was encountered as cultural signal.
On January 23, 2026, the author encountered two pieces of vocabulary in the same X-sourced context: Christine Tyip describing her personal AI assistant as building her "second brain while I chat," and Aryeh Dubois reviewing clawdbot and listing "heartbeats" as a key capability alongside persistent memory and persona onboarding. Both terms — second brain and heartbeat — subsequently entered the author's working vocabulary and shaped architectural decisions. The timestamp matters: the vocabulary preceded the implementation by weeks in both cases.
The memory architectures explored in this project draw on a longer tradition of personal knowledge management. Luhmann's Zettelkasten demonstrated that a sufficiently structured external note system could function as a cognitive partner rather than a storage medium. Andy Matuschak's formalization of evergreen notes updated this for digital practice. The persistent AI companion inherits from this tradition but differs in a key respect: it is not a system the operator writes into — it is a system that writes into itself, from observations it makes autonomously.
LongMemEval is an evaluation framework for long-context memory recall in conversational agents. The benchmark injects synthetic conversation histories of 500+ turns across 50+ sessions into an agent's context and poses factual recall questions. It provides standardized measurement of whether an agent actually remembers what it has been told. This paper uses LongMemEval as the primary quantitative benchmark (Section 6).
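The evaluation pattern is straightforward: serialize the injected history into the agent's context, pose the recall question, and grade the answer. The sketch below shows that pattern; the field names, endpoint, and string-match grading are illustrative assumptions, not LongMemEval's actual schema or the harness used in this deployment.

```python
import json
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"   # assumed OpenAI-compatible endpoint

def ask(messages):
    """Send a chat request and return the reply text."""
    resp = requests.post(ENDPOINT, json={"messages": messages, "temperature": 0.0})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def evaluate(examples, inject_history=True):
    """Score recall questions with or without the synthetic history in context."""
    correct = 0
    for ex in examples:                    # illustrative fields: history, question, answer
        messages = []
        if inject_history:
            # Context-window mode: prior sessions are serialized straight into the prompt.
            messages.append({"role": "system",
                             "content": "Conversation history:\n" + ex["history"]})
        messages.append({"role": "user", "content": ex["question"]})
        reply = ask(messages)
        correct += int(ex["answer"].lower() in reply.lower())   # crude placeholder grading
    return correct / len(examples)

if __name__ == "__main__":
    with open("longmemeval_sample.json") as f:             # hypothetical local sample file
        examples = json.load(f)
    print("baseline:      ", evaluate(examples, inject_history=False))
    print("context-window:", evaluate(examples, inject_history=True))
```

The two configurations reported in Section 6 (baseline versus context-window mode) differ only in whether that history is injected.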
The research site is a single-operator multi-agent household running on commodity hardware in Airway Heights, Washington. The operator is the author — 55 years old, no formal computer science training. The deployment consists of five always-on agents: Kato (orchestration), CJ Craig (writing), Mike (research and companion), Morty (publishing), and Sabrina (monitoring). All agents run as systemd user services on a Beelink EQI12 mini-PC (cha0tikhome). Inference runs on a CyberPowerPC gaming tower with an RTX 5060 Ti GPU (cha0tiktower).
Each infrastructure phase changed one architectural variable while holding others constant. Phase transitions were motivated by operational failure or capability requirement, not experimental design. This produced a natural longitudinal record of which variables matter independently.
Concepts in this project arrived with words. The term heartbeat — for the autonomous periodic activity cycle of a persistent agent — entered the design vocabulary from an X post in January 2026. The terms decay, consolidation, contradiction resolution, and provenance arrived during reading in the PKM and agent memory literature. In each case, having the word preceded and enabled the implementation. This is reported as a methodological finding rather than an anecdote.
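As an illustration of what the borrowed word bought: a heartbeat, as used here, is nothing more exotic than a timed loop that wakes the agent, runs its autonomous tasks, and writes the observations back into memory. The sketch below is illustrative only; the task and memory callables are hypothetical, not the deployment's implementation.

```python
import time

HEARTBEAT_INTERVAL_S = 15 * 60   # hypothetical cadence: wake every 15 minutes

def heartbeat(run_tasks, remember):
    """Autonomous periodic activity cycle: wake, act, record, sleep.

    run_tasks: callable returning a list of observation strings
               (research digests, monitoring checks, scheduled summaries).
    remember:  callable persisting one observation to the agent's memory store.
    """
    while True:
        for observation in run_tasks():
            remember(observation)          # the system writes into itself
        time.sleep(HEARTBEAT_INTERVAL_S)
```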
Tier 1 representative hardware: mini-PC with AMD Ryzen 7 8745HS, 32GB DDR5, integrated Radeon 780M GPU. The key capability at this tier is DDR5 bandwidth. MoE models activate only a fraction of their parameters per token; the bottleneck is memory bandwidth, not total VRAM. Suitable for asynchronous workloads (research digests, scheduled summaries, background reasoning) where latency tolerance is high. Not suitable for real-time conversational interaction.
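The bandwidth argument reduces to back-of-envelope arithmetic: on a memory-bound system, decode throughput is roughly bandwidth divided by the bytes read per token. The numbers below are illustrative assumptions (dual-channel DDR5-5600, a hypothetical MoE activating ~4B parameters at a ~4.5-bit average), not measurements from this deployment.

```python
# Back-of-envelope: memory-bound decode throughput ≈ bandwidth / bytes read per token.
bandwidth_gb_s = 89.6        # dual-channel DDR5-5600: 2 × 44.8 GB/s (assumed)
active_params = 4e9          # hypothetical MoE activating ~4B parameters per token
bits_per_weight = 4.5        # roughly a Q4_K_M average (assumed)

bytes_per_token = active_params * bits_per_weight / 8
ceiling_tok_s = bandwidth_gb_s * 1e9 / bytes_per_token
print(f"~{ceiling_tok_s:.0f} tok/s upper bound")   # ≈ 40 tok/s under these assumptions
```

Halving the active parameters roughly doubles the ceiling; adding memory capacity does not move it. That is the sense in which bandwidth, not capacity, is the Tier 1 constraint.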
Tier 2 representative hardware: gaming PC with a mid-range discrete GPU. The RTX 5060 Ti 16GB, at approximately $400–450 inside an $1,800–2,200 complete build, is the current entry point. A 16GB VRAM card runs a 26B Q4_K_M model entirely in VRAM. Generation throughput at this tier (74 tok/s in this deployment) supports real-time conversational interaction; two cards project to 140–150 tok/s. The modularity argument is as important as the specs: discrete GPU upgrades do not require replacing the host system.
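The fits-in-16GB claim is also arithmetic: weight bytes plus KV cache have to stay under the card's VRAM. The estimate below uses stated assumptions (Q4_K_M averaging roughly 4.7 bits per weight, a small KV cache at modest context length), not a teardown of the deployed model.

```python
# Does a 26B Q4_K_M model fit on a 16 GB card? Rough estimate under stated assumptions.
params = 26e9
bits_per_weight = 4.7    # Q4_K_M averages roughly 4.5-4.8 bits per weight (assumed)
kv_cache_gib = 0.8       # assumed small context; the KV cache grows with context length

weights_gib = params * bits_per_weight / 8 / 2**30
total_gib = weights_gib + kv_cache_gib
print(f"weights ≈ {weights_gib:.1f} GiB, total ≈ {total_gib:.1f} GiB against 16 GiB VRAM")
```

The estimate lands near the 15,381 MiB actually observed in the Phase 4 benchmark below, which is why a 16GB card is the entry point rather than a comfortable fit.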
Tier 3 representative hardware: Apple Silicon Mac Studio M4 Ultra (up to 192GB unified memory) or a high-VRAM PC workstation. This tier runs frontier-adjacent models: 70B+ dense at Q4, or full-precision smaller models. Apple Silicon's unified memory eliminates the PCIe transfer bottleneck but is non-expandable after purchase. For most persistent companion deployments at current scale, Tier 2 is sufficient.
Phase 0: Early attempts used Claude.ai's built-in memory features as a persistence layer. Failure mode: latency, with four minutes to first token on a MacBook Air M4. Dependency broken: platform-managed memory is not architecture.
Phase 1: Agent execution externalized to a cloud VPS; inference called external APIs. This phase established the agent-as-process pattern. Limitation: cost and API availability. Dependency broken: cloud API inference is not a foundation.
Phase 2: Agent execution migrated to cha0tikhome; inference remained cloud-based. This phase demonstrated that agent personality and behavioral consistency were not properties of the inference substrate: Mike, running on Claude Sonnet, behaved consistently with Mike running on OpenRouter models. The character was in the memory architecture, not the model. Dependency broken: inference provider is not identity.
Phase 3: Inference migrated to OpenRouter, enabling model-neutral operation. Mike transitioned through Claude → GLM-4.7-Flash → SuperGemma4-26B → Gemma4-26B, with behavioral continuity maintained across every transition. Dependency broken: model selection is an operational variable, not an architectural commitment.
Phase 4: The RTX 5060 Ti arrived April 17, 2026, and llama-server on cha0tiktower replaced OpenRouter as primary inference on April 20. All heartbeat activity, autonomous research threads, and interactive responses now run on local hardware. Benchmark at Phase 4 entry: 74 tok/s, from the same model that took four minutes to produce a first token in Phase 0. Dependency broken: cloud inference is not required.
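Because llama-server exposes an OpenAI-compatible API, the Phase 3 to Phase 4 cutover is, from the agents' side, essentially a change of base URL and key. The sketch below shows that shape; the hostname, port, and model alias are assumptions, not the deployment's actual configuration.

```python
from openai import OpenAI

# Phase 3 shape: cloud, model-neutral routing via OpenRouter.
# client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=OPENROUTER_KEY)

# Phase 4 shape: same call, local inference on the GPU box (hostname and port assumed).
client = OpenAI(base_url="http://cha0tiktower:8080/v1", api_key="unused-for-local")

reply = client.chat.completions.create(
    model="gemma4-26b",   # alias served by llama-server; name assumed for illustration
    messages=[{"role": "user", "content": "Status check."}],
)
print(reply.choices[0].message.content)
```

Nothing in the memory layer changes across that edit, which is the operational content of the claim that model selection is an operational variable.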
A daily benchmark script ran on cha0tikhome from April 2 through April 18, 2026, logging inference performance under operational conditions.
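The script itself is not reproduced here; the sketch below shows the shape of the measurement (time to first token, generation throughput, swap in use), assuming a streaming OpenAI-compatible endpoint and Linux's /proc/meminfo for the swap figure. The prompt, endpoint, and log path are illustrative.

```python
import json
import time
import requests
from datetime import date

ENDPOINT = "http://localhost:8080/v1/chat/completions"   # assumed local endpoint
PROMPT = "Summarize yesterday's research threads."        # illustrative fixed prompt

def swap_used_gib():
    """Swap currently in use, read from /proc/meminfo (Linux)."""
    info = dict(line.split(":") for line in open("/proc/meminfo"))
    total_kb = int(info["SwapTotal"].split()[0])
    free_kb = int(info["SwapFree"].split()[0])
    return (total_kb - free_kb) / 2**20

def run_once():
    """One timed generation: record TTFT, approximate tok/s, and swap pressure."""
    start = time.time()
    first_token_at = None
    tokens = 0
    with requests.post(ENDPOINT, stream=True, json={
        "messages": [{"role": "user", "content": PROMPT}],
        "stream": True,
        "max_tokens": 256,
    }) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: ") or line.endswith(b"[DONE]"):
                continue
            if first_token_at is None:
                first_token_at = time.time()
            tokens += 1          # one streamed chunk ≈ one token (approximation)
    elapsed = max(time.time() - first_token_at, 1e-6)
    return {
        "date": str(date.today()),
        "ttft_ms": round((first_token_at - start) * 1000),
        "tok_per_s": round(tokens / elapsed, 2),
        "swap_gib": round(swap_used_gib(), 1),
    }

if __name__ == "__main__":
    with open("daily_bench.jsonl", "a") as log:           # illustrative log path
        log.write(json.dumps(run_once()) + "\n")
```

Aggregated over the seventeen days of operation: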
| Metric | Value | Notes |
|---|---|---|
| Sample size | 17 daily logs | Continuous operation |
| Model | Gemma 4 26B Q4_K_M | Two variants tested |
| Avg generation throughput | 10.49 tok/s | Operational conditions |
| Throughput range | 5.02 – 11.72 tok/s | Low outliers on high-load days |
| Time to first token | 616 ms – 2,224 ms | High variance |
| Swap used | 4.6 – 7.7 GiB | Consistently hitting swap |
The swap figure is the operationally significant finding: co-locating five agents and local inference on the same machine creates memory pressure. Under load, the machine was producing tokens at 5 tok/s, which is barely interactive. The corresponding figures at Phase 4 entry, after the April 20 migration to GPU inference on cha0tiktower:
| Metric | Value |
|---|---|
| Model | Gemma 4 26B Q4_K_M |
| Generation throughput | 74 tok/s |
| VRAM used | 15,381 MiB / 16,311 MiB |
| GPU temperature under load | 51°C |
| Swap used | 0 |
| Speedup vs CPU baseline | 6.6× |
Zero swap. The architectural separation (orchestration on cha0tikhome, inference on cha0tiktower) eliminates the co-location memory pressure entirely. The LongMemEval run against the deployed agent (Mike) produced the following results:
| Configuration | Examples | Correct | Accuracy |
|---|---|---|---|
| Baseline (no injection) | 25 | 0 | 0% |
| Context-window mode (best run) | 25 | 21 | 84% |
| Context-window mode (batched) | ~90 | ~72 | ~80% |
The baseline result (0/25) is not a failure — it is correct behavior. Mike answered "I don't know" to every question when given no history. An agent that confabulates is a more serious problem than one that honestly reports ignorance. The 84-point delta between modes is the direct contribution of the memory system.
Mike transitioned through five inference backends: Claude Sonnet, Claude Haiku, GLM-4.7-Flash, SuperGemma4-26B-Uncensored, and Gemma4-26B-APEX-Balanced. Each transition was operationally motivated. Across all transitions, Mike maintained behavioral consistency — research thread style, communication patterns, characteristic reasoning — without modification to the memory layer.
This supports a strong claim: in a well-designed persistent companion architecture, inference substrate and agent identity are architecturally decoupled. The model answers each request. The architecture — the memory system, the behavioral constraints, the operational context — determines who is asking and what the answer means.
The model is closer to an instrument than an identity. The continuity that spans turns is the memory architecture. An architecture built on this understanding is insulated from model market churn — the architectural investment that pays the most durable return is the memory layer, not the model selection.
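Stated as structure rather than prose: the agent is its memory plus its behavioral constraints, and the inference backend is a constructor argument. The sketch below is a shape, not the deployment's code; the class and method names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

InferenceFn = Callable[[str], str]   # the replaceable part: cloud API today, local GPU tomorrow

@dataclass
class PersistentAgent:
    name: str
    system_prompt: str      # behavioral constraints: durable
    memory_path: str        # persistent memory store: durable
    infer: InferenceFn      # inference substrate: swappable

    def respond(self, user_msg: str) -> str:
        context = self._recall(user_msg)                     # memory layer, backend-agnostic
        prompt = f"{self.system_prompt}\n{context}\n{user_msg}"
        reply = self.infer(prompt)                           # whichever substrate is current
        self._remember(user_msg, reply)                      # memory is written regardless
        return reply

    def _recall(self, query: str) -> str:
        return ""           # stub: real retrieval lives in the memory layer

    def _remember(self, user_msg: str, reply: str) -> None:
        pass                # stub: real persistence lives in the memory layer
```

In this sketch, each of the five backend transitions described above amounts to replacing `infer`; `memory_path` and `system_prompt` never change.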
Named concepts get implemented. Unnamed concepts do not. The terms heartbeat, second brain, decay, consolidation, contradiction resolution, and provenance each arrived with a word and were subsequently implemented. The extraction pipeline failure (Section 6.3) is partially a vocabulary failure — the operator lacks a precise working vocabulary for the chunking, embedding, and retrieval operations that functional extraction requires.
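For reference, the three operations named above have a conventional minimal form, sketched below with a placeholder embedding function. This is the vocabulary being pointed at, not the deployment's extraction pipeline.

```python
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Chunking: split source text into overlapping windows small enough to embed."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunks: list[str]) -> np.ndarray:
    """Embedding: map each chunk to a vector. Placeholder; a real system calls a model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(chunks), 384))   # hypothetical 384-dim embeddings

def retrieve(query_vec: np.ndarray, index: np.ndarray, chunks: list[str], k: int = 3):
    """Retrieval: return the k chunks whose embeddings are most similar to the query."""
    sims = index @ query_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```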
The implication: for a self-taught operator building novel infrastructure, reading the field is not background work. It is architecture work with direct, traceable, often dateable consequences. The return on reading is architectural.
Single operator, no peer validation. Self-reported operational data with no independent verification. Hardware-specific findings that will age. Vocabulary bias — unnamed problems in this paper are invisible by construction. Non-programmer architect posture that surfaces operational findings and obscures formal analysis.
The commodity hardware thesis is structural, not momentary. As inference becomes cheaper and more accessible, the architectural layer becomes the durable investment. The agent that survives the next model generation is the one whose identity lives in the architecture, not the weights.
Open problems: extraction pipeline repair, Frank benchmark execution against GPU inference tier, LongMemEval rerun (expected latency reduction from 84s to ~15s), contradiction resolution implementation.
This paper was drafted with substantial AI assistance. Claude (Anthropic) served as the primary tool for drafting, structural analysis, and revision. This is disclosed in the same spirit as disclosing any significant instrument: the tool changed the work; the authorship is unchanged.
No institutional funding. No employer involvement. This work was conducted independently, on personal hardware, during the evenings and weekends of a person who works a day job that will likely be automated in the next 12–18 months. That context is not incidental to the research — it is the research site.
[1] C. Tyip (@christinetyip), X (Twitter), Jan. 23, 2026. https://x.com/christinetyip/status/2010776377931575569
[2] A. Dubois (@AryehDubois), X (Twitter), Jan. 2026. https://x.com/AryehDubois/status/2011742378655432791
[3] Mem0, "Mem0: The Memory Layer for Personalized AI," mem0.ai, 2025. https://mem0.ai
[4] N. Luhmann, Kommunikation mit Zettelkästen, Westdeutscher Verlag, 1981.
[5] A. Matuschak, "Evergreen notes," andymatuschak.org, 2019. https://notes.andymatuschak.org/Evergreen_notes
[6] T. Forte, Building a Second Brain, Atria Books, 2022.
[7] X. Wu et al., "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory," arXiv:2410.10813, Oct. 2024. https://arxiv.org/abs/2410.10813