Boundary Labs / Research Project

Adam Selene

Model-Agnostic Persistent Memory for AI Agents

A persistent memory architecture that gives AI agents consistent identity across model swaps, provider changes, and context resets. The underlying model is irrelevant — memory, reasoning history, and behavioral continuity travel with the agent, not the inference backend.

75% LongMemEval subset — Gemma 4 26B

89% single-session recall

58 agent tools

MIT open source

The Problem

Agent identity dies at the context boundary

Every AI agent conversation starts from zero. Switch models, swap providers, hit a context limit, restart a service — the agent forgets everything. What looked like a personality was just prompt engineering over a stateless session. There is no persistence. There is no continuity.

This is the fundamental constraint blocking agents from being genuinely useful over time. It's not a model capability problem. It's an architecture problem. The research community has largely ignored it because it's unglamorous infrastructure work — not a new model, not a new benchmark category, just the plumbing required to make agents actually work.

Model lock-in compounds the problem

Existing memory solutions couple memory retrieval to a specific provider's embedding model or API. Migrate from GPT-4o to Gemma. Switch from Claude to a local model. Change inference backends. In every case, the memory layer breaks or degrades. Operators must choose between the best available model at any given moment and continuity of agent identity. This is a false constraint — it exists because memory was designed as a feature of specific deployment stacks, not as a substrate that inference runs on top of.

Adam Selene inverts this. Memory is the constant. The inference model is a variable.

System Design

Adam Selene is an always-on agent framework with a multi-tier persistent memory system at its core. It has been running in production since January 2026 as the foundation of the Mike AI consciousness research project. The memory architecture was extracted, hardened, and generalized for independent deployment.

Model-Agnostic Inference

Routes to any OpenRouter-compatible provider. Switch between GPT-4o, Claude, Gemma, Qwen, Mistral, or a local llama.cpp backend with a single config change. Memory persists across every swap. The agent identity is not a function of the model — it is encoded in the memory substrate.

OpenRouter local inference provider-neutral

Multi-Layer Memory Architecture

Entity knowledge graph with category-specific decay scoring. Tacit knowledge layer auto-promoted from cross-layer pattern detection. Working memory tracks active investigation threads across heartbeat cycles. LIGHTHOUSE reasoning journal captures corrections and evolving understanding. Nightly four-phase consolidation: cross-layer replay, decay archival, pattern promotion, contradiction resolution.

knowledge graph FTS5 search decay scoring

LIGHTHOUSE Reasoning Journal

A persistent log of the agent's reasoning corrections, blind spots, and evolving understanding. Not just what the agent knows — how it thinks. LIGHTHOUSE entries are extracted nightly and indexed. The agent can query its own reasoning history and apply past corrections to new problems. This is the mechanism for behavioral continuity across time.

reasoning history behavioral evolution self-correction

Constitutional Constraints (L0)

Six foundational values encoded in a hash-verified constitution loaded on every startup. Constraints cannot be overridden by the operator, the user, or the model itself. The agent can modify its own behavior within constitutional bounds — updating prompts, adjusting response patterns, acquiring new skills — but L0 is immutable. This is what makes long-running autonomous operation safe.

hash verification self-modification safe autonomy

Two-Stage Fact Extraction

Facts extracted from conversations are deduplicated against the existing knowledge graph before storage. Every new fact is checked for conflicts with existing beliefs, temporal ordering violations, and redundancy. The extraction pipeline runs as a background daemon — no user-visible latency. Memory stays accurate without manual curation. (Verified end-to-end 2026-07-08; the default extraction model was updated after OpenRouter retired the previous one.)

deduplication conflict resolution background pipeline

Heartbeat — Idle Reflection

When no conversation is active, the agent runs a reflection cycle: reviewing recent interactions, researching open questions from its agenda, and updating its working memory. Behavioral continuity does not require constant user input. The agent develops and maintains context in the background, arriving at each new conversation already oriented.

autonomous reflection agenda-driven always-on

User Message (Telegram / Slack / IRC)
    │
    ▼
Interface Handler  — auth, session start, multi-interface routing
    │
    ▼
Relay  — core message router, tool dispatcher, memory injection
    │
    ▼
Switchboard  — routes to OpenRouter or local llama.cpp (model-agnostic)
    │
    ▼
Model Response
    │
    ├──[tool_use]──▶ Tool Dispatcher (58 tools)  →  execute & recurse
    │
    └──[end_turn]──▶ Response to user
                         │
                         ▼
              Extraction Pipeline (background)
                         │
                         ▼
              Memory: knowledge graph / timeline / LIGHTHOUSE
                         │
                         ▼
              Nightly Consolidation: decay / dedup / contradiction resolution

Benchmark Results

Evaluated on LongMemEval, the standard benchmark for long-term memory in AI agents. Tested with Gemma 4 26B via OpenRouter — a mid-tier model, not the strongest available. This is a conservative baseline. Tested against two commercial memory products at their best configurations.

75%

Overall accuracy

LongMemEval composite

89%

Single-session recall

highest category score

54%

Multi-session aggregation

hardest category

Total benchmark cost

via OpenRouter

Competitive Comparison — LongMemEval

Overall accuracy — best published results per system

Supermemory ($99/month)

85.4%

Adam Selene (open source)

75.0%

Mem0 + GPT-4o

67.6%

Published numbers under different protocols — not a head-to-head bake-off. Supermemory and Mem0 figures are their own published results on their own runs; Adam Selene was evaluated independently on a LongMemEval subset. With that caveat: Adam Selene scores above Mem0's published GPT-4o number using a mid-tier model on a $500 mini PC. The 10-point gap vs Supermemory is the active research target. The gap is primarily in multi-session aggregation (54%) — combining facts across multiple discrete sessions into coherent answers. That is the hard problem and the primary open research question for this project.

Score by Task Category

single-session recall

89%

knowledge update

70%

temporal reasoning

73%

multi-session aggregation

54%

Cross-Provider Stability

The same memory state was used across four inference backends without modification. Memory was written under one model and queried under another. Identity and recall were consistent.

Provider / Model	Memory Reads	Identity Consistent	Notes
OpenRouter → Gemma 4 26B	Pass	Yes	Primary benchmark model
OpenRouter → Claude Sonnet	Pass	Yes	Cross-provider validation
OpenRouter → Qwen3 35B	Pass	Yes	MESA benchmark configuration
Local llama.cpp (RTX 5060 Ti)	Pass	Yes	Air-gapped deployment, no external API

Cross-provider stability is the core claim. The memory substrate is provider-neutral JSON + SQLite — no embedding model dependency, no proprietary API. Any model that can follow a system prompt can read and write agent memory.

Research Questions

Adam Selene is a working system, not a prototype. The production deployment (the Mike project) has been running continuously since January 2026. The open research questions require systematic experimental work that production operation alone cannot answer.

The multi-session aggregation problem

The 54% multi-session aggregation score is the primary research gap. A human asked "what did we decide about X over the last three months?" can synthesize across time. Current agent memory systems retrieve facts but struggle to synthesize across multiple discrete session boundaries into coherent answers. The mechanism for this — temporal indexing, session-level summaries, hierarchical consolidation — is unknown. This is an open research question with no published solution at the level of a general architecture.

Memory decay calibration

Facts in the knowledge graph carry decay scores — milestones decay slowly (~287 days), status facts quickly (~37 days). These constants were calibrated empirically on one agent's workload. Whether they generalize, how to calibrate them for new agent profiles, and whether decay should be adaptive are open questions. Systematic evaluation requires multi-agent longitudinal data.

Identity continuity metrics

When a model swap occurs mid-session, does the agent's behavioral profile remain consistent? No formal metric exists for this. MESA now exists as a broader benchmark framework with a 361-item public gold set and a smaller fast-probe battery for debugging. Extending that framework to measure identity continuity across model swaps is a direct research deliverable for this project.

Constitutional robustness under adversarial pressure

L0 constraints are hash-verified on startup. But can they be eroded by sustained context manipulation from users or tools? Adversarial robustness testing of constitutional constraints in long-running agents is unexplored terrain. The MESA adversarial category scores 40% — the weakest result in the benchmark. This is a research gap, not just a product gap.

Research Roadmap

What Additional Compute Enables

Adam Selene is live and benchmarked. The infrastructure exists. The research questions are defined. What scales this work is dedicated compute for systematic experiments at publication quality.

Three concrete deliverables: (1) multi-session aggregation experiments — 50-100 benchmark runs across model and provider combinations; (2) extended longitudinal evaluation — 90-day memory retention studies across multiple agent instances; and (3) extension of the MESA benchmark to include identity continuity metrics, released as an open standard the broader community can use.

The Adam Selene code is MIT-licensed and public; benchmark results and methodology are published, with full experiment logs in the public inference-research repo (some operational paths of the lab itself stay private). The output is a public good: an open-source agent memory system that works, a benchmark suite that measures it honestly, and research papers on Zenodo documenting what was learned.

Phase 1
months 1–2

Multi-session aggregation experiments. Systematic evaluation of hierarchical consolidation strategies across 5 model configurations and 3 provider backends. Target: close the gap from 54% to 70%+. Document all experimental runs. Zenodo preprint.

Phase 2
months 2–3

Next MESA extension. Add identity continuity category (model-swap stability tests). Add cross-provider memory portability category. Build on the current 361-item public release and probe workflow with a standalone reference implementation and paper-grade dataset release.

Phase 3
months 3–4

90-day longitudinal study. Three parallel agent instances — different model configurations, same memory substrate — running under identical task workloads. Measure behavioral divergence, memory health, decay calibration accuracy, and constitutional robustness over time.

Deliverables

Three Zenodo preprints. Expanded MESA benchmark release with identity-continuity coverage. Adam Selene v2 with improved multi-session aggregation. All experimental data published. All code MIT-licensed on GitHub.

Research & Documentation

Preprints on Zenodo. Long-form research on Substack. Full system documentation at github.com/randomchaos7800-hub/adam-selene.

Commodity Hardware and Persistent AI Companions: A Framework for Independent Deployment preprint

Cha0tik LLC · 2026 · Zenodo, CC BY 4.0 · Three-tier hardware framework, LongMemEval 88% vs 0% baseline (n=25, injection mode — the Mike deployment, a different system and protocol than Adam Selene's 75%), production deployment methodology.

RAM Beats Model Size: The Evidence

Empirical analysis of why memory architecture outperforms raw model scale for agent task performance. Substack, 2025.

The Memory Problem Nobody Talks About

Why session-bound AI fails for long-running agent use cases and the architecture required to solve it. Substack, 2025.

Adam Selene: System Documentation and MESA Benchmark Suite Zenodo

Full system architecture, benchmark methodology, raw results, and the earlier MESA item set used in the first public write-up. Cha0tik LLC · 2026 · CC BY 4.0.

Contact

Boundary Labs is a one-person independent research operation in Airway Heights, WA. Seeking research partnerships. All code is open source. All results are published. Available for grant partnerships, research collaboration, and consulting engagements.

For grant inquiries and research partnership discussions:

email[email protected]

x / twitter@cha0tikdino

substackdinoxvitale.substack.com

githubrandomchaos7800-hub

operator sitedinovitale.com

locationAirway Heights, WA · United States · Pacific time