# AI Agent Memory: A Developer's Guide to the Stack

*Engineering, 2026-06-27, by Mahmoud Zalt*

How to architect AI agent memory that survives weeks in production: the layers, the storage shapes, the wiring traps, and the cost ceilings.

**TL;DR.** AI agent memory is not one database. A long-running agent needs distinct layers (working, conversation, episodic, semantic, knowledge graph, procedural, checkpoints), each with its own write rule, lifetime, and read path. Dump everything in a vector store and the agent confidently forgets the one fact that mattered. We run this stack across an AI workforce at Sistava, and this is how the pieces fit.

If you have built anything agentic, you have hit the wall. The agent forgets what the user told it last week. It re-introduces itself every morning. It picks the wrong tool even though you corrected it three days ago. The instinct is to throw a bigger context window at the problem. That instinct is wrong, and the research backs it: at 32K tokens, models already ignore roughly 70 percent of the information in the middle of the window. Memory is not a context-length problem. It is an architecture problem.

We operate an AI workforce in production at Sistava. These AI employees work for the same employer for weeks, hand work between sprints, get corrected, get retrained, and get asked the same kind of question two months apart. After enough runtime, the naive approach (pick a vector store, dump everything in, hope) stopped scaling. Memory in a long-running agent is at least seven concerns, and each one fails differently when it is missing. This is the developer's map of that architecture.
## At a Glance

- **7** distinct memory layers, not one store
- **5-9** LLM calls per message for graph ingestion
- **70%** of mid-window context the model ignores at 32K tokens
- **90%+** token savings vs full-context prompting

## Why a vector store alone is not enough

A vector store solves one problem well: fuzzy semantic retrieval over text chunks. It treats every chunk as an island. That is fine for a stateless chatbot whose entire memory problem is "what was said earlier in this thread." It falls apart the moment your agent has a life longer than one session.

Three failure modes show up immediately. Edges between chunks (who did what for whom, what came before what, which entity is the same as which other) have nowhere to live. Temporal facts (true last year, false today) get flattened into a single point in space with no notion of when they were valid. And there is no transactional consistency, so when two agents update the same shared knowledge at once, you get search, not coherence.

The cognitive science literature already split memory into episodic, semantic, and procedural decades ago. The CoALA paper formalized that split for language model agents. Production forces you to take it seriously.

## The seven layers

These are not a stack you traverse top to bottom. They run in parallel, and the hard part is wiring them so they do not contaminate one another. Here is each layer, what it holds, and the trap that comes with it.

### Working memory

What the agent holds inside one turn: the plan-so-far, the tool result that just returned. Lives in the native context window. No persistent backing store. It lives, it dies, it is gone.

### Conversation memory

The last N exchanges, persisted so the agent does not re-derive context every turn. A summariser middleware fires at a token threshold to compress the tail.

### Episodic memory

Past runs, especially the failures. Time-indexed transcript plus an LLM summary.
Indexed by time, thread, and outcome, not just semantic similarity.

### Semantic memory

Stable facts about the user, business, and tools. Split into a hand-edited notebook (sovereign) and auto-extracted facts (noisy). When they disagree, the notebook wins.

### Knowledge graph

Entities and the relationships between them, with temporal edges. Walk from 'the customer who churned' to 'the thread where they raised pricing' without re-reading everything.

### Procedural memory

How this agent does its work. Habits and skills it reproduces, stored as path-routed skill files. Behaviour, not facts.

The seventh layer sits underneath the other six. Checkpoints are a serialisable snapshot of where the agent was, deep enough that a forty-minute task killed at minute thirty-two resumes at minute thirty-three. Not the conversation history, which is layer two, but the agent's internal state: the current node in its workflow graph, the pending tool calls, the unwritten output. It is the difference between a background agent that crashes and starts over and one that survives a pod restart.

## Where each layer wants to live

Not all of these belong in the same store, and not all of them want the same shape. The biggest mistake teams make is forcing all of memory into one substrate. Here is what each layer actually asks of you in production.

| Layer | Storage shape | Write trigger | Read pattern |
|---|---|---|---|
| Working | In-memory scratchpad | Per-turn | Native context window |
| Conversation | Append-only log plus summariser | Every message | Auto-loaded on each call |
| Episodic | Time-indexed transcript plus summary | After every message, background | Recency-weighted retrieval |
| Semantic, notebook | Single editable markdown file | Agent edits deliberately | Full text injected into prompt |
| Semantic, facts | Graph DB | Auto-extracted post-message | Entity-anchored search |
| Knowledge graph | Same graph DB, different edges | Auto-extracted with the facts | Walk entity to related node |
| Procedural | Markdown skill files, path-routed | Author or self-reflection | Loaded per task type |
| Checkpoints | KV, Postgres, or workflow engine | Every step | Resume on restart |

Conversation and checkpoints are happy in Postgres. The notebook, skills, and uploaded documents want a filesystem-style abstraction the agent can call edit and write on. Episodic and the knowledge graph share a graph database but write different label sets. Trained knowledge from external docs (Notion, Slack, Drive, URLs) runs a separate ingestion path into the same graph DB. At read time, all of it merges into a single context block before the agent ever sees it.

## The extraction pipeline

The expensive, valuable layer is the knowledge graph, and the value lives in the pipeline that feeds it. After every qualifying message, the system runs a sequence: entity extraction, deduplication against existing nodes, relationship inference, contradiction detection per new fact, and entity summary updates. That is five to nine LLM calls plus two to four embedding calls per ingested message.

Do not write this logic yourself. Several open-source libraries sit on top of a graph database and handle the extraction, deduplication, relationship inference, and contradiction detection.
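To put the per-message call counts in perspective, here is a minimal sketch that scales them to a batch of qualifying messages. Only the 5 to 9 and 2 to 4 ranges come from the text above; the function name, the `CallBudget` type, and the 10,000-message volume are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

# Per-message ranges quoted in the text above.
LLM_CALLS_PER_MESSAGE = (5, 9)
EMBEDDING_CALLS_PER_MESSAGE = (2, 4)


@dataclass(frozen=True)
class CallBudget:
    """Min/max call counts for a batch of ingested messages (hypothetical type)."""
    llm_min: int
    llm_max: int
    embedding_min: int
    embedding_max: int


def ingestion_call_budget(qualifying_messages: int) -> CallBudget:
    """Scale the per-message ranges to a number of qualifying messages."""
    if qualifying_messages < 0:
        raise ValueError("qualifying_messages must be non-negative")
    return CallBudget(
        llm_min=qualifying_messages * LLM_CALLS_PER_MESSAGE[0],
        llm_max=qualifying_messages * LLM_CALLS_PER_MESSAGE[1],
        embedding_min=qualifying_messages * EMBEDDING_CALLS_PER_MESSAGE[0],
        embedding_max=qualifying_messages * EMBEDDING_CALLS_PER_MESSAGE[1],
    )


if __name__ == "__main__":
    # 10,000 qualifying messages per day is an assumed volume, not a measured one.
    print(ingestion_call_budget(10_000))
```

Multiplying the result by your own per-call prices gives a rough ceiling for this layer before any gating is applied.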
We would have spent six months building the temporal logic by hand. Pick a library, point it at your graph DB, and move on. The consolidation step is where most of the gain lives: merge memories above a similarity threshold, deduplicate, and you can cut storage by more than half while raising retrieval quality.

**Temporal edges, not deletes.** A fact is true at a point in time, not forever. Track `valid_at` and `invalid_at` on every edge. When today's conversation contradicts yesterday's extracted fact, mark the old edge invalid instead of deleting it. The agent can answer 'when did they switch jobs' and you keep the audit trail. Deleting on contradiction throws away exactly the data a temporal query needs.

## The wiring problem

Naming seven layers is the easy part. Wiring them so they do not poison each other is where we lost the most time. Four rules earned their place the hard way.

1. **Episodic must not leak into semantic.** If every line of yesterday's conversation gets extracted as a fact, the agent ends up believing a brainstorm transcript is the truth. Run extraction on summarised episodes, not raw transcripts, and raise the minimum-confidence threshold. Better to lose a fact than bake a wrong one into the graph.
2. **Conversation must not leak into the graph.** The graph wants stable claims with provenance. Throwaway phrasing produces garbage edges that outlive the conversation that made them. Skip ingestion below a length threshold. A one-line 'thanks' is not worth a node.
3. **The notebook overrides extracted facts.** Hand-edited semantic memory beats auto-extracted memory on conflict. The notebook is the operator's stake in the ground. If it says the company name is Sistava, no extracted edge should drag the agent back to the old name three weeks later.
4. **Checkpoints stay cheap.** They have to fire on every step. If they get heavy, you stop firing them and lose the ability to resume long tasks. The fix is not fewer checkpoints, it is smaller ones.
Store deltas, not full state, wherever you can.

## Cost is the real limit

Some layers are nearly free. The knowledge graph ingestion is not. Multiply five to nine LLM calls per message across every conversation across every employee, and this is the layer most likely to set fire to your runway. The math is brutal at scale: a thousand daily users at ten sessions each, prompting full context, runs well past thirty thousand dollars a month in input tokens alone. Memory done right is also a cost control, not just a quality feature, because retrieving two hundred relevant tokens beats injecting two hundred thousand.

### Tenant-level kill switch

A flag on the tenant row that disables auto-extraction entirely. Default off for trial accounts. Every unattended layer needs an off switch that does not require a redeploy.

### Length-gated ingestion

Messages below a configurable character threshold never get ingested. Roughly 60 to 70 percent of tokens in a conversation are small talk or transient reasoning.

### Short-circuit cold start

If the graph has no Employer entity yet (fresh tenant), skip the embedding and reranker calls. Saves a call on every message during the cold-start window.

### Separate billing line

Tag every memory-layer LLM call with a different action type from main-agent calls so the cost is visible in usage reports and attributable per layer.

**Bake in the kill switch before you turn it on.** Every layer that runs unattended will break at the worst possible time, and the first thing you will want to do is stop the bleeding without shipping a fix. A per-tenant toggle, a config flag, anything. Build the off switch before the on switch.

If you would rather use this architecture than rebuild it, that is the whole point of the platform. The same layers described here run behind every AI employee you hire. The interface is narrower because the job is narrower, not because the memory is.
You brief an employee like a junior teammate for a week, and by the third week you stop re-explaining things.

## The build order we would take again

If we were starting over knowing what we know now, this is the sequence. Pick the layers before the framework, not the other way around.

- Map the seven concerns to your product before writing a line of code. Layers first, framework second.
- Postgres for conversation history and checkpoints. Boring is correct here.
- A path-routed key-value or filesystem abstraction for the notebook, skills, and documents, so the agent treats it like a real filesystem.
- A graph database paired with an off-the-shelf temporal knowledge-graph library, so contradictions invalidate old edges instead of deleting them.
- A vector store for trained-document embeddings. pgvector is enough if you are already on Postgres and want to keep dependencies tight.
- A durable workflow engine for long-running tasks, for free checkpointing at every activity boundary. Never the in-memory dev checkpointer in production.
- Kill switches and per-tenant gates from day one, not month two.

Trained knowledge deserves its own ingestion path. Point the system at a website, a Notion space, a Drive folder, or a Slack workspace, and it crawls, digests, and folds the content into the same graph DB the conversation memory uses. From then on the agent is aware of it without searching the source again, and the training can re-run on a schedule to stay current. Two ingestion paths, one substrate, merged at read time.

## Frequently asked questions

### What is AI agent memory?

It is the set of systems that let an AI agent store and recall information across turns and sessions, so it retains context, learns preferences, and avoids re-deriving the same facts every time.
In a production agent it is not one database but several layers (working, conversation, episodic, semantic, knowledge graph, procedural, and checkpoints), each with its own write rule and storage shape.

### Why is a vector store not enough for agent memory?

A vector store gives you fuzzy semantic retrieval over text chunks and treats every chunk as an island. It has nowhere to store relationships between entities, no concept of when a fact was valid, and no transactional consistency when multiple agents update shared knowledge. You need a graph database for relational and temporal memory, plus episodic and procedural layers a vector store cannot represent.

### What is the difference between episodic and semantic memory?

Episodic memory is the agent's history of specific events: what it did last Tuesday, which run failed and why. It is time-indexed. Semantic memory is stable facts about the world: the company name, the preferred tools, the user's preferences. Semantic facts are edited in place; episodes accumulate and get evicted on a schedule. Keeping them separate prevents a brainstorm transcript from being treated as truth.

### How expensive is knowledge-graph memory to run?

Ingestion runs roughly five to nine LLM calls plus two to four embedding calls per qualifying message, for entity extraction, deduplication, relationship inference, contradiction detection, and summary updates. At scale this is the layer most likely to dominate cost, which is why length-gated ingestion, cold-start short-circuits, and per-tenant kill switches matter. Done right, memory still saves money overall because retrieving a few hundred relevant tokens beats injecting the full context.

### Should I build agent memory myself or buy it?

Buy the temporal and extraction logic. Several open-source libraries handle entity extraction, deduplication, relationship inference, and contradiction detection on top of a graph database.
Hand-rolling the temporal logic is roughly six months of work that the libraries already solve. If you would rather not build any of it, the full stack runs behind every Sistava AI employee you hire.

### How do checkpoints differ from conversation memory?

Conversation memory is the message history the agent reloads each turn. Checkpoints are the agent's internal execution state: the current node in its workflow graph, pending tool calls, and unwritten output. Checkpoints let a long task survive a crash or pod restart and resume mid-run. Using the in-memory dev checkpointer in production is the classic mistake, because the first restart loses everything.

Memory is not a feature you bolt on at the end. It is the substrate of an agent that gets better at your work over time instead of starting over every Monday. Name the seven concerns, give each one the storage shape it asks for, wire them so they do not contaminate each other, and cap the cost before you turn it on. That is the entire architecture under everything an agent does between turns.

**Tags:** ai-agents, memory, architecture, knowledge-graph, vector-store, agentic-ai