Sistava

AI Agent Memory: A Developer's Guide to the Stack

Engineering — by Mahmoud Zalt

How to architect AI agent memory that survives weeks in production: the layers, the storage shapes, the wiring traps, and the cost ceilings.

If you have built anything agentic, you have hit the wall. The agent forgets what the user told it last week. It re-introduces itself every morning. It picks the wrong tool even though you corrected it three days ago. The instinct is to throw a bigger context window at the problem. That instinct is wrong, and the research backs it: at 32K tokens, models already ignore roughly 70 percent of the information in the middle of the window. Memory is not a context-length problem. It is an architecture problem.

We operate an AI workforce in production at Sistava. These AI employees work for the same employer for weeks, hand work between sprints, get corrected, get retrained, and get asked the same kind of question two months apart. After enough runtime, the naive approach (pick a vector store, dump everything in, hope) stopped scaling. Memory in a long-running agent is at least seven concerns, and each one fails differently when it is missing. This is the developer's map of that architecture.

At a Glance

7: Distinct memory layers, not one store
5-9: LLM calls per message for graph ingestion
70%: Of mid-window context the model ignores at 32K tokens
90%+: Token savings vs full-context prompting

Why a vector store alone is not enough

A vector store solves one problem well: fuzzy semantic retrieval over text chunks. It treats every chunk as an island. That is fine for a stateless chatbot whose entire memory problem is "what was said earlier in this thread." It falls apart the moment your agent has a life longer than one session.

Three failure modes show up immediately. Edges between chunks (who did what for whom, what came before what, which entity is the same as which other) have nowhere to live. Temporal facts (true last year, false today) get flattened into a single point in space with no notion of when they were valid. And there is no transactional consistency, so when two agents update the same shared knowledge at once, both writes land and a query can return either one: you get search, not coherence. The cognitive science literature split memory into episodic, semantic, and procedural decades ago. The CoALA paper formalized that split for language model agents. Production forces you to take it seriously.

The seven layers

These are not a stack you traverse top to bottom. They run in parallel, and the hard part is wiring them so they do not contaminate one another. Here is each layer, what it holds, and the trap that comes with it.

Working memory

What the agent holds inside one turn: the plan-so-far, the tool result that just returned. Lives in the native context window. No persistent backing store. It lives, it dies, it is gone.

Conversation memory

The last N exchanges, persisted so the agent does not re-derive context every turn. A summariser middleware fires at a token threshold to compress the tail.
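
A minimal sketch of the threshold-triggered summariser described above. The class name, the four-characters-per-token estimate, the default threshold, and the `summarise` callable (which would wrap an LLM call) are all assumptions for illustration, not the production middleware.

```python
from dataclasses import dataclass, field
from typing import Callable


def rough_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


@dataclass
class ConversationMemory:
    summarise: Callable[[list[str]], str]  # hypothetical: wraps an LLM call
    token_threshold: int = 2000            # assumed budget, tune per model
    keep_recent: int = 4                   # newest messages are never compressed
    summary: str = ""
    messages: list[str] = field(default_factory=list)

    def append(self, message: str) -> None:
        self.messages.append(message)
        total = sum(rough_token_count(m) for m in self.messages)
        if total > self.token_threshold and len(self.messages) > self.keep_recent:
            old = self.messages[:-self.keep_recent]
            self.messages = self.messages[-self.keep_recent:]
            # Fold the previous summary in so earlier context is not dropped.
            prior = [self.summary] if self.summary else []
            self.summary = self.summarise(prior + old)

    def context(self) -> list[str]:
        # What gets auto-loaded on each call: summary first, then recent turns.
        head = [f"Summary so far: {self.summary}"] if self.summary else []
        return head + self.messages
```

The design choice worth noting is that the summary is re-summarised together with the evicted messages, so the compressed history stays a single bounded string rather than a growing list of summaries.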

Episodic memory

Past runs, especially the failures. Time-indexed transcript plus an LLM summary. Indexed by time, thread, and outcome, not just semantic similarity.
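
One way to read "indexed by time, thread, and outcome" is to filter on outcome and then weight semantic similarity by age. The sketch below assumes an exponential half-life decay and a precomputed `similarity` score; both are illustrative choices, not something the layer prescribes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Episode:
    thread_id: str
    ended_at: datetime
    outcome: str        # e.g. "success" or "failure"
    summary: str        # the LLM summary stored next to the transcript
    similarity: float   # assumed: semantic similarity to the query, 0..1


def recency_weight(ended_at: datetime, now: datetime, half_life_days: float = 14.0) -> float:
    # The weight halves every `half_life_days`; the 14-day default is an assumption.
    age_days = max(0.0, (now - ended_at).total_seconds() / 86400)
    return 0.5 ** (age_days / half_life_days)


def rank_episodes(episodes: list[Episode], now: datetime,
                  outcome: Optional[str] = None, top_k: int = 3) -> list[Episode]:
    # Filter on outcome first, then order by similarity scaled by recency.
    pool = [e for e in episodes if outcome is None or e.outcome == outcome]
    pool.sort(key=lambda e: e.similarity * recency_weight(e.ended_at, now), reverse=True)
    return pool[:top_k]
```

Calling `rank_episodes(episodes, now, outcome="failure")` surfaces recent failed runs ahead of older ones that merely sound similar to the query.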

Semantic memory

Stable facts about the user, business, and tools. Split into a hand-edited notebook (sovereign) and auto-extracted facts (noisy). When they disagree, the notebook wins.
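
The "notebook wins" rule can be sketched as a merge in which hand-edited entries overwrite extracted ones. This assumes both sources have already been parsed into key-value pairs; the keys and values below are hypothetical.

```python
def resolve_facts(notebook: dict[str, str],
                  extracted: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Merge the two semantic sources; notebook entries win on conflict."""
    conflicts = sorted(
        key for key in notebook
        if key in extracted and extracted[key] != notebook[key]
    )
    merged = dict(extracted)
    merged.update(notebook)  # hand-edited values overwrite extracted ones
    return merged, conflicts


# Hypothetical example values.
notebook = {"company_name": "Sistava"}
extracted = {"company_name": "OldName Inc", "preferred_tracker": "Linear"}
facts, conflicts = resolve_facts(notebook, extracted)
```

Returning the conflicting keys alongside the merged view gives the operator a list of extracted facts that may need cleaning up in the graph.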

Knowledge graph

Entities and the relationships between them, with temporal edges. Walk from 'the customer who churned' to 'the thread where they raised pricing' without re-reading everything.
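
To make "temporal edges" concrete, here is a small in-memory sketch in which every edge carries a validity interval and lookups take an as-of date. The field names and the example entities are assumptions; a real graph database would hold these as relationship properties.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still valid

    def holds_on(self, day: date) -> bool:
        # Half-open interval: valid_from is included, valid_to is excluded.
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def neighbours(edges: list[Edge], source: str, relation: str, as_of: date) -> list[str]:
    # Walk one hop from `source`, keeping only edges valid on `as_of`.
    return [
        e.target for e in edges
        if e.source == source and e.relation == relation and e.holds_on(as_of)
    ]


# Hypothetical data: a customer who churned, and the thread where pricing came up.
edges = [
    Edge("customer:acme", "SUBSCRIBED_TO", "plan:pro", date(2024, 1, 1), date(2024, 9, 1)),
    Edge("customer:acme", "RAISED_PRICING_IN", "thread:4821", date(2024, 8, 12)),
]
```

Asking for `SUBSCRIBED_TO` as of mid-2024 returns the plan, while the same query as of 2025 returns nothing, which is the distinction a flat vector store cannot express.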

Procedural memory

How this agent does its work. Habits and skills it reproduces, stored as path-routed skill files. Behaviour, not facts.
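
"Path-routed" can be read as a routing table from task type to skill files. The sketch below assumes that reading; the route names and file paths are hypothetical.

```python
from pathlib import Path

# Hypothetical routing table: task type -> skill files under a skills root.
SKILL_ROUTES = {
    "invoice": ["finance/invoicing.md", "shared/tone.md"],
    "support_reply": ["support/replies.md", "shared/tone.md"],
}


def load_skills(root: Path, task_type: str) -> str:
    """Concatenate the skill files routed to this task type."""
    parts = []
    for rel in SKILL_ROUTES.get(task_type, []):
        path = root / rel
        if path.is_file():  # a missing skill file is skipped, not fatal
            parts.append(f"# skill: {rel}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

Only the skills for the current task type reach the prompt, so procedural memory grows without growing every call.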

The seventh layer sits underneath the other six. Checkpoints are a serialisable snapshot of where the agent was, deep enough that a forty-minute task killed at minute thirty-two resumes at minute thirty-three. Not the conversation history, which is layer two, but the agent's internal state: the current node in its workflow graph, the pending tool calls, the unwritten output. It is the difference between a background agent that crashes and starts over and one that survives a pod restart.
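
A minimal sketch of checkpoint-and-resume, assuming the internal state serialises to JSON. `CheckpointStore` is an in-memory stand-in used only so the example runs; in production it would be backed by a durable store, and the node names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentState:
    node: str                                   # current node in the workflow graph
    pending_tool_calls: list = field(default_factory=list)
    unwritten_output: str = ""
    step: int = 0


class CheckpointStore:
    """In-memory stand-in for a durable KV store or Postgres table."""

    def __init__(self):
        self._rows = {}

    def save(self, run_id: str, state: AgentState) -> None:
        self._rows[run_id] = json.dumps(asdict(state))

    def load(self, run_id: str):
        raw = self._rows.get(run_id)
        return AgentState(**json.loads(raw)) if raw is not None else None


def run(run_id: str, store: CheckpointStore, nodes: list, crash_at=None) -> AgentState:
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    state = store.load(run_id) or AgentState(node=nodes[0])
    while state.step < len(nodes):
        if crash_at is not None and state.step == crash_at:
            raise RuntimeError("simulated pod restart")
        state.unwritten_output += f"[{nodes[state.step]}]"
        state.step += 1
        state.node = nodes[state.step] if state.step < len(nodes) else "done"
        store.save(run_id, state)  # checkpoint after every completed step
    return state
```

Running once with `crash_at=2` and then again without it picks up at the third node instead of replaying the first two.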

Where each layer wants to live

Not all of these belong in the same store, and not all of them want the same shape. The biggest mistake teams make is forcing all of memory into one substrate. Here is what each layer actually asks of you in production.

| Layer | Storage shape | Write trigger | Read pattern |
|---|---|---|---|
| Working | In-memory scratchpad | Per-turn | Native context window |
| Conversation | Append-only log plus summariser | Every message | Auto-loaded on each call |
| Episodic | Time-indexed transcript plus summary | After every message, background | Recency-weighted retrieval |
| Semantic, notebook | Single editable markdown file | Agent edits deliberately | Full text injected into prompt |
| Semantic, facts | Graph DB | Auto-extracted post-message | Entity-anchored search |
| Knowledge graph | Same graph DB, different edges | Auto-extracted with the facts | Walk entity to related node |
| Procedural | Markdown skill files, path-routed | Author or self-reflection | Loaded per task type |
| Checkpoints | KV, Postgres, or workflow engine | Every step | Resume on restart |

Conversation and checkpoints are happy in Postgres. The notebook, skills, and uploaded documents want a filesystem-style abstraction the agent can call edit and write on. Episodic and the knowledge graph share a graph database but write different label sets. Trained knowledge from external docs (Notion, Slack, Drive, URLs) runs a separate ingestion path into the same graph DB. At read time, all of it merges into a single context block before the agent ever sees it.
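
The read-time merge can be sketched as assembling labelled sections in priority order under a size budget. The section names, the ordering, and the character budget are assumptions for illustration.

```python
def build_context_block(sections: dict[str, str], order: list[str],
                        max_chars: int = 4000) -> str:
    """Join non-empty sections in priority order until the budget is spent."""
    parts, used = [], 0
    for name in order:
        body = sections.get(name, "").strip()
        if not body:
            continue  # an empty layer contributes nothing
        chunk = f"[{name}]\n{body}"
        if used + len(chunk) > max_chars:
            break  # lower-priority sections are dropped first
        parts.append(chunk)
        used += len(chunk)
    return "\n\n".join(parts)


# Hypothetical section contents; the notebook goes first because it is sovereign.
sections = {
    "notebook": "Company name: Sistava",
    "graph": "",
    "episodes": "Last run failed on invoice export",
}
block = build_context_block(sections, ["notebook", "graph", "episodes"])
```

Putting the notebook first means a tight budget trims retrieved material before it trims the operator's own notes.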

The extraction pipeline

The expensive, valuable layer is the knowledge graph, and the value lives in the pipeline that feeds it. After every qualifying message, the system runs a sequence: entity extraction, deduplication against existing nodes, relationship inference, contradiction detection per new fact, and entity summary updates.

That is five to nine LLM calls plus two to four embedding calls per ingested message. Do not write this logic yourself. Several open-source libraries sit on top of a graph database and handle the extraction, deduplication, relationship inference, and contradiction detection. We would have spent six months building the temporal logic by hand. Pick a library, point it at your graph DB, and move on. The consolidation step is where most of the gain lives: merge memories above a similarity threshold, deduplicate, and you can cut storage by more than half while raising retrieval quality.
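
As an illustration of the consolidation step only, the sketch below merges memories whose embeddings exceed a cosine-similarity threshold using a greedy first-match pass. The 0.9 threshold and the greedy strategy are assumptions; an extraction library would handle this inside its own pipeline.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(memories: list[tuple[str, list[float]]], threshold: float = 0.9) -> list[dict]:
    """Greedy merge: keep the first memory of each cluster, fold near-duplicates into it."""
    kept: list[dict] = []
    for text, embedding in memories:
        for item in kept:
            if cosine(embedding, item["embedding"]) >= threshold:
                item["merged_from"].append(text)  # keep provenance of the duplicate
                break
        else:
            kept.append({"text": text, "embedding": embedding, "merged_from": []})
    return kept


# Toy two-dimensional embeddings, for illustration only.
memories = [
    ("Customer prefers email", [1.0, 0.0]),
    ("The customer likes to be emailed", [0.99, 0.05]),
    ("Invoices go out on the 1st", [0.0, 1.0]),
]
consolidated = consolidate(memories)
```

Here three stored memories collapse to two, and the surviving record remembers which phrasing it absorbed.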

The wiring problem

Naming seven layers is the easy part. Wiring them so they do not poison each other is where we lost the most time. Four rules earned their place the hard way.

  1. Episodic must not leak into semantic — If every line of yesterday's conversation gets extracted as a fact, the agent ends up believing a brainstorm transcript is the truth. Run extraction on summarised episodes, not raw transcripts, and raise the minimum-confidence threshold. Better to lose a fact than bake a wrong one into the graph.
  2. Conversation must not leak into the graph — The graph wants stable claims with provenance. Throwaway phrasing produces garbage edges that outlive the conversation that made them. Skip ingestion below a length threshold. A one-line 'thanks' is not worth a node.
  3. The notebook overrides extracted facts — Hand-edited semantic memory beats auto-extracted memory on conflict. The notebook is the operator's stake in the ground. If it says the company name is Sistava, no extracted edge should drag the agent back to the old name three weeks later.
  4. Checkpoints stay cheap — They have to fire on every step. If they get heavy, you stop firing them and lose the ability to resume long tasks. The fix is not fewer checkpoints, it is smaller ones. Store deltas, not full state, wherever you can.
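
Rule 4 can be sketched as a shallow diff between consecutive states, with resume implemented as replaying the deltas in order. This assumes the state is a flat dictionary of JSON-friendly values; nested state would need a deeper diff.

```python
_MISSING = object()


def diff_state(prev: dict, curr: dict) -> dict:
    """Record only the top-level keys that changed or disappeared."""
    changed = {k: v for k, v in curr.items() if prev.get(k, _MISSING) != v}
    removed = [k for k in prev if k not in curr]
    return {"set": changed, "unset": removed}


def apply_delta(state: dict, delta: dict) -> dict:
    out = dict(state)
    out.update(delta["set"])
    for key in delta["unset"]:
        out.pop(key, None)
    return out


def replay(base: dict, deltas: list[dict]) -> dict:
    # Resume by applying every stored delta, oldest first, to the base snapshot.
    state = dict(base)
    for delta in deltas:
        state = apply_delta(state, delta)
    return state
```

A practical variant writes a full snapshot every so often so that a resume never has to replay an unbounded chain of deltas.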

Cost is the real limit

Some layers are nearly free. The knowledge graph ingestion is not. Multiply five to nine LLM calls per message across every conversation across every employee, and this is the layer most likely to set fire to your runway. The math is brutal at scale: a thousand daily users at ten sessions each, prompting full context, runs well past thirty thousand dollars a month in input tokens alone. Memory done right is also a cost control, not just a quality feature, because retrieving two hundred relevant tokens beats injecting two hundred thousand.
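
The arithmetic behind that comparison can be sketched as a small calculator. The user and session counts and the two token figures come from the paragraph above; the price of 3 dollars per million input tokens, one model call per session, and a 30-day month are assumptions, so treat the outputs as orders of magnitude rather than a quote.

```python
def monthly_input_cost(users: int, sessions_per_user_per_day: int,
                       tokens_per_session: int, usd_per_million_tokens: float,
                       days: int = 30) -> float:
    """Input-token spend per month, assuming one model call per session."""
    tokens = users * sessions_per_user_per_day * days * tokens_per_session
    return tokens / 1_000_000 * usd_per_million_tokens


PRICE = 3.0  # assumed USD per million input tokens; substitute your model's rate

full_context = monthly_input_cost(1_000, 10, 200_000, PRICE)
retrieved = monthly_input_cost(1_000, 10, 200, PRICE)
savings = 1 - retrieved / full_context
```

Under these assumptions full-context prompting costs 180,000 dollars a month against 180 dollars for retrieval, a token saving of 99.9 percent. Different prices or context sizes move the totals, but the ratio depends only on the two token counts.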

Tenant-level kill switch

A flag on the tenant row that disables auto-extraction entirely. Default off for trial accounts. Every unattended layer needs an off switch that does not require a redeploy.

Length-gated ingestion

Messages below a configurable character threshold never get ingested. Roughly 60 to 70 percent of tokens in a conversation are small talk or transient reasoning.

Short-circuit cold start

If the graph has no Employer entity yet (fresh tenant), skip the embedding and reranker calls. Saves a call on every message during the cold-start window.

Separate billing line

Tag every memory-layer LLM call with a different action type from main-agent calls so the cost is visible in usage reports and attributable per layer.
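
The four controls above compose naturally into one gate that runs before any extraction call. The field names, the 80-character threshold, and the action-type string below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    id: str
    auto_extraction_enabled: bool = False  # kill switch, default off (e.g. trials)
    min_ingest_chars: int = 80             # assumed length threshold
    has_employer_entity: bool = False      # False on a fresh tenant


def ingestion_decision(tenant: Tenant, message: str) -> dict:
    if not tenant.auto_extraction_enabled:
        return {"ingest": False, "reason": "kill_switch"}
    if len(message.strip()) < tenant.min_ingest_chars:
        return {"ingest": False, "reason": "too_short"}
    return {
        "ingest": True,
        # Cold start: with no Employer entity yet, skip embedding and reranker calls.
        "run_embedding_and_rerank": tenant.has_employer_entity,
        # Separate billing line: memory calls carry their own action type.
        "action_type": "memory.graph_ingest",
    }
```

Because the kill switch is read from the tenant row on every message, flipping it takes effect without a redeploy.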

If you would rather use this architecture than rebuild it, that is the whole point of the platform. The same layers described here run behind every AI employee you hire. The interface is narrower because the job is narrower, not because the memory is. You brief an employee like a junior teammate for a week, and by the third week you stop re-explaining things.

The build order we would take again

If we were starting over knowing what we know now, this is the sequence. Pick the layers before the framework, not the other way around.

Trained knowledge deserves its own ingestion path. Point the system at a website, a Notion space, a Drive folder, or a Slack workspace, and it crawls, digests, and folds the content into the same graph DB the conversation memory uses. From then on the agent is aware of it without searching the source again, and the training can re-run on a schedule to stay current. Two ingestion paths, one substrate, merged at read time.

Frequently asked questions

What is AI agent memory?

It is the set of systems that let an AI agent store and recall information across turns and sessions, so it retains context, learns preferences, and avoids re-deriving the same facts every time. In a production agent it is not one database but several layers (working, conversation, episodic, semantic, knowledge graph, procedural, and checkpoints), each with its own write rule and storage shape.

Why is a vector store not enough for agent memory?

A vector store gives you fuzzy semantic retrieval over text chunks and treats every chunk as an island. It has nowhere to store relationships between entities, no concept of when a fact was valid, and no transactional consistency when multiple agents update shared knowledge. You need a graph database for relational and temporal memory, plus episodic and procedural layers a vector store cannot represent.

What is the difference between episodic and semantic memory?

Episodic memory is the agent's history of specific events: what it did last Tuesday, which run failed and why. It is time-indexed. Semantic memory is stable facts about the world: the company name, the preferred tools, the user's preferences. Semantic facts are edited in place; episodes accumulate and get evicted on a schedule. Keeping them separate prevents a brainstorm transcript from being treated as truth.

How expensive is knowledge-graph memory to run?

Ingestion runs roughly five to nine LLM calls plus two to four embedding calls per qualifying message, for entity extraction, deduplication, relationship inference, contradiction detection, and summary updates. At scale this is the layer most likely to dominate cost, which is why length-gated ingestion, cold-start short-circuits, and per-tenant kill switches matter. Done right, memory still saves money overall because retrieving a few hundred relevant tokens beats injecting the full context.

Should I build agent memory myself or buy it?

Buy the temporal and extraction logic. Several open-source libraries handle entity extraction, deduplication, relationship inference, and contradiction detection on top of a graph database. Hand-rolling the temporal logic is roughly six months of work that the libraries already solve. If you would rather not build any of it, the full stack runs behind every Sistava AI employee you hire.

How do checkpoints differ from conversation memory?

Conversation memory is the message history the agent reloads each turn. Checkpoints are the agent's internal execution state: the current node in its workflow graph, pending tool calls, and unwritten output. Checkpoints let a long task survive a crash or pod restart and resume mid-run. Using the in-memory dev checkpointer in production is the classic mistake, because the first restart loses everything.

Memory is not a feature you bolt on at the end. It is the substrate of an agent that gets better at your work over time instead of starting over every Monday. Name the seven concerns, give each one the storage shape it asks for, wire them so they do not contaminate each other, and cap the cost before you turn it on. That is the entire architecture under everything an agent does between turns.