Engineering, by Mahmoud Zalt
How to architect AI agent memory that survives weeks in production: the layers, the storage shapes, the wiring traps, and the cost ceilings.
If you have built anything agentic, you have hit the wall. The agent forgets what the user told it last week. It re-introduces itself every morning. It picks the wrong tool even though you corrected it three days ago. The instinct is to throw a bigger context window at the problem. That instinct is wrong, and the research backs it: at 32K tokens, models already ignore roughly 70 percent of the information in the middle of the window. Memory is not a context-length problem. It is an architecture problem.
We operate an AI workforce in production at Sistava. These AI employees work for the same employer for weeks, hand work between sprints, get corrected, get retrained, and get asked the same kind of question two months apart. After enough runtime, the naive approach (pick a vector store, dump everything in, hope) stopped scaling. Memory in a long-running agent is at least seven concerns, and each one fails differently when it is missing. This is the developer's map of that architecture.
A vector store solves one problem well: fuzzy semantic retrieval over text chunks. It treats every chunk as an island. That is fine for a stateless chatbot whose entire memory problem is "what was said earlier in this thread." It falls apart the moment your agent has a life longer than one session.
Three failure modes show up immediately. Edges between chunks (who did what for whom, what came before what, which entity is the same as which other) have nowhere to live. Temporal facts (true last year, false today) get flattened into a single point in space with no notion of when they were valid. And there is no transactional consistency, so when two agents update the same shared knowledge at once, you get search, not coherence. The cognitive science literature already split memory into episodic, semantic, and procedural decades ago. The CoALA paper formalized that split for language model agents. Production forces you to take it seriously.
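To make the temporal failure mode concrete, here is a minimal sketch of a fact stored as an edge with a validity interval, so "true last year, false today" stays answerable. The class, field names, and sample data are hypothetical illustrations, not the schema of any particular graph database or library.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class TemporalFact:
    """An edge between two entities plus the interval in which it held."""
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still true

    def holds_on(self, day: date) -> bool:
        # The interval is half-open: valid_from <= day < valid_to.
        if day < self.valid_from:
            return False
        return self.valid_to is None or day < self.valid_to


def facts_as_of(facts: List[TemporalFact], subject: str,
                relation: str, day: date) -> List[str]:
    """Return the objects that were true for (subject, relation) on a day."""
    return [
        f.obj for f in facts
        if f.subject == subject and f.relation == relation and f.holds_on(day)
    ]


# Toy data: the same question has a different answer depending on the date.
facts = [
    TemporalFact("acme", "uses_plan", "starter", date(2023, 1, 1), date(2024, 1, 1)),
    TemporalFact("acme", "uses_plan", "enterprise", date(2024, 1, 1)),
]
```

A plain vector store would hold both "starter" and "enterprise" chunks with no way to tell which one applies today; the interval on the edge is what carries that information.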
The layers are not a stack you traverse top to bottom. They run in parallel, and the hard part is wiring them so they do not contaminate one another. Here is each layer, what it holds, and the trap that comes with it.
Working memory
What the agent holds inside one turn: the plan-so-far, the tool result that just returned. Lives in the native context window. No persistent backing store. It lives, it dies, it is gone.
Conversation memory
The last N exchanges, persisted so the agent does not re-derive context every turn. A summariser middleware fires at a token threshold to compress the tail.
Episodic memory
Past runs, especially the failures. Time-indexed transcript plus an LLM summary. Indexed by time, thread, and outcome, not just semantic similarity.
Semantic memory
Stable facts about the user, business, and tools. Split into a hand-edited notebook (sovereign) and auto-extracted facts (noisy). When they disagree, the notebook wins.
Knowledge graph
Entities and the relationships between them, with temporal edges. Walk from 'the customer who churned' to 'the thread where they raised pricing' without re-reading everything.
Procedural memory
How this agent does its work. Habits and skills it reproduces, stored as path-routed skill files. Behaviour, not facts.
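The conversation layer's summariser can be sketched as a small function that fires once the history crosses a token threshold. This is a minimal sketch under stated assumptions: `estimate_tokens` is a crude stand-in for a real tokenizer, `summarise` is a caller-supplied callable (in practice an LLM call), the threshold values are arbitrary, and we assume "the tail" means the oldest exchanges.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: about four characters per token."""
    return max(1, len(text) // 4)


def compact_history(
    messages: List[str],
    summarise: Callable[[List[str]], str],
    token_threshold: int = 2000,
    keep_recent: int = 4,
) -> List[str]:
    """Replace older messages with one summary once the threshold is crossed."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")

    total = sum(estimate_tokens(m) for m in messages)
    if total <= token_threshold or len(messages) <= keep_recent:
        return list(messages)  # under budget: nothing to compress

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Hypothetical convention: mark the summary so downstream code can spot it.
    return ["[summary] " + summarise(older)] + recent
```

Keeping the most recent exchanges verbatim and compressing only what came before preserves the detail the agent is most likely to need on the next turn.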
The seventh layer sits underneath the other six. Checkpoints are a serialisable snapshot of where the agent was, deep enough that a forty-minute task killed at minute thirty-two resumes at minute thirty-three. Not the conversation history, which is layer two, but the agent's internal state: the current node in its workflow graph, the pending tool calls, the unwritten output. It is the difference between a background agent that crashes and starts over and one that survives a pod restart.
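A minimal sketch of step-level checkpointing follows, using the standard-library `sqlite3` module as a stand-in for Postgres or a workflow engine. The `CheckpointStore` class, the table layout, and the `crash_at` parameter (used only to simulate a restart) are hypothetical, and the state here is a plain JSON-serialisable dict rather than a real workflow graph.

```python
import json
import sqlite3
from typing import Callable, List, Optional, Tuple


class CheckpointStore:
    """Persists the latest state of each task after every completed step."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "task_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (task_id, step))"
        )

    def save(self, task_id: str, step: int, state: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (task_id, step, json.dumps(state)),
        )
        self.conn.commit()

    def latest(self, task_id: str) -> Optional[Tuple[int, dict]]:
        row = self.conn.execute(
            "SELECT step, state FROM checkpoints WHERE task_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (task_id,),
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))


def run_task(store: CheckpointStore, task_id: str,
             steps: List[Callable[[dict], dict]],
             crash_at: Optional[int] = None) -> dict:
    """Run steps in order, resuming after the last checkpointed step."""
    saved = store.latest(task_id)
    start, state = (saved[0] + 1, saved[1]) if saved else (0, {"done": []})
    for i in range(start, len(steps)):
        if crash_at == i:
            raise RuntimeError("simulated pod restart")
        state = steps[i](state)
        store.save(task_id, i, state)  # checkpoint after every step
    return state
```

Calling `run_task` again with the same `task_id` after a crash skips the steps that were already checkpointed. The store must outlive the process for this to help: an in-memory connection such as `sqlite3.connect(":memory:")` is fine for a demo but loses everything on the first restart.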
Not all of these belong in the same store, and not all of them want the same shape. The biggest mistake teams make is forcing all of memory into one substrate. Here is what each layer actually asks of you in production.
| Layer | Storage shape | Write trigger | Read pattern |
|---|---|---|---|
| Working | In-memory scratchpad | Per-turn | Native context window |
| Conversation | Append-only log plus summariser | Every message | Auto-loaded on each call |
| Episodic | Time-indexed transcript plus summary | After every message, background | Recency-weighted retrieval |
| Semantic, notebook | Single editable markdown file | Agent edits deliberately | Full text injected into prompt |
| Semantic, facts | Graph DB | Auto-extracted post-message | Entity-anchored search |
| Knowledge graph | Same graph DB, different edges | Auto-extracted with the facts | Walk entity to related node |
| Procedural | Markdown skill files, path-routed | Author or self-reflection | Loaded per task type |
| Checkpoints | KV, Postgres, or workflow engine | Every step | Resume on restart |
Conversation and checkpoints are happy in Postgres. The notebook, skills, and uploaded documents want a filesystem-style abstraction the agent can call edit and write on. Episodic and the knowledge graph share a graph database but write different label sets. Trained knowledge from external docs (Notion, Slack, Drive, URLs) runs a separate ingestion path into the same graph DB. At read time, all of it merges into a single context block before the agent ever sees it.
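The read-time merge can be sketched as one function that resolves conflicts between the hand-edited notebook and auto-extracted facts, then renders a single block. This is a simplified illustration: modelling the notebook as key-value pairs (rather than a free-form markdown file) and the section titles in the output are assumptions made for the example.

```python
from typing import Dict, List


def resolve_facts(notebook: Dict[str, str],
                  extracted: Dict[str, str]) -> Dict[str, str]:
    """Merge both fact sources; on a conflicting key the notebook wins."""
    # Later entries override earlier ones, so the notebook is unpacked last.
    return {**extracted, **notebook}


def build_context_block(notebook: Dict[str, str],
                        extracted: Dict[str, str],
                        recent_messages: List[str]) -> str:
    """Render one context block from the resolved facts and recent history."""
    facts = resolve_facts(notebook, extracted)
    lines = ["## Known facts"]
    lines += [f"- {key}: {facts[key]}" for key in sorted(facts)]
    lines.append("## Recent conversation")
    lines += recent_messages
    return "\n".join(lines)


block = build_context_block(
    notebook={"timezone": "UTC"},
    extracted={"timezone": "PST", "plan": "pro"},
    recent_messages=["user: can you resend the report?"],
)
```

Resolving precedence in one place, before rendering, keeps the "notebook wins" rule from being reimplemented differently by each layer that reads facts.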
The expensive, valuable layer is the knowledge graph, and the value lives in the pipeline that feeds it. After every qualifying message, the system runs a sequence: entity extraction, deduplication against existing nodes, relationship inference, contradiction detection per new fact, and entity summary updates.
That is five to nine LLM calls plus two to four embedding calls per ingested message. Do not write this logic yourself. Several open-source libraries sit on top of a graph database and handle the extraction, deduplication, relationship inference, and contradiction detection. We would have spent six months building the temporal logic by hand. Pick a library, point it at your graph DB, and move on. The consolidation step is where most of the gain lives: merge memories above a similarity threshold, deduplicate, and you can cut storage by more than half while raising retrieval quality.
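The consolidation step can be illustrated with a greedy deduplication pass over embeddings. This is a minimal sketch, not how any specific library does it: the 0.9 threshold, the toy two-dimensional vectors, and the choice to keep the first memory of each near-duplicate group (instead of merging their text) are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Memory = Tuple[str, Sequence[float]]  # (text, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(memories: List[Memory], threshold: float = 0.9) -> List[Memory]:
    """Keep a memory only if it is not too similar to one already kept."""
    kept: List[Memory] = []
    for text, vec in memories:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return kept


# Toy embeddings: the first two memories point in nearly the same direction.
memories = [
    ("prefers email over calls", [1.0, 0.0]),
    ("likes to be contacted by email", [0.99, 0.05]),
    ("renewal is due in March", [0.0, 1.0]),
]
deduplicated = consolidate(memories)
```

The pass is quadratic in the number of kept memories, which is acceptable for a per-entity batch; a larger store would typically use an approximate nearest-neighbour index for the similarity lookup.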
Naming seven layers is the easy part. Wiring them so they do not poison each other is where we lost the most time. Four rules earned their place the hard way.
Some layers are nearly free. The knowledge graph ingestion is not. Multiply five to nine LLM calls per message across every conversation across every employee, and this is the layer most likely to set fire to your runway. The math is brutal at scale: a thousand daily users at ten sessions each, prompting full context, runs well past thirty thousand dollars a month in input tokens alone. Memory done right is also a cost control, not just a quality feature, because retrieving two hundred relevant tokens beats injecting two hundred thousand.
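The cost arithmetic is easy to run for your own numbers. The sketch below takes the thousand users and ten sessions from the scenario above; the price of 3 dollars per million input tokens, the 30-day month, and the assumption of one 200,000-token prompt per session are placeholders, not figures from the original calculation.

```python
def monthly_input_cost(
    daily_users: int,
    sessions_per_user: int,
    tokens_per_session: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Monthly input-token spend in US dollars."""
    tokens = daily_users * sessions_per_user * tokens_per_session * days
    return tokens * usd_per_million_tokens / 1_000_000


ASSUMED_PRICE = 3.0  # placeholder USD per million input tokens

# Injecting a full 200,000-token context on every session.
full_context = monthly_input_cost(1000, 10, 200_000, ASSUMED_PRICE)

# Retrieving roughly 200 relevant tokens instead.
retrieved = monthly_input_cost(1000, 10, 200, ASSUMED_PRICE)
```

Under these placeholder values the full-context path comes to 180,000 dollars a month and the retrieval path to 180 dollars. The absolute numbers move with the price and context size you plug in, but the ratio between the two paths is simply the ratio of tokens injected.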
Per-tenant kill switch
A flag on the tenant row that disables auto-extraction entirely. Default off for trial accounts. Every unattended layer needs an off switch that does not require a redeploy.
Length-gated ingestion
Messages below a configurable character threshold never get ingested. Roughly 60 to 70 percent of tokens in a conversation are small talk or transient reasoning.
Cold-start short-circuit
If the graph has no Employer entity yet (fresh tenant), skip the embedding and reranker calls. Saves a call on every message during the cold-start window.
Cost attribution
Tag every memory-layer LLM call with a different action type from main-agent calls so the cost is visible in usage reports and attributable per layer.
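These controls compose naturally into one gate that runs before any memory-layer call is made. The sketch below is illustrative only: the `Tenant` fields, the 80-character default, and the `memory.ingest` action-type string are hypothetical names, and it assumes ingestion still proceeds during cold start with only the embedding and reranker calls skipped.

```python
from dataclasses import dataclass

# Hypothetical action type, kept distinct from main-agent calls for reporting.
MEMORY_INGEST_ACTION = "memory.ingest"


@dataclass
class Tenant:
    auto_extraction_enabled: bool = False  # default off, e.g. trial accounts
    has_employer_entity: bool = False      # False during the cold-start window


@dataclass(frozen=True)
class IngestDecision:
    ingest: bool
    run_embedding_and_rerank: bool
    action_type: str
    reason: str


def decide_ingestion(tenant: Tenant, message: str,
                     min_chars: int = 80) -> IngestDecision:
    """Apply the cheapest checks first so skipped messages cost nothing."""
    if not tenant.auto_extraction_enabled:
        return IngestDecision(False, False, MEMORY_INGEST_ACTION, "kill switch")
    if len(message.strip()) < min_chars:
        return IngestDecision(False, False, MEMORY_INGEST_ACTION,
                              "below length threshold")
    if not tenant.has_employer_entity:
        return IngestDecision(True, False, MEMORY_INGEST_ACTION,
                              "cold start: skip embedding and reranker")
    return IngestDecision(True, True, MEMORY_INGEST_ACTION, "full pipeline")
```

Because the kill switch is read from the tenant record on every call, flipping it takes effect without a redeploy, and the `reason` field gives usage reports something to group skipped messages by.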
If you would rather use this architecture than rebuild it, that is the whole point of the platform. The same layers described here run behind every AI employee you hire. The interface is narrower because the job is narrower, not because the memory is. You brief an employee like a junior teammate for a week, and by the third week you stop re-explaining things.
If we were starting over knowing what we know now, this is the sequence. Pick the layers before the framework, not the other way around.
Trained knowledge deserves its own ingestion path. Point the system at a website, a Notion space, a Drive folder, or a Slack workspace, and it crawls, digests, and folds the content into the same graph DB the conversation memory uses. From then on the agent is aware of it without searching the source again, and the training can re-run on a schedule to stay current. Two ingestion paths, one substrate, merged at read time.
AI agent memory is the set of systems that let an AI agent store and recall information across turns and sessions, so it retains context, learns preferences, and avoids re-deriving the same facts every time. In a production agent it is not one database but several layers (working, conversation, episodic, semantic, knowledge graph, procedural, and checkpoints), each with its own write rule and storage shape.
A vector store gives you fuzzy semantic retrieval over text chunks and treats every chunk as an island. It has nowhere to store relationships between entities, no concept of when a fact was valid, and no transactional consistency when multiple agents update shared knowledge. You need a graph database for relational and temporal memory, plus episodic and procedural layers a vector store cannot represent.
Episodic memory is the agent's history of specific events: what it did last Tuesday, which run failed and why. It is time-indexed. Semantic memory is stable facts about the world: the company name, the preferred tools, the user's preferences. Semantic facts are edited in place; episodes accumulate and get evicted on a schedule. Keeping them separate prevents a brainstorm transcript from being treated as truth.
Knowledge graph ingestion runs roughly five to nine LLM calls plus two to four embedding calls per qualifying message, for entity extraction, deduplication, relationship inference, contradiction detection, and summary updates. At scale this is the layer most likely to dominate cost, which is why length-gated ingestion, cold-start short-circuits, and per-tenant kill switches matter. Done right, memory still saves money overall because retrieving a few hundred relevant tokens beats injecting the full context.
Buy the temporal and extraction logic. Several open-source libraries handle entity extraction, deduplication, relationship inference, and contradiction detection on top of a graph database. Hand-rolling the temporal logic is roughly six months of work that the libraries already solve. If you would rather not build any of it, the full stack runs behind every Sistava AI employee you hire.
Conversation memory is the message history the agent reloads each turn. Checkpoints are the agent's internal execution state: the current node in its workflow graph, pending tool calls, and unwritten output. Checkpoints let a long task survive a crash or pod restart and resume mid-run. Using the in-memory dev checkpointer in production is the classic mistake, because the first restart loses everything.
Memory is not a feature you bolt on at the end. It is the substrate of an agent that gets better at your work over time instead of starting over every Monday. Name the seven concerns, give each one the storage shape it asks for, wire them so they do not contaminate each other, and cap the cost before you turn it on. That is the entire architecture under everything an agent does between turns.