# AI Agent in Production: The 7-Layer Memory Architecture Behind Siatava

*Engineering — 2026-05-18 — by Mahmoud Zalt*

We operate an AI workforce of around a thousand employees in production. Memory underpins everything they do. After extensive runtime experience, it crystallizes into seven distinct concerns. Conflate them and the agent confidently forgets, or worse, fabricates, the one detail that mattered.

How do you make an AI agent actually remember.

It is the question we keep getting asked. Every variant of it. Why does it forget what I told it last week. Why does it re-introduce itself every morning. Why does it pick the wrong tool even though I corrected it three days ago. Why can I never get it to feel like it actually knows me.

We run an AI workforce of around a thousand AI employees in production across teams, marketing, sales, support, ops. These employees do not get the polite reset that a chatbot gets at the end of a session. They work for the same employer for weeks. They sit on long-running sprints. They get corrected, they get retrained, they get asked the same kind of question two months apart. Memory is the layer underneath everything they do.

After enough time operating them, the obvious answer, pick a vector store, dump everything in, hope for the best, stopped working. Memory in a long-running agent is not one thing. It is at least seven. Each fails differently when it is missing. This article is the version of the architecture we landed on, written for someone who is about to build the same thing and would prefer not to do it twice.

## Why a chatbot's memory is not enough

Most agent-memory advice on the internet is written for chatbots. A chatbot's memory problem is small: remember what was said earlier in this conversation, maybe a little user profile. The window is short. The stakes are low.

An AI employee operates differently. It works for the same employer for weeks. It hands work between sprints. It writes drafts that get reviewed and revised. It remembers that you do not like exclamation marks. It learns that one of your customers churned last quarter and the reason had nothing to do with price. It can be wrong about something on Monday and corrected on Tuesday, and on Wednesday it should not be confidently wrong again about the same thing.

That set of requirements does not fit inside one storage system. The academic literature was already here, the CoALA paper (Princeton, 2023) formalised the episodic / semantic / procedural split from cognitive science for language model agents. CoALA outlines modular memory components (working memory as a short-term scratchpad, plus long-term episodic for experiences, semantic for facts, and procedural for skills), structured action spaces, and decision-making loops.

## The seven layers

Each layer has its own write rule, its own lifetime, its own read path. They are not a stack. They run in parallel, and they have to be wired so they do not contaminate one another. This mirrors CoALA's emphasis on distinct memory modules to avoid interference while enabling retrieval, reasoning, and learning actions.

- Working memory, what the agent holds inside a single turn.
- Conversation memory, the last N exchanges, summarised as the window fills.
- Episodic memory, past runs, especially the ones that failed.
- Semantic memory, stable facts about the user, the business, the tools.
- Knowledge graph, how entities connect, so a two-week-old question lands on the right thread.
- Procedural memory, habits the agent has formed about its own work.
- Checkpoints, snapshots deep enough that a long task can be killed and resumed.

## 1. Working memory

What the agent is holding inside a single turn. The plan-so-far. The tool result that just came back. The intermediate reasoning the agent does not need to keep after the turn ends. Cheapest layer. Lives in the agent's own context window or in a per-turn variable somewhere in the runtime. In CoALA terms, this is the active symbolic variables persisting across a single decision cycle.

What we learned the hard way: do not let working memory leak. We had agents that were treating per-turn scratchwork as durable knowledge, writing pseudo-facts about themselves into long-term storage from a half-formed thought. The fix is a hard wall. Working memory has no persistent backing store, no auto-flush to anywhere else. It lives, it dies, it is gone.

### What works

- Keep it in the agent's native context, the model's own working buffer, not an external store.
- Treat tool outputs as scratch unless they are explicitly committed to a higher layer.
- Never auto-promote anything from working memory to long-term storage.

## 2. Conversation memory

The last few exchanges. The context the agent should not have to re-derive every turn. Skip this layer and the agent introduces itself every message, and at our scale (every employer talks to multiple employees, multiple times a day), that re-derivation cost is real.

The model will not roll its own context. Most modern agent frameworks ship a checkpointer that auto-loads thread history on every invocation. Pick from a handful of mature options, there are at least three solid choices in this space today. That is the minimum bar. The interesting part is what happens when the window fills.

**Summariser trigger.** Run a summariser middleware that fires when the live conversation crosses a token threshold. It compresses the old turns into a single system message and keeps the tail intact. The next call sees a shorter, denser context. Most agent middlewares ship something equivalent; the principle works regardless of framework.

Persisted in Postgres. The framework's checkpointer writes a thread row on every step and the agent reloads it on every invocation. The whole layer is essentially free until the summariser fires, and the summariser cost is bounded because it only runs at threshold crossings.

## 3. Episodic memory

Past runs. Past sprints. The successful ones and especially the failed ones. "Last Tuesday I tried to send this email, the SMTP server timed out, and I switched to the queue." The agent's history of itself. This aligns directly with CoALA's episodic memory for logging decision cycles and outcomes.

This is what most people skip when they bolt a vector store on and call it done. A vector store gives you fuzzy retrieval over text chunks. Episodic memory gives you the continuity of an agent that knows what it did yesterday. The two can share storage but the index has to be temporal, by time, by thread, by outcome, not just by semantic similarity.

The literature talks about four stages for episodic memory: encoding (capturing an event with context), retrieval (pulling relevant episodes back into working memory), consolidation (turning accumulated episodes into durable semantic knowledge), and eviction (managing what gets dropped when storage fills). We touch all four in production, and consolidation is by far the hardest one to get right, it is where episodic memory crosses over into semantic memory, and where the contamination problems start.

### How we store episodes

- Raw conversation transcript per exchange, compressed, keyed by thread + timestamp.
- LLM-generated summary alongside, for the queries that ask 'have I done this before'.
- Retention policy: rolling window. Episodes older than a configured horizon get truncated to summaries only.
- Eviction is a cron job, not something the agent decides at write time.

## 4. Semantic memory

Stable facts about the user, the business, the tools, the customers. Slow-changing. "The product is called Atlas." "The employer prefers terse replies." "The team uses Linear, not Jira." Edited in place, not appended. This corresponds to CoALA's semantic memory for factual, conceptual knowledge.

This layer has two halves that look the same but behave nothing alike. The first half is a file the agent edits by hand, a markdown notebook, scoped per employee, that grows by deliberate writes. The second half is facts an LLM extracts automatically after every conversation, stored as structured triples in a graph. Both are "semantic" in the academic sense. Operationally, they cannot share a store.

**The most useful split we made.** Separating the file the agent writes its own notes in from the facts an LLM extracts after every message. They look similar. They behave nothing alike. Mix them and the agent will confidently re-derive a wrong fact you have already corrected by hand. The notebook is sovereign. The extracted facts are noisy. If they disagree, the notebook wins.

The split also gives the human a place to intervene. When a customer tells us "the AI keeps getting my company name wrong," we can fix it in the notebook in one line, and the wrong fact in the extracted graph gets out-voted from then on. Without the split, the correction would compete with the noise on equal footing.

## 5. Knowledge graph

Entities and the relationships between them. So that when the employer asks a question two weeks later, the agent can walk from "the customer who churned" to "the email thread where they raised the pricing concern" without re-reading everything. This augments semantic memory with relational structure and temporal awareness.

A vector store cannot do this. A vector store treats every chunk as an island. Edges between chunks, who did what for whom, what came before what, which entity is the same as which other, those are what a graph database is for. Pick from three solid options: Neo4j, Memgraph, KuzuDB. The one we run is the one you will see in the screenshots throughout this piece.

The non-obvious part is temporal awareness. A fact is true at a point in time, not forever. "The employer works at Acme" was true last year, not now. A good knowledge graph layer tracks valid_at and invalid_at on every edge (or uses native temporal graph capabilities), so when today's conversation contradicts yesterday's extracted fact, the old edge gets marked invalid instead of deleted. The agent can answer "when did they switch jobs" without losing the audit trail. Recent guidance on temporal agents with knowledge graphs emphasizes this for real-world evolving contexts.

### Do not write the temporal logic yourself

Several open-source libraries sit on top of a graph DB and handle the LLM-driven extraction, deduplication, relationship inference, and contradiction detection for you. There are at least three worth evaluating in this space today. After every message, the pipeline fires entity extraction, deduplicates against existing nodes, extracts relationships, runs contradiction detection per new fact, and updates entity summaries. Five to nine LLM calls per ingested message, plus two to four embedding calls. Expensive, but it buys you a layer of cognition you cannot fake with retrieval-augmented chunks.

We would have spent six months writing the temporal logic ourselves if we had picked the build path. Strongly recommend the buy path. Pick a library, point it at your graph DB, move on.

If reading this from the buyer side rather than the builder side, the same knowledge graph layer is what makes our AI workforce hold a stable picture of your business across weeks. Pick a team below and the layered memory described above starts working on your behalf, same architecture, narrower interface.

## 6. Procedural memory

How this specific agent does its work. Patterns it has formed about itself. "When I write a marketing post, I draft, take a pass for tone, then a pass for length." "When I see a CSV in chat, I first check for header consistency." Not facts about the world. Habits the agent has learned. This draws from cognitive science and recent work like Mem^p on distilling trajectories into reusable skills and scripts.

Procedural memory lives in skill files. A skill is a markdown document the agent loads on demand when it recognises a relevant task. Steps to take, mistakes to avoid, format to follow. Some are authored by the operator. Some are written by the agent itself after a reflection step where it summarises what worked. Reflection-based updates have shown strong gains in benchmarks for procedural refinement.

The difference between procedural and semantic: semantic memory is content the agent recalls. Procedural memory is behaviour the agent reproduces. Do not mix them. A fact like "the employer prefers terse replies" is semantic. A pattern like "when drafting a reply, take a pass for length" is procedural. The first goes in the notebook. The second goes in a skill file.

## 7. Checkpoints

Underneath everything. A serialisable snapshot of where the agent was, deep enough that a forty-minute task can be killed at minute thirty-two and resume at minute thirty-three. Not the conversation history (that is Layer 2). The agent's internal state, the current node in its workflow graph, the pending tool calls, the unwritten output.

This is the difference between a background agent that crashes and starts over from scratch, and a long-running agent that survives a pod restart. The implementation is boring on purpose, a key-value table keyed by thread id, written once per step, restored on restart. The interesting part is making the writes cheap enough to happen on every step without dominating agent latency.

We run the agent workflows on Temporal. That gives us free checkpointing at every activity boundary. If you are not on a durable workflow engine, the agent framework you chose almost certainly ships a Postgres-backed checkpointer that does the equivalent for the graph state. The mistake people make is using the in-memory dev variant in production. The first time a pod restarts mid-task, you lose everything.

## Where each layer wants to live

Not all of these live in the same store. Not all of them want the same shape. A summary of what each layer asks of you in production:

| Layer | Storage shape | Write trigger | Read pattern |
|---|---|---|---|
| Working | In-memory scratchpad | Per-turn | Native context window |
| Conversation | Append-only log + summariser | Every message | Auto-loaded on each call |
| Episodic | Time-indexed transcript + summary | After every message (background) | Recency-weighted retrieval |
| Semantic, notebook | Single editable markdown file | Agent calls edit deliberately | Full text injected into prompt |
| Semantic, facts | Graph DB (Neo4j class) | Auto-extracted post-message | Entity-anchored search |
| Knowledge graph | Same graph DB, different edges | Auto-extracted with the facts | Walk from entity to related node |
| Procedural | Markdown skill files (path-routed) | Author or self-reflection | Loaded per task type |
| Checkpoints | KV / Postgres / workflow engine | Every step | Resume on restart |

## The wiring problem

Naming the seven layers is the easy part. Wiring them so they do not contaminate one another is where we lost the most time.

### Episodic must not leak into semantic

If every line of yesterday's conversation gets extracted as a "fact," the agent ends up believing a brainstorm transcript is the truth. Tighten the minimum-confidence threshold or run extraction on summarised episodes, not raw transcripts. Better to lose a fact than to bake a wrong one into the graph.

### Conversation must not leak into the knowledge graph

The graph wants stable claims with provenance. The conversation is full of throwaway phrasing. If you ingest every message verbatim, the graph fills with garbage edges that survive long after the conversation that produced them. Skip ingestion below a length threshold. A one-line "thanks" is not worth ingesting.

### The notebook overrides extracted facts

Hand-edited semantic memory beats auto-extracted semantic memory when they disagree. The notebook is the operator's stake in the ground. The extracted graph is the noisy long tail. If the notebook says "the company name is sistava," no extracted edge should be allowed to drag the agent back to "sista AI" three weeks later.

### Checkpoints stay cheap

Checkpoints have to fire on every step. If they get heavy, you stop firing them, and the agent loses the ability to resume a long task. The fix is not to checkpoint less. The fix is to make each checkpoint smaller. Store deltas, not full state, where you can.

## Cost is the real limit

Some of these layers are nearly free. Some can be ruinous if you let them run unbounded. Knowledge graph ingestion in particular takes five to nine LLM calls per message, entity extraction, deduplication, relationship extraction, contradiction detection per fact, entity summary update, plus two to four embedding calls. At our scale, multiplied across every conversation across every employee, this is the layer most likely to set fire to your runway.

### How we bound it

- Tenant-level kill switch. A flag on the tenant row that disables auto-extraction entirely. Default off for trial accounts.
- Length-gated ingestion. Messages below a configurable character threshold do not get ingested.
- Short-circuit search. If the graph has no 'Employer' entity yet (fresh tenant), skip the embedding + reranker calls entirely. Saves a call on every message during the cold-start period.
- Skip on failure. If the main agent execution failed, do not ingest. The episode is not worth committing.
- Skip on delegation. Employee-to-employee chat does not get ingested. Only employer-employee.
- Separate billing line. Every memory-layer LLM call is tagged with a different action type from main-agent calls so it is visible in usage reports.

**Kill switches save you.** Every layer that runs unattended needs a way to be turned off without redeploying. A flag in config. A per-tenant toggle. Anything. You will find out something is broken at the worst time, and the first thing you want to do is stop the bleeding. Bake the kill switch in before you turn the system on.

## What this looks like for an operator

Reading this from the operator side, the person who would hire AI employees rather than build them, what the seven layers actually buy you is an agent that gets better at your team's specific work over time. Memory that compounds. Preferences that stick. Mistakes that do not repeat. Context that carries from sprint to sprint.

A marketing employee remembers the brand voice you corrected last week. A sales employee remembers which customer raised which objection two months ago. A support employee remembers the workflow you walked them through on Monday. None of that happens with a vector store alone, and none of it is theoretical for us; the layers in this piece are what makes that behaviour show up reliably in production.

## What we actually use

Concrete stack, in case it is useful. None of this is locked in, we have already swapped pieces of it once and would not be surprised to swap more.

| Layer | Pick from (3 mature options) | Why this shape |
|---|---|---|
| Conversation + Checkpointer | Any modern agent framework with a persistent (not in-memory) checkpointer, there are at least three good ones | Free with the framework. Postgres is enough as the backing store. |
| Notebook + Skills + Documents | A filesystem-style abstraction the agent can call edit / write on, bundled with most agent frameworks, or roll a thin one over a path-routed KV store | Single substrate, agent treats it like a real filesystem. |
| Episodic + KG facts | A temporal-aware knowledge graph library on top of a graph DB, several open-source options today | Temporal edges. Contradiction detection. Months of work you do not want to do yourself. |
| Trained knowledge from external docs | A document-ingestion pipeline that builds a graph + vector index, several open-source options | Separate ingestion path for Notion / Slack / Drive / URLs. Re-uses the same graph DB. |
| Long-running tasks + checkpoints | Temporal (we use this one, it earns its keep), or a Postgres-backed checkpointer from your agent framework | Free checkpointing at every activity boundary. |
| Graph DB | Neo4j, Memgraph, KuzuDB | Same query model. Pick by ops preference. |
| Vector store | Qdrant, Weaviate, pgvector | All three handle the embedding sizes you will throw at them. pgvector wins on dependency simplicity if you are already on Postgres. |

Two separate ingestion paths feed long-term memory. One grows from conversation (what the agent learns while it works). The other grows from documents you upload (Notion / Slack / Drive / URLs). They share the graph DB as a substrate but write different label sets. At read time they are merged into a single context block before the agent sees it.

## What we would do if rebuilding tomorrow

If we were starting over knowing what we know now, here is the build order we would take.

- Pick the layers first. Do not pick a framework first. Map the seven concerns to your product before you write a line.
- Postgres for conversation history + checkpoints. Boring is fine.
- Path-routed key-value store for the notebook, skills, and documents, let the agent treat it like a filesystem.
- Graph DB for the knowledge graph, pick one of three: Neo4j, Memgraph, KuzuDB. Pair it with an off-the-shelf temporal KG library so contradictions invalidate old edges instead of deleting them.
- Vector store for trained-document embeddings, pick one of three: Qdrant, Weaviate, pgvector. pgvector is enough if you want to keep dependencies tight and you are already on Postgres.
- Temporal for any long-running task. The free checkpointing is worth the operational cost.
- Kill switches and per-tenant gates from day one. Not month two.

## The problems we have not solved yet

Memory is far from finished, even for us. A few open problems we are still working on:

- Forgetting properly. We have eviction policies for episodes, but choosing what to forget without losing important context is still mostly heuristic.
- Cross-employee memory. When two AI employees on the same team need to share what they have learned about the same customer, the current answer is a shared knowledge graph scoped by tenant. The cross-talk discipline is still being tuned.
- Memory that explains itself. The agent should be able to say 'I know this because I read it in your Notion on March 4.' We log the provenance, but surfacing it cleanly inside a response is harder than it sounds.
- Memory boundaries. The agent should not remember a one-off frustrated message as a stable fact. Getting the LLM extraction layer to be conservative without going silent is an ongoing tuning problem.

## Or just one of them, for now

Most people who need an AI employee do not need a full team on day one. They need one, a single assistant scoped to their work, learning their preferences, picking up their tools, remembering across weeks the way a junior hire would.

The same seven memory layers are at work behind the simpler assistant variant. Same notebook. Same episodic store. Same temporal-aware knowledge graph. The interface is narrower because the job is narrower, not because the memory is.

Hire one, brief them like a junior teammate for a week, and the memory layer does its job in the background. By the third week, you stop re-explaining things. That is the entire point of designing memory as seven concerns instead of one.

## One last thing

Memory is not a feature you bolt on. It is the substrate of an agent that gets better at your work over time instead of starting over every Monday.

Our AI employees  run on the layout in this piece. Two weeks in, the difference is not that the agent recalls more, it is that it stops re-asking. Preferences hold. Corrections stick. Old conversations resurface exactly when they should.

That is the entire point of treating memory as seven concerns instead of one.

## If you are building this yourself

If you have read all the way to here, you are probably thinking about building something like this for yourself or your team. The seven layers, the wiring, the cost ceilings, the kill switches, it is a lot to get right the first time.

We are happy to help. If you want someone to come in and design the memory architecture for your product or team, same patterns described in this piece, adapted to your stack, you can have a look at Sista AI's consulting.

Otherwise, if you would rather talk to an engineer directly, I am Mahmoud Zalt, the founder behind this platform, I take a small number of these conversations personally, and the engagement is whatever shape makes sense for the problem. You can reach me directly at zalt.me.

Either way, build the seven layers. Wire them properly. It is the architecture under everything an agent does between turns.

**Tags:** ai-agents, memory, ai-employees, architecture, agentic-ai, knowledge-graph, vector-store