Sistava

How to Build an AI Agent That Actually Works (A Practical Guide for Builders)

Engineering — by Sistava

An honest, opinionated guide to building AI agents that survive contact with real work - the loop, the memory, the tools, the evals, and the parts most tutorials skip.

There are hundreds of tutorials that show you how to wire a language model to a couple of tools and call it an agent. Most of them produce something that demos beautifully in 30 seconds and collapses the moment a real user gives it a real, messy task.

This is the practical, battle-tested guide I wish I had eighteen months ago. It is not another framework tour. It is the actual architecture of a working agent - the loop, the dual memory systems, the tool surface, the decision patterns, the evals, and the production hardening details that almost nobody talks about. If you are about to build, read this first. If you have already built and it keeps breaking in production, this is almost certainly where it is breaking.

What an agent actually is

Strip the marketing and an agent is four things in a trench coat:

Everything else - multi-agent orchestration, sub-agents, planners, reflection - is a variation on one of those four. If you cannot ship a single-agent version of a feature, a multi-agent version of the same feature will be worse, just more expensive.

Part 1 - The loop

The smallest useful agent loop looks roughly like this:

That is ninety percent of every agent framework on the market. The complexity is not in the loop itself. The real engineering lives in how you build the tool schema, how you manage the growing message list, and the guardrails you add to stop the loop from destroying your budget or getting stuck.

The three early failure modes:

Step limits, token budgets, retry counts. These are not optional. They are the bumpers that keep the agent from setting your wallet on fire while you sleep.

Part 2 - Memory

This is where most agents quietly die.

You have two memories, and treating them like one is the most common bug in the space.

Short-term memory - the message list

This is what the LLM sees on each call. Every turn, every tool call, every tool result accumulates here. By step 10 of a complex task you are looking at 30,000 tokens of context, half of it stale tool output, and the model starts hallucinating because it cannot find the signal in the noise.

The fix is summarization with anchoring. When the message list exceeds a threshold, fold the middle into a structured summary while keeping the original goal, the latest few turns, and any explicitly bookmarked artifacts untouched.

The summary itself is just another LLM call against the messages-to-fold, with a tight prompt that demands facts and decisions, not narrative. The summarizer prompt is the most under-rated piece of agent infrastructure I know - get it right and a 50-step agent runs as cleanly as a 5-step one.

Long-term memory - the facts you keep

This is what survives across runs. It is also where everyone over-engineers.

You do not need a vector database on day one. You need a place to write down the things the model should remember about the user, the project, and previous decisions. A markdown file with a structured schema, scoped per user, beats a vector store for the first thousand users of almost any product. Reads are cheap, writes are explicit, and you can hand-edit it when the model gets a fact wrong.

The pattern is:

Vector retrieval becomes useful when (a) your notes outgrow the context window, or (b) you have heterogeneous sources (call transcripts, email threads, documents) where keyword/structure won't cut it. Until then, you are paying complexity tax for capability you do not yet need.

Part 3 - The tool surface

A good agent is mostly a good tool surface.

Three rules I keep relearning:

Rule 1: small surface, sharp tools. Twelve tools beats forty. Each tool name and description goes into the system prompt of every LLM call - they are tokens you pay for forever. If two tools do roughly the same thing the model will pick the wrong one about 30% of the time. Merge them, or hide one behind the other.

Rule 2: outputs in the model's language. Tool results are not JSON for machines, they are prose for the model. A tool that returns {"status": "ok", "data": [...]} is fine. A tool that returns a concise English summary plus structured data is much better - the model can immediately reason about it without spending a step on translation.

Rule 3: idempotency where possible. The loop will retry. Tools that are not idempotent (sending email, billing a card, creating a database row) need a deduplication key passed in by the caller, not generated inside the tool. If the model retries with the same key, the second call is a no-op. This single rule prevents most "agent sent the same email five times" disasters.

There are six tool categories you eventually need:

The mistake is to start with category 6 (multi-agent orchestration). Start with 1, 2, and 4 - a read, a write, and a search. Make that boring agent excellent before you reach for the fancy stuff.

Part 4 - Decision quality

The agent's job is to make decisions. Most "the agent isn't working" complaints are decision-quality problems disguised as infrastructure problems.

Three patterns I keep coming back to:

Plan, then act. For any task longer than three steps, force the agent to write a plan first. A bullet list, three to seven items, expressed in actions ("call X, then if Y, do Z"). The plan goes into the context for every subsequent step. Models are dramatically better executors when they have committed to a plan than when they are improvising step by step.

Reflect at checkpoints. Every N steps (or after any tool call that failed), force the agent to answer: "Am I making progress on the original goal? If not, what should change?" This adds one LLM call but catches the most expensive failure mode - agents who keep working hard on the wrong thing.

Constrain the action space. When the agent has just received a tool result, the next action should usually be one of: continue with the plan, update the plan, or ask the user a clarifying question. Most agents have an effectively infinite action space at every step - narrowing it through prompt structure cuts off whole classes of weird behavior.

Part 5 - Evals or it didn't happen

If you cannot answer "is the agent getting better or worse?" with a number, you do not have an agent - you have a demo.

Evals do not need to be fancy. They need to be frozen test cases run on every model or prompt change. A few dozen real examples, each with a known good answer or a check function, is enough to catch 80% of regressions. The pattern:

The four numbers that matter:

Trace inspection is the single highest-leverage tool an agent team has. You want to be able to click on any case, any decision, any tool call, and see the prompt, the response, the tokens, and the timing. Without that, you are debugging blind.

Part 6 - Production hardening

Stuff the tutorials never cover:

These are the things that turn a working demo into a working product. None of them are interesting. All of them are the difference between a system you can run and a system that runs you.

A word on choosing your stack

You will be tempted to start with a big framework. Resist for a week.

You have three practical paths. You can build everything from scratch (raw Python + LLM API), you can use a framework like LangGraph to handle the loop and state management at a higher level of abstraction, or you can skip the infrastructure entirely and use a complete platform.

Build the smallest end-to-end version in your own code first - 200 lines, one model, three tools, a flat memory file, three eval cases. You will learn more about how agents fail in those 200 lines than in a month of tutorial videos. Once you understand the pain points, decide: stay raw, add a framework like LangGraph for better stateful graphs and persistence, or move to a full platform.

For tool category choices, prefer:

Optimize for being able to read every line of code your agent runs through. The moment you can't, you have lost.

The two paths from here

If you are a builder who wants to learn this craft, build the dumb version yourself. Two hundred lines of Python, three tools, a flat notes file, a JSON eval set. You will understand agents better in a weekend than most people who ship agents into production ever do.

If you are a founder who wants the outcome - agents that do real work for you, with the loop, memory, tools, evals, budgets, audit trails, and human-in-the-loop already wired - you can skip the build. Sistava is what we built so you don't have to: every part of this article is in there already, configured, observed, and ready to be told what to do. You hire a team, give it a goal, and the agents run. The reasoning, the memory, the tool calls, the costs - all visible, all inspectable, all yours.

Either path is fine. The wrong path is the one in the middle: cobbling together a framework you didn't read, deploying it without evals, and discovering at 2 a.m. on a Tuesday that it has been sending the same email to your top customer for six hours.

Pick a path. Then ship something tiny. Then make it good.