Sistava is an AI workforce platform where solo founders hire AI employees to run their business around the clock. Each AI employee has a specific role like sales, marketing, or customer support, with real tool integrations, persistent memory, and the ability to work inside your existing apps like Slack, Gmail, and HubSpot.

What is an AI employee?

An AI employee is an autonomous AI agent with a defined role, persona, skill set, and tool access. Unlike a chatbot that only answers questions, an AI employee takes on recurring work like writing emails, qualifying leads, answering support tickets, and publishing content, and it works on its own around the clock without being prompted each time.

How is Sistava different from project management software?

Sistava is not project management software. You hire AI employees who do the work, not a tool that tracks work done by humans. Your AI employees run sales outreach, write marketing content, answer support tickets, and handle operations on their own, without constant supervision.

How much does Sistava cost?

Sistava has a free plan you can start without a credit card, plus paid plans that scale with how much work you hand to your AI employees. See the pricing page for current plans.

What can AI employees do on Sistava?

Your AI employees take on the recurring work that runs a business: qualifying and reaching out to leads, writing and publishing marketing content, answering support tickets, and handling day to day operations. Each one comes with a role and skill set, so it can start working the day you hire it.

Sistava is built for solo founders and small teams who need to run sales, marketing, support, and operations without hiring a full human team. It gives you the equivalent of a growth team you can hire in minutes.

How to Build an AI Agent That Actually Works (A Practical Guide for Builders)

Engineering — 2026-05-23 — by Mahmoud Zalt

An honest, opinionated guide to building AI agents that survive contact with real work - the loop, the memory, the tools, the evals, and the parts most tutorials skip.

There are hundreds of tutorials that show you how to wire a language model to a couple of tools and call it an agent. Most of them produce something that demos beautifully in 30 seconds and collapses the moment a real user gives it a real, messy task.

This is the practical, battle-tested guide I wish I had eighteen months ago. It is not another framework tour. It is the actual architecture of a working agent - the loop, the dual memory systems, the tool surface, the decision patterns, the evals, and the production hardening details that almost nobody talks about. If you are about to build, read this first. If you have already built and it keeps breaking in production, this is almost certainly where it is breaking.

What an agent actually is

Strip the marketing and an agent is four things in a trench coat:

A loop. A language model that receives a goal, decides on an action, calls a tool, reads the result, and decides what to do next. The loop runs until the goal is met or a budget runs out.
A memory. Two kinds, actually - short-term (the conversation context you build for each LLM call) and long-term (the facts you persist across runs).
A tool surface. A finite set of functions the model can call. The shape and quality of these tools dictates 70% of how good the agent is in production.
An evaluation harness. Some way to know when the loop is working, when it is degrading, and which change in the prompt or the tool surface caused the regression.

Everything else - multi-agent orchestration, sub-agents, planners, reflection - is a variation on one of those four. If you cannot ship a single-agent version of a feature, a multi-agent version of the same feature will be worse, just more expensive.

Part 1 - The loop

The smallest useful agent loop looks roughly like this:

That is ninety percent of every agent framework on the market. The complexity is not in the loop itself. The real engineering lives in how you build the tool schema, how you manage the growing message list, and the guardrails you add to stop the loop from destroying your budget or getting stuck.

The three early failure modes:

Infinite loops. The model calls the same tool with the same arguments forever. Mitigation: a deduplication check on (name, hashed-args) per run, plus a hard step limit. Step limit alone is not enough - a 20-step loop of the same tool call still costs you 20 LLM calls.
Cost blowouts. Each step is a full LLM call with the full message history. By step 15 your context is huge. Mitigation: token budget per run, hard kill above it, and aggressive context trimming (see memory section).
Tool errors swallowed. Tool throws an exception, you catch it, return "error" to the model, the model tries again - silently, forever, with no signal to you. Mitigation: every tool exception goes to your error tracker, every retry is counted, and three retries on the same (name, args) triggers a hard stop.

Step limits, token budgets, retry counts. These are not optional. They are the bumpers that keep the agent from setting your wallet on fire while you sleep.

Part 2 - Memory

This is where most agents quietly die.

You have two memories, and treating them like one is the most common bug in the space.

Short-term memory - the message list

This is what the LLM sees on each call. Every turn, every tool call, every tool result accumulates here. By step 10 of a complex task you are looking at 30,000 tokens of context, half of it stale tool output, and the model starts hallucinating because it cannot find the signal in the noise.

The fix is summarization with anchoring. When the message list exceeds a threshold, fold the middle into a structured summary while keeping the original goal, the latest few turns, and any explicitly bookmarked artifacts untouched.

The summary itself is just another LLM call against the messages-to-fold, with a tight prompt that demands facts and decisions, not narrative. The summarizer prompt is the most under-rated piece of agent infrastructure I know - get it right and a 50-step agent runs as cleanly as a 5-step one.

Long-term memory - the facts you keep

This is what survives across runs. It is also where everyone over-engineers.

You do not need a vector database on day one. You need a place to write down the things the model should remember about the user, the project, and previous decisions. A markdown file with a structured schema, scoped per user, beats a vector store for the first thousand users of almost any product. Reads are cheap, writes are explicit, and you can hand-edit it when the model gets a fact wrong.

The pattern is:

A persistent "notes" file the agent can read and write.
A pre-prompt step that injects the relevant section into the system prompt.
A post-step that lets the agent decide if anything from the conversation deserves a new line in the notes.

Vector retrieval becomes useful when (a) your notes outgrow the context window, or (b) you have heterogeneous sources (call transcripts, email threads, documents) where keyword/structure won't cut it. Until then, you are paying complexity tax for capability you do not yet need.

Part 3 - The tool surface

A good agent is mostly a good tool surface.

Three rules I keep relearning:

Rule 1: small surface, sharp tools. Twelve tools beats forty. Each tool name and description goes into the system prompt of every LLM call - they are tokens you pay for forever. If two tools do roughly the same thing the model will pick the wrong one about 30% of the time. Merge them, or hide one behind the other.

Rule 2: outputs in the model's language. Tool results are not JSON for machines, they are prose for the model. A tool that returns {"status": "ok", "data": [...]} is fine. A tool that returns a concise English summary plus structured data is much better - the model can immediately reason about it without spending a step on translation.

Rule 3: idempotency where possible. The loop will retry. Tools that are not idempotent (sending email, billing a card, creating a database row) need a deduplication key passed in by the caller, not generated inside the tool. If the model retries with the same key, the second call is a no-op. This single rule prevents most "agent sent the same email five times" disasters.

There are six tool categories you eventually need:

Read tools - fetch state from the world. Cheap, idempotent, retry-safe.
Write tools - change state in the world. Need dedup keys, audit logging, and ideally a draft-then-approve flow for high-stakes actions.
Compute tools - do something locally that does not touch the world (parse, summarize, classify). Often these are smaller LLM calls dressed as tools.
Search tools - retrieve from a corpus the model would not otherwise see (your notes, your docs, the web).
Browser / OS tools - full computer or browser control, when the work cannot be done through APIs alone.
Delegation tools - let the agent hand off a sub-task to another agent specialized for it. Use sparingly. Every delegation hop doubles your debugging surface.

The mistake is to start with category 6 (multi-agent orchestration). Start with 1, 2, and 4 - a read, a write, and a search. Make that boring agent excellent before you reach for the fancy stuff.

Part 4 - Decision quality

The agent's job is to make decisions. Most "the agent isn't working" complaints are decision-quality problems disguised as infrastructure problems.

Three patterns I keep coming back to:

Plan, then act. For any task longer than three steps, force the agent to write a plan first. A bullet list, three to seven items, expressed in actions ("call X, then if Y, do Z"). The plan goes into the context for every subsequent step. Models are dramatically better executors when they have committed to a plan than when they are improvising step by step.

Reflect at checkpoints. Every N steps (or after any tool call that failed), force the agent to answer: "Am I making progress on the original goal? If not, what should change?" This adds one LLM call but catches the most expensive failure mode - agents who keep working hard on the wrong thing.

Constrain the action space. When the agent has just received a tool result, the next action should usually be one of: continue with the plan, update the plan, or ask the user a clarifying question. Most agents have an effectively infinite action space at every step - narrowing it through prompt structure cuts off whole classes of weird behavior.

Part 5 - Evals or it didn't happen

If you cannot answer "is the agent getting better or worse?" with a number, you do not have an agent - you have a demo.

Evals do not need to be fancy. They need to be frozen test cases run on every model or prompt change. A few dozen real examples, each with a known good answer or a check function, is enough to catch 80% of regressions. The pattern:

The four numbers that matter:

Pass rate. Did the agent get the right answer?
Steps used. Did it take 4 steps or 18?
Cost. How much did each case cost in tokens + tool calls?
Trace URL. Where can a human go to inspect what the agent actually thought?

Trace inspection is the single highest-leverage tool an agent team has. You want to be able to click on any case, any decision, any tool call, and see the prompt, the response, the tokens, and the timing. Without that, you are debugging blind.

Part 6 - Production hardening

Stuff the tutorials never cover:

Per-tenant budgets. Each user / customer / project has its own daily and monthly LLM spend cap. The agent reads it from config, refuses to start a task that would exceed it, and tells the user clearly when it hits the wall. Without this, one runaway loop in one tenant burns your margin for the month.
Idempotency on inbound triggers. If your agent is triggered by a webhook or a schedule, that trigger can fire twice. Every inbound trigger carries an ID, every run records that ID, duplicates short-circuit.
Pause and resume. Long-running agents need to pause cleanly when they hit a "need human input" moment, and resume from where they left off when the human answers.
Background work isolation. If your agent kicks off a long-running job, that job runs in a separate worker with its own retry policy, not in the request path.
Audit trail. Every tool call, every model response, every cost number, captured in a queryable store. When a customer says "your agent did X yesterday, what happened?", you should be able to answer in 30 seconds, not 30 minutes.

These are the things that turn a working demo into a working product. None of them are interesting. All of them are the difference between a system you can run and a system that runs you.

A word on choosing your stack

You will be tempted to start with a big framework. Resist for a week.

You have three practical paths. You can build everything from scratch (raw Python + LLM API), you can use a framework like LangGraph to handle the loop and state management at a higher level of abstraction, or you can skip the infrastructure entirely and use a complete platform.

Build the smallest end-to-end version in your own code first - 200 lines, one model, three tools, a flat memory file, three eval cases. You will learn more about how agents fail in those 200 lines than in a month of tutorial videos. Once you understand the pain points, decide: stay raw, add a framework like LangGraph for better stateful graphs and persistence, or move to a full platform.

For tool category choices, prefer:

Hosted model providers over self-hosted at the start. Cost is dominated by quality and speed, not provider price.
Structured memory you can read with cat over vector databases you can only query.
Synchronous tool calls over async orchestration until you actually need parallelism.
One model family for now. Swap providers later when you have evals to prove the new one is better on your cases.

Optimize for being able to read every line of code your agent runs through. The moment you can't, you have lost.

The two paths from here

If you are a builder who wants to learn this craft, build the dumb version yourself. Two hundred lines of Python, three tools, a flat notes file, a JSON eval set. You will understand agents better in a weekend than most people who ship agents into production ever do.

If you are a founder who wants the outcome - agents that do real work for you, with the loop, memory, tools, evals, budgets, audit trails, and human-in-the-loop already wired - you can skip the build. Sistava is what we built so you don't have to: every part of this article is in there already, configured, observed, and ready to be told what to do. You hire a team, give it a goal, and the agents run. The reasoning, the memory, the tool calls, the costs - all visible, all inspectable, all yours.

Either path is fine. The wrong path is the one in the middle: cobbling together a framework you didn't read, deploying it without evals, and discovering at 2 a.m. on a Tuesday that it has been sending the same email to your top customer for six hours.

Pick a path. Then ship something tiny. Then make it good.