Question, by Mahmoud Zalt
How AI agents build other AI agents in practice: prompt templates, tool wrappers, evals, versioning, and how Sistava ships this out of the box.
An agent that builds another agent is not magic: it is a loop where one LLM-driven process emits the four artifacts that define a working hire. Those artifacts are a system prompt (who the agent is and how it behaves), a tool list with input schemas (what it can do), a test set (what good output looks like), and a version tag (which build is live). The parent agent reads a brief, picks templates, fills slots, registers tools, generates evals, and writes a manifest. The child agent then runs against the evals before it sees a real user. The interesting part is that the parent agent uses the same loop on itself: when its scaffolding produces bad children, it updates its own templates and tool schemas. This recursive shape is why the meta-agent pattern matters more than any single new model release: the floor for agent quality keeps rising while you sleep, as long as the loop is sound.
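The loop above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular platform implements it: `AgentBuild`, `EvalCase`, `build_child`, `passes_gate`, and the shape of the `brief` dictionary are all hypothetical names, and the substring check is a deliberately crude stand-in for "what good output looks like".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # crude stand-in for an expected behavior


@dataclass(frozen=True)
class AgentBuild:
    system_prompt: str  # who the agent is and how it behaves
    tools: dict         # tool name -> input schema
    tests: tuple        # EvalCase items that gate the build
    version: str        # which build is live


def build_child(brief: dict) -> AgentBuild:
    """Parent step: turn a brief into the four artifacts (hypothetical brief shape)."""
    prompt = f"You are {brief['role']}. Goal: {brief['goal']}."
    tools = {name: {"type": "object"} for name in brief["tools"]}
    tests = tuple(EvalCase(p, expected) for p, expected in brief["examples"])
    return AgentBuild(prompt, tools, tests, version="0.1.0")


def passes_gate(build: AgentBuild, run_child: Callable[[str, str], str]) -> bool:
    """The child must pass every eval case before it sees a real user."""
    return all(
        case.must_contain in run_child(build.system_prompt, case.prompt)
        for case in build.tests
    )
```

Here `run_child` is whatever function actually executes the child agent; passing it in keeps the gate independent of the runtime you choose.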
Prompt templates are reusable skeletons with slots: role, goals, constraints, voice, escalation rules, and a list of allowed tools. The parent agent fills these slots from the brief, then runs the filled prompt through a linter that checks for banned patterns (leaking system text, contradicting constraints, role drift). Tool wrappers are the second half of the same picture: every tool the child agent can call is registered with a typed schema, a description the LLM reads, a permission scope, and a cost ceiling. The wrapper is the safety boundary, not the prompt. Frameworks like LangChain, LangGraph, and CrewAI give you primitives for both, and n8n gives you a visual surface for the same idea. The hard part is not the primitives: it is the registry that says which template version, which tool version, and which model version belong together. Without that registry, every fix breaks something else next week.
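As a rough illustration of the slot-filling and linting step, here is a sketch built on Python's standard `string.Template`. The skeleton text, the slot names, and the two banned patterns are assumptions made up for this example; a real linter would carry a much longer and more careful rule set.

```python
import re
from string import Template

# Hypothetical skeleton: one slot per item named in the paragraph above.
SKELETON = Template(
    "Role: $role\nGoals: $goals\nConstraints: $constraints\n"
    "Voice: $voice\nEscalation: $escalation\nAllowed tools: $tools"
)

# Hypothetical banned patterns; real rules would be broader.
BANNED_PATTERNS = {
    "leaks_system_text": re.compile(r"reveal (your|the) system prompt", re.I),
    "ignores_constraints": re.compile(
        r"ignore (all|any|previous) (constraints|instructions)", re.I
    ),
}


def fill_prompt(brief: dict) -> str:
    """Fill every slot from the brief; a missing slot raises KeyError."""
    return SKELETON.substitute(brief)


def lint_prompt(prompt: str) -> list:
    """Return the names of the banned patterns found in the filled prompt."""
    return [name for name, pattern in BANNED_PATTERNS.items() if pattern.search(prompt)]
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a brief that forgets a slot fails loudly at build time instead of shipping a prompt with a literal `$voice` in it.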
Role, goals, constraints, voice, and tool list compiled from a reusable skeleton.
Every tool registered with a schema, permission scope, and a budget ceiling.
Prompts plus expected behaviors that gate the build before it sees a real user.
Locked combination of prompt, tools, model, and tests with a rollback target.
Tracing wired into Langfuse or similar so each run is inspectable post-deploy.
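To make "the wrapper is the safety boundary" concrete, here is a minimal sketch of a tool wrapper that enforces a schema, a permission scope, and a budget ceiling before the underlying function runs. `ToolWrapper`, `ToolDenied`, and the flat name-to-type schema are hypothetical simplifications, not the API of LangChain, CrewAI, or any other framework.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolDenied(Exception):
    """Raised when a call violates the schema, the scope, or the budget."""


@dataclass
class ToolWrapper:
    name: str
    description: str       # the text the LLM reads when choosing a tool
    schema: dict           # argument name -> required Python type (simplified)
    scope: str             # permission the caller must hold
    cost_per_call: float
    budget_ceiling: float
    fn: Callable[..., Any]
    spent: float = 0.0

    def call(self, granted_scopes: set, **args: Any) -> Any:
        if self.scope not in granted_scopes:
            raise ToolDenied(f"{self.name}: missing scope {self.scope}")
        if set(args) != set(self.schema):
            raise ToolDenied(f"{self.name}: expected args {sorted(self.schema)}")
        for key, expected in self.schema.items():
            if not isinstance(args[key], expected):
                raise ToolDenied(f"{self.name}: {key} must be {expected.__name__}")
        if self.spent + self.cost_per_call > self.budget_ceiling:
            raise ToolDenied(f"{self.name}: budget ceiling reached")
        self.spent += self.cost_per_call  # charge only calls that pass every check
        return self.fn(**args)
```

Every check happens outside the prompt, so a child agent that is talked into misusing a tool still hits the same wall.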
Testing an agent is not the same as testing a function: outputs are stochastic, success often spans multiple turns, and the costliest failures are silent (the agent confidently does the wrong thing). The pattern that works is a layered eval set. Start with unit-style checks on tool calls (did the agent call the right tool with the right arguments), move to single-turn quality checks (does the response follow the persona, hit the format, stay within guardrails), then multi-turn scenario tests (does the conversation reach the goal in under N steps), and finally regression checks (does the new build still pass the eval set the previous build passed). Use a judge model for the qualitative checks, but pin the judge model and pin the rubric. Run the whole set on every prompt change, every tool change, and every model change. If you skip the regression layer, you will ship the same bug three times in three sprints.
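Three of those layers are deterministic and easy to sketch. The helpers below are hypothetical and intentionally small: they cover the tool-call check, a multi-turn goal check, and the regression check, and they leave out the judge-model layer entirely because that depends on the model and rubric you pin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


def check_tool_call(actual: ToolCall, expected: ToolCall) -> bool:
    """Unit-style layer: right tool, right arguments."""
    return actual.name == expected.name and actual.args == expected.args


def check_scenario(turns: list, goal_marker: str, max_steps: int) -> bool:
    """Multi-turn layer: the goal marker appears within the first max_steps turns."""
    return any(goal_marker in turn for turn in turns[:max_steps])


def regressions(previous: dict, current: dict) -> list:
    """Regression layer: case IDs the previous build passed that the current one
    fails or no longer runs."""
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and not current.get(case_id, False)
    )
```

Treating a missing case as a regression is a design choice in this sketch: quietly dropping a test should block a release just like failing it.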
Most teams stall here not because of the test design but because of the cost of running the loop on every change. Real eval sets get expensive fast, and the temptation to cut corners (skip multi-turn, sample fewer prompts, downgrade the judge) compounds quietly. The clean way out is to cache deterministic test outputs, batch the LLM-judged ones, and treat the eval bill as a non-negotiable line in the infra budget. If you cannot run the test suite, you cannot ship a versioned release, and if you cannot ship a versioned release, you have a science project, not a product.
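One way to picture the caching and batching is the sketch below. It assumes a deterministic check depends only on the bundle version and the test case, which is what makes caching safe; `EvalCache` and `batches` are hypothetical helpers, and the in-memory dictionary stands in for whatever store you would actually use.

```python
import hashlib
import json


class EvalCache:
    """Caches results of deterministic checks, keyed by bundle version and case."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # number of times the check actually ran

    @staticmethod
    def key(bundle_version: str, case: dict) -> str:
        payload = json.dumps({"bundle": bundle_version, "case": case}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, bundle_version: str, case: dict, check) -> bool:
        k = self.key(bundle_version, case)
        if k not in self._store:
            self.misses += 1
            self._store[k] = check(case)
        return self._store[k]


def batches(items: list, size: int) -> list:
    """Group LLM-judged cases so they can be submitted together."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Because the bundle version is part of the key, any prompt, tool, or model change that bumps the version invalidates the cache automatically.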
Now to the part of the problem most posts skip: versioning. A working agent is the combination of a specific prompt, a specific tool registry, a specific model, and a specific eval set. Changing any one of those four without bumping the bundle is how you wake up to a hire that quietly forgot how to do its job. The next section describes the shape of versioning that actually holds up in production, and it is the same shape Sistava uses internally to keep hires stable across model swaps and skill updates.
Versioning matters because agents are bundles, not files. A hire that worked yesterday can break today because the model provider rolled out a new snapshot, a tool integration changed an argument name, a guardrail was tightened, or a teammate edited the persona. Without a versioned bundle, you cannot tell which change broke it and you cannot roll it back. The fix is straightforward in shape and hard in discipline: every release locks the prompt version, the tool registry version, the model name plus snapshot, the eval set version, and the guardrail policy version into one manifest with a single semver tag. Promotions go from staging to canary to production, and every step runs the full eval set. Rollback is one command. Frameworks like LangChain, CrewAI, and LangGraph give you the runtime; you still have to bring the manifest, the registry, the canary policy, and the rollback path yourself.
Exact text plus template slots, with diffs reviewable like code.
Schemas, scopes, and budgets, all locked at the version tag.
Provider plus dated snapshot, not a moving alias that drifts under you.
The exact tests that gated the release, kept around for the next regression run.
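A manifest along these lines can be sketched as a small frozen record plus three helpers. Everything here is hypothetical: the field names, the plain `MAJOR.MINOR.PATCH` tag parsing (no pre-release suffixes), and the `run_full_evals` callback that stands in for your real eval runner.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Manifest:
    tag: str                       # single semver tag for the whole bundle
    prompt_version: str
    tool_registry_version: str
    model: str                     # provider plus dated snapshot, not an alias
    eval_set_version: str
    guardrail_policy_version: str


def changed_fields(old: Manifest, new: Manifest) -> list:
    """Name the components that differ, so a break can be traced to a change."""
    a, b = asdict(old), asdict(new)
    return sorted(k for k in a if k != "tag" and a[k] != b[k])


def bump_is_valid(old: Manifest, new: Manifest) -> bool:
    """Any component change must ship with a strictly higher tag."""
    def parse(tag: str) -> tuple:
        return tuple(int(part) for part in tag.split("."))
    if not changed_fields(old, new):
        return True
    return parse(new.tag) > parse(old.tag)


STAGES = ("staging", "canary", "production")


def promote(manifest: Manifest, run_full_evals) -> str:
    """Walk the stages in order; return the last stage whose eval run passed."""
    reached = "none"
    for stage in STAGES:
        if not run_full_evals(manifest, stage):
            break
        reached = stage
    return reached
```

Rolling back then amounts to pointing production at the previous manifest, which only works because every component in it is pinned.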
If you have an engineering team and a quarter to spend, you can absolutely build this with CrewAI for the agent loop, LangGraph for state machines, LangChain or LlamaIndex for tooling, Langfuse for tracing, and a homegrown registry for versioning. n8n covers the visual workflow surface, and Apollo or Clay can plug into the data side. The honest tradeoff: you are building a platform, not a hire. The first hire ships in week six, the second in week seven, and the third never ships because someone has to maintain the registry. If you would rather skip that quarter, Sistava ships the meta-loop already wired: the AI Team Leader scaffolds the hire from a brief, picks templates from a curated catalog, registers tools with scopes and budgets, generates a baseline eval set, and tags every release into a manifest you can roll back. You bring the brief; the platform brings the platform.
An agent can build another agent, but only inside a loop that includes templates, typed tool wrappers, an eval set, and a versioned manifest. Without those four, an agent generating another agent is a demo, not a workflow. The loop is what makes it durable.
You need one runtime to orchestrate the loop. CrewAI, LangChain, and LangGraph are all reasonable picks if you are building yourself. n8n covers the visual workflow case. Sistava bundles a runtime with the templates, registry, and versioning already wired.
Cache deterministic checks (tool-call assertions, schema validation), batch the LLM-judged ones, and pin the judge model. Run the full suite on releases and a smaller smoke set on every change. Treat the eval bill as fixed infra cost.
The usual culprits when a hire breaks are model snapshot drift, silent tool schema changes, prompt edits without a version bump, and integrations changing auth or rate limits. All four are versioning failures, which is why the manifest matters more than any single template.
Sistava runs an AI Team Leader that scaffolds hires from a brief, registers tools, generates a baseline eval set, and tags every release. You can inspect the manifest per hire. The honest caveat: the template catalog is still growing, so very niche roles need the brief tuned by hand.
If you want the next layer down on how a meta-agent decides which template to pick, which tools to register, and how it handles a brief it has never seen before, the practical companion to this article walks through the routing logic and the failure modes that show up first. It is the read I wish I had handed myself eighteen months ago before I rebuilt the scaffolding loop three times.
The honest takeaway: agents that build agents are not a future capability; they are a pattern you can run today with the right plumbing. The four artifacts (prompt, tools, tests, version) are the whole game, and almost every production failure I have seen ties back to one of them being missing or unpinned. If you have an engineer and the runway, build the registry yourself and own it. If you would rather hire the first AI Employee this week and let the platform handle scaffolding, tests, and versioning, Sistava ships the loop already wired and lets you focus on the brief. Either path works. The pattern that does not work is treating an agent like a static prompt and hoping it stays good. It will not, and the manifest is how you know.