Cost matched to skill
Trivial calls go private and cheap. Frontier calls only when the task actually needs frontier skill.
Guide — — by Mahmoud Zalt
A hybrid private and public LLM architecture for AI Employees routes sensitive work to private models and high-skill work to public ones to balance accuracy, cost, and privacy.
A hybrid private and public LLM architecture is a workforce design where every AI Employee call is routed, at runtime, to either a privately hosted model or a public frontier API based on the type of work being done. The private side runs open-weight models (Llama, Mistral, Qwen, DeepSeek) on infrastructure you control, so the prompt never leaves your perimeter. The public side calls OpenAI, Anthropic, or Google when the task needs frontier-level reasoning, coding, or multi-step planning that smaller hosted models still cannot match. A routing layer decides which side handles which call using three signals: the data classification of the prompt, the skill ceiling the task needs, and the cost budget for that employee. Done well, the user never sees the routing, the sensitive data stays inside the private side, and the cost curve flattens because frontier calls are reserved for the few tasks that actually need them.
Single-model setups force a bad tradeoff on every call. Pick a public frontier model for everything and you pay frontier pricing on trivial tasks and pipe every prompt (including ones that contain customer names, contracts, or internal docs) to a third-party API. Pick a private hosted model for everything and you save money but cap the workforce at whatever the open-weight ceiling happens to be that quarter, which still trails the best public models on hard reasoning. A hybrid setup lets the cheap, safe path absorb the bulk of calls (summaries, drafts, classification, lookups, retrieval-augmented answers) while the expensive, capable path is reserved for the genuinely hard ones (complex planning, code generation, multi-document analysis). The AI Employee feels uniformly competent because the user never knows which side answered, and the founder feels the difference on the invoice and in the privacy posture.
Trivial calls go private and cheap. Frontier calls only when the task actually needs frontier skill.
Anything touching customer records or internal docs is pinned to the private side by policy.
Hard work still gets frontier models, so the workforce ceiling matches whatever leads the leaderboard.
If one provider has an outage, the router falls back to the other side without the user noticing.
Every call logs which model answered, why the router picked it, and what data classification it carried.
The router is the part that earns its keep. In practice it is a small decision component sitting in front of every model call, not a separate AI Employee, because you do not want one more LLM thinking about the question before the real one answers. It reads three signals on every call and picks a destination. First, data classification: if the prompt contains anything tagged sensitive (customer PII, internal documents, billing data, credentials), the call pins to the private side and no public model sees it. Second, skill ceiling: tasks marked as reasoning-heavy, multi-step, or code-generation route to the strongest available public model unless the data class blocks it. Third, cost budget: each AI Employee has a monthly ceiling, and as it nears the cap, the router gradually shifts borderline calls from public to private until the budget resets. The decision happens in single-digit milliseconds and is fully logged.
The architecture also has to handle the cases where the easy path is wrong. If a private model returns a low-confidence answer on a task that mattered, the router can escalate to the public side as a second pass, with the sensitive parts of the prompt scrubbed first. If a public call times out or rate-limits, the router falls back to the private side rather than failing the user. These are not exotic edge cases. They are the daily reality of running a workforce on top of someone else's API, and getting them right is the difference between an architecture and a slide deck.
Most founders evaluating this question are not asking it in the abstract. They want to know whether a hybrid setup actually changes the bill at the end of the month and whether their accountant, lawyer, or compliance officer will calm down once it is in place. The next two sections are the honest version of both answers, based on running the same architecture inside Sistava for real customers and on my own dogfood traffic.
Real savings depend on the shape of your workload, but the pattern is consistent. In my own logs at Sistava, roughly seventy to eighty percent of calls are routine work that a strong private model handles at near-parity with the frontier (drafting emails, summarizing threads, classifying tickets, retrieval question answering). Routing those to the private side typically cuts their per-call cost by an order of magnitude versus running them on a top public model. The remaining twenty to thirty percent of calls (real planning, hard reasoning, code, deep analysis) stay on public frontier models, which is where you want them. The net effect on a fully active AI Employee workload is a monthly bill that lands somewhere between thirty and sixty percent of the all-public number, with no measurable drop in quality on the work that actually shipped. The exact figure swings with your task mix, so treat that range as a starting hypothesis, not a guarantee.
Email replies, doc edits, formatting. Easy work, big volume. Routes private. Saves most of the bill.
Answering from your docs with retrieved context. Private models do this well, no public exposure needed.
Multi-step plans, decomposing a vague goal into tasks. Frontier-only on public, no real shortcut yet.
Generating non-trivial code or analyzing a messy dataset. Frontier public when data class allows.
Inside Sistava the hybrid setup is invisible to the founder using the product. You hire an AI Employee, you give it work, you see the answer. Underneath, every call goes through a model router that knows which AI Employee made the request, which skill it triggered, what data class the prompt and retrieved context carry, and how much budget that employee has left for the month. The router picks a side, the call runs, the result comes back through the same surface as any other answer. On the founder console you see a per-employee model breakdown, the cost per role, and the ratio of private to public calls, so the architecture is observable and not just claimed. If you want a specific employee to stay fully private (legal review, customer support over regulated data) you set a policy on that hire and the router respects it. Day to day, the founder reasons about people, not models.
On their consumer products, sometimes yes. On their API plans for businesses (OpenAI, Anthropic, Google) the standard contract says they do not use API inputs or outputs to train models by default. A hybrid setup adds a second layer: sensitive prompts never reach those APIs in the first place, so the contractual promise is reinforced by an architectural one.
On routine tasks (drafting, summarizing, classification, retrieval question answering) the best open-weight models running on dedicated hardware land close enough that most users cannot reliably tell them apart. On hard reasoning, complex code, and long multi-step planning, public frontier models still lead. The whole point of the hybrid pattern is to keep the private side on tasks where it actually competes.
In practice: anything that identifies a real person (names, emails, phone numbers, addresses), anything pulled from a customer record or internal document store, billing and payment details, credentials, and anything legal or contractual. The router treats all of that as pinned to the private side and logs every decision, so you can audit later.
The routing decision itself is fast (single-digit milliseconds) because it is a small policy check, not another LLM call. Total latency for the user is dominated by whichever model actually answers. In some cases private hosted models are faster than public APIs because there is no shared multi-tenant queue and the hardware is dedicated to your workload.
Yes. In Sistava you can pin an AI Employee to fully private (for regulated workloads where the policy answer is simpler than the engineering answer) or to a specific public model (for a role you want at the absolute frontier). The router respects the policy on every call from that employee and the audit log shows it.
If you want to go one level deeper on how the routing decision actually plays out across models and providers, the next read covers the model-routing layer itself: which models we route to, how we benchmark them against each other, and how we keep the routing honest as new frontier and open-weight models ship every few months. It is the engineering companion to the privacy and cost framing in this article.
The honest framing for hybrid private and public LLM architecture is this: it is not a clever trick to save money, and it is not a marketing label for a single-vendor product. It is the only sane way I have found to run a real AI workforce that has to be cheap enough to operate at solo-founder scale, accurate enough on hard work to be trusted, and respectful enough of customer data to be defensible if someone asks. Each side of the architecture solves a problem the other side cannot. Pinning sensitive work to the private side is what makes the architecture safe. Letting frontier work go to public models is what keeps the workforce useful. Routing each call by data class, skill, and budget is what holds the two sides together. If you build it once and instrument it properly, you stop arguing about which model is best this week and start running a workforce that quietly picks the right one on every call.