Sistava

Hybrid Private and Public LLM Architecture for AI Employees

Guide — by Mahmoud Zalt

A hybrid private and public LLM architecture for AI Employees routes sensitive work to private models and high-skill work to public ones to balance accuracy, cost, and privacy.

What is a hybrid private and public LLM architecture?

A hybrid private and public LLM architecture is a workforce design where every AI Employee call is routed, at runtime, to either a privately hosted model or a public frontier API based on the type of work being done. The private side runs open-weight models (Llama, Mistral, Qwen, DeepSeek) on infrastructure you control, so the prompt never leaves your perimeter. The public side calls OpenAI, Anthropic, or Google when the task needs frontier-level reasoning, coding, or multi-step planning that smaller hosted models still cannot match. A routing layer decides which side handles which call using three signals: the data classification of the prompt, the skill ceiling the task needs, and the cost budget for that employee. Done well, the user never sees the routing, the sensitive data stays inside the private side, and the cost curve flattens because frontier calls are reserved for the few tasks that actually need them.

At a Glance

2 sides
Private hosted plus public frontier
3 signals
Data class, skill ceiling, cost budget
1 router
Single decision point per call
0 leakage
Sensitive prompts stay on private side

Why do AI Employees benefit from a hybrid setup instead of one model?

Single-model setups force a bad tradeoff on every call. Pick a public frontier model for everything and you pay frontier pricing on trivial tasks and pipe every prompt (including ones that contain customer names, contracts, or internal docs) to a third-party API. Pick a private hosted model for everything and you save money but cap the workforce at whatever the open-weight ceiling happens to be that quarter, which still trails the best public models on hard reasoning. A hybrid setup lets the cheap, safe path absorb the bulk of calls (summaries, drafts, classification, lookups, retrieval-augmented answers) while the expensive, capable path is reserved for the genuinely hard ones (complex planning, code generation, multi-document analysis). The AI Employee feels uniformly competent because the user never knows which side answered, and the founder feels the difference on the invoice and in the privacy posture.

Benefits

Cost matched to skill

Trivial calls go private and cheap. Frontier calls only when the task actually needs frontier skill.

Sensitive data contained

Anything touching customer records or internal docs is pinned to the private side by policy.

Quality stays near top

Hard work still gets frontier models, so the workforce ceiling matches whatever leads the leaderboard.

Resilience by default

If one provider has an outage, the router falls back to the other side without the user noticing.

Audit trail per call

Every call logs which model answered, why the router picked it, and what data classification it carried.

How does the routing layer actually decide?

The router is the part that earns its keep. In practice it is a small decision component sitting in front of every model call, not a separate AI Employee, because you do not want one more LLM thinking about the question before the real one answers. It reads three signals on every call and picks a destination. First, data classification: if the prompt contains anything tagged sensitive (customer PII, internal documents, billing data, credentials), the call pins to the private side and no public model sees it. Second, skill ceiling: tasks marked as reasoning-heavy, multi-step, or code-generation route to the strongest available public model unless the data class blocks it. Third, cost budget: each AI Employee has a monthly ceiling, and as it nears the cap, the router gradually shifts borderline calls from public to private until the budget resets. The decision happens in single-digit milliseconds and is fully logged.

  1. Tag the prompt — Inputs and retrieved context get a data classification: public, internal, sensitive, or restricted.
  2. Score the task — The router reads the skill the AI Employee called for (draft, summarize, plan, code, analyze).
  3. Check the budget — Each employee carries a monthly budget. The router knows how much room is left this period.
  4. Pick the side — Sensitive plus internal data pins private. High-skill plus public data goes to the strongest frontier model.
  5. Log the decision — Model, route, classification, cost, and latency are recorded for the audit and cost dashboards.

The architecture also has to handle the cases where the easy path is wrong. If a private model returns a low-confidence answer on a task that mattered, the router can escalate to the public side as a second pass, with the sensitive parts of the prompt scrubbed first. If a public call times out or rate-limits, the router falls back to the private side rather than failing the user. These are not exotic edge cases. They are the daily reality of running a workforce on top of someone else's API, and getting them right is the difference between an architecture and a slide deck.

Most founders evaluating this question are not asking it in the abstract. They want to know whether a hybrid setup actually changes the bill at the end of the month and whether their accountant, lawyer, or compliance officer will calm down once it is in place. The next two sections are the honest version of both answers, based on running the same architecture inside Sistava for real customers and on my own dogfood traffic.

How much does a hybrid LLM setup actually save?

Real savings depend on the shape of your workload, but the pattern is consistent. In my own logs at Sistava, roughly seventy to eighty percent of calls are routine work that a strong private model handles at near-parity with the frontier (drafting emails, summarizing threads, classifying tickets, retrieval question answering). Routing those to the private side typically cuts their per-call cost by an order of magnitude versus running them on a top public model. The remaining twenty to thirty percent of calls (real planning, hard reasoning, code, deep analysis) stay on public frontier models, which is where you want them. The net effect on a fully active AI Employee workload is a monthly bill that lands somewhere between thirty and sixty percent of the all-public number, with no measurable drop in quality on the work that actually shipped. The exact figure swings with your task mix, so treat that range as a starting hypothesis, not a guarantee.

Benefits

Drafting and rewriting

Email replies, doc edits, formatting. Easy work, big volume. Routes private. Saves most of the bill.

Retrieval QA

Answering from your docs with retrieved context. Private models do this well, no public exposure needed.

Complex planning

Multi-step plans, decomposing a vague goal into tasks. Frontier-only on public, no real shortcut yet.

Hard code or analysis

Generating non-trivial code or analyzing a messy dataset. Frontier public when data class allows.

What does this look like inside Sistava day to day?

Inside Sistava the hybrid setup is invisible to the founder using the product. You hire an AI Employee, you give it work, you see the answer. Underneath, every call goes through a model router that knows which AI Employee made the request, which skill it triggered, what data class the prompt and retrieved context carry, and how much budget that employee has left for the month. The router picks a side, the call runs, the result comes back through the same surface as any other answer. On the founder console you see a per-employee model breakdown, the cost per role, and the ratio of private to public calls, so the architecture is observable and not just claimed. If you want a specific employee to stay fully private (legal review, customer support over regulated data) you set a policy on that hire and the router respects it. Day to day, the founder reasons about people, not models.

Frequently asked questions

FAQ

Do public LLM providers actually use my prompts to train their models?

On their consumer products, sometimes yes. On their API plans for businesses (OpenAI, Anthropic, Google) the standard contract says they do not use API inputs or outputs to train models by default. A hybrid setup adds a second layer: sensitive prompts never reach those APIs in the first place, so the contractual promise is reinforced by an architectural one.

Is a private LLM as accurate as a frontier model?

On routine tasks (drafting, summarizing, classification, retrieval question answering) the best open-weight models running on dedicated hardware land close enough that most users cannot reliably tell them apart. On hard reasoning, complex code, and long multi-step planning, public frontier models still lead. The whole point of the hybrid pattern is to keep the private side on tasks where it actually competes.

What counts as sensitive data for the router?

In practice: anything that identifies a real person (names, emails, phone numbers, addresses), anything pulled from a customer record or internal document store, billing and payment details, credentials, and anything legal or contractual. The router treats all of that as pinned to the private side and logs every decision, so you can audit later.

Does the hybrid setup add latency?

The routing decision itself is fast (single-digit milliseconds) because it is a small policy check, not another LLM call. Total latency for the user is dominated by whichever model actually answers. In some cases private hosted models are faster than public APIs because there is no shared multi-tenant queue and the hardware is dedicated to your workload.

Can I force a specific AI Employee to use only one side?

Yes. In Sistava you can pin an AI Employee to fully private (for regulated workloads where the policy answer is simpler than the engineering answer) or to a specific public model (for a role you want at the absolute frontier). The router respects the policy on every call from that employee and the audit log shows it.

If you want to go one level deeper on how the routing decision actually plays out across models and providers, the next read covers the model-routing layer itself: which models we route to, how we benchmark them against each other, and how we keep the routing honest as new frontier and open-weight models ship every few months. It is the engineering companion to the privacy and cost framing in this article.

The honest framing for hybrid private and public LLM architecture is this: it is not a clever trick to save money, and it is not a marketing label for a single-vendor product. It is the only sane way I have found to run a real AI workforce that has to be cheap enough to operate at solo-founder scale, accurate enough on hard work to be trusted, and respectful enough of customer data to be defensible if someone asks. Each side of the architecture solves a problem the other side cannot. Pinning sensitive work to the private side is what makes the architecture safe. Letting frontier work go to public models is what keeps the workforce useful. Routing each call by data class, skill, and budget is what holds the two sides together. If you build it once and instrument it properly, you stop arguing about which model is best this week and start running a workforce that quietly picks the right one on every call.