Guide by Mahmoud Zalt
A practical conversational AI architecture for low-latency SaaS pairs streaming inference with a live status page so customers feel speed and trust uptime.
Most teams underestimate how many hops sit between a user's keystroke and the first streamed token. A typical SaaS chat path crosses the browser, a websocket, a backend dispatcher, a memory lookup, a tool-routing layer, and the model provider, then travels all the way back. Each hop adds ten to one hundred milliseconds, and the user perceives the total as one number: slow or fast. Latency is therefore not a single optimization; it is a budget you spread across every layer. I have rebuilt this pipeline three times for Sistava and the lesson is the same every round: cut the slowest hop first, measure again, and resist the temptation to micro-optimize the parts that already feel instant. Users now expect the first token within 500ms of sending a message, and anything much past a second starts to read as broken even when the answer is correct.
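One way to make the budget idea concrete is to write it down per hop and compare the total with the target. The sketch below is illustrative only: the hop names follow the path described above, but every millisecond figure is an assumed placeholder, not a measurement from Sistava, and `summarize` is a hypothetical helper.

```python
# Hypothetical per-hop budget in milliseconds; all numbers are assumptions.
BUDGET_MS = {
    "browser": 20,
    "websocket": 30,
    "dispatcher": 50,
    "memory_lookup": 80,
    "tool_routing": 40,
    "model_first_token": 250,
}

TARGET_MS = 500  # first-token target discussed in the text


def summarize(budget: dict[str, int], target_ms: int) -> dict:
    """Return the total budget, the remaining headroom, and the largest hop."""
    total = sum(budget.values())
    largest = max(budget, key=budget.get)
    return {
        "total_ms": total,
        "headroom_ms": target_ms - total,
        "largest_hop": largest,
    }
```

Calling `summarize(BUDGET_MS, TARGET_MS)` shows how much headroom is left before the target is blown. A negative headroom means at least one hop has to give time back.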
The shape that works in production has five layers that each own one job and refuse to leak into the next. A streaming gateway accepts the connection and holds it open. A dispatcher decides which agent or employee should handle the message. A router picks the model that is cheapest and fast enough for the task. A memory layer pulls only the snippets that matter, never the whole history. A tool layer executes side effects with timeouts that respect the user's patience. The mistake I see most often is bolting memory and tools into the dispatcher, which turns a 200ms decision into a 1.5s wait. Keep each layer thin, asynchronous, and replaceable. The whole stack should fit on one whiteboard, and any layer should be swappable without rewriting the one above or below it. That is what makes it survive a model change six months later.
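The router's job, picking the cheapest model that is still fast enough, can be sketched in a few lines. The model names, prices, and latency figures below are invented for illustration, and `pick_model` is a hypothetical helper rather than Sistava's actual router.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelOption:
    name: str                  # hypothetical model identifier
    cost_per_1k_tokens: float  # assumed price, arbitrary units
    p50_first_token_ms: int    # assumed recent median first-token latency
    healthy: bool = True       # set to False when the provider degrades


def pick_model(options: list[ModelOption], max_first_token_ms: int) -> Optional[ModelOption]:
    """Return the cheapest healthy model within the latency limit, or None."""
    viable = [
        m for m in options
        if m.healthy and m.p50_first_token_ms <= max_first_token_ms
    ]
    return min(viable, key=lambda m: m.cost_per_1k_tokens, default=None)


OPTIONS = [
    ModelOption("small-fast", 0.2, 180),
    ModelOption("medium", 0.6, 320),
    ModelOption("large", 3.0, 700),
]
```

Because the function only reads a list of options, the health flag and latency numbers can be refreshed by a separate process without touching the dispatcher, which is one way to keep the layers swappable.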
Streaming gateway: Websocket or SSE endpoint that holds the connection open and forwards tokens the instant they arrive from the model.
Smart model router: Picks the cheapest viable model per task, falls back automatically when a provider slows down.
Lean memory layer: Vector plus graph search that returns only the few snippets the agent actually needs, never the whole history.
Async tool layer: Tool calls run in parallel with strict timeouts, so one slow integration cannot freeze the whole reply.
Public status page: A live page showing model latency, queue depth, and provider health so users can self-verify the speed they feel.
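The tool-layer behavior described above, parallel calls with a strict timeout each, might look like the following with Python's `asyncio`. This is a minimal sketch: `run_tools`, `fast_tool`, and `slow_tool` are hypothetical names, and the timeout value is an assumption to be tuned per product.

```python
import asyncio
from typing import Awaitable, Callable

ToolFn = Callable[[], Awaitable[str]]


async def run_tools(tools: dict[str, ToolFn], timeout_s: float) -> dict[str, str]:
    """Run all tools concurrently; a tool that exceeds timeout_s yields a marker."""

    async def guarded(name: str, fn: ToolFn) -> tuple[str, str]:
        try:
            # wait_for cancels the tool call once the timeout elapses.
            return name, await asyncio.wait_for(fn(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return name, "timed out"

    pairs = await asyncio.gather(*(guarded(n, f) for n, f in tools.items()))
    return dict(pairs)


async def fast_tool() -> str:
    await asyncio.sleep(0.01)  # stands in for a quick integration
    return "ok"


async def slow_tool() -> str:
    await asyncio.sleep(5)  # stands in for an integration that hangs
    return "never returned in time"
```

Running `asyncio.run(run_tools({"crm": fast_tool, "billing": slow_tool}, timeout_s=0.1))` returns after roughly the timeout rather than after five seconds. Errors other than timeouts are not handled here and would propagate to the caller.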
Latency work pays back the most when you measure first, cut the single slowest hop, then measure again. The recipe I use on Sistava is unglamorous on purpose, because it survives every model change and every provider outage. Start by tracing one real conversation end to end with timestamps at every layer. The slowest hop is almost never the model itself. It is usually a memory query that pulls too much, a synchronous tool call that should be async, or a websocket reconnect that nobody noticed. Fix that one hop, ship, then re-trace. Five rounds of this beat one big rewrite. Do not chase percentiles on day one; chase the median user. The p99 chase is a trap that consumes weeks and changes the median by ten milliseconds. This progression is what worked when I cut Sistava's median first-token time roughly in half over a quarter.
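Tracing a conversation with timestamps at every layer does not need heavy tooling to start. A minimal sketch, assuming a single process and using only the standard library, is shown here. `Trace` and the hop names are hypothetical, and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects elapsed milliseconds per named hop for one conversation turn."""

    def __init__(self) -> None:
        self.spans: dict[str, float] = {}

    @contextmanager
    def hop(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the hop raised an exception.
            self.spans[name] = (time.perf_counter() - start) * 1000.0

    def slowest(self) -> str:
        return max(self.spans, key=self.spans.get)


def demo() -> Trace:
    trace = Trace()
    with trace.hop("memory_lookup"):
        time.sleep(0.05)  # placeholder for an oversized memory query
    with trace.hop("dispatcher"):
        time.sleep(0.01)  # placeholder for a quick routing decision
    return trace
```

`demo().slowest()` names the hop to fix first. In a real deployment the spans would cross process boundaries, so a distributed tracing system would replace this class, but the loop of measuring, fixing one hop, and re-tracing stays the same.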
The unspoken part of low-latency work is the trust layer around it. A faster reply only matters if the user believes the system is reliable enough to depend on. That belief comes from a public status page they can check at 2am when something feels off. Without that artifact, every slow second turns into a private worry, and worried users churn quietly long before they ever file a ticket. The next two pieces, status monitoring and human-feeling speed, are linked in ways most teams underestimate.
Once the stack is shaped and the slowest hop is gone, the question shifts from how fast the system runs to how confidently users can rely on it during the hours they are not watching. That confidence is built by a few small artifacts that signal the system is monitored, healthy, and recovers on its own. Without them, a user's first slow reply becomes a story they tell themselves about the whole product. The next two sections cover the signals that scale trust alongside speed.
A status page is not a marketing artifact; it is part of the conversational AI architecture itself. When a user feels a slow reply, the next thing they do is open a second tab and look for confirmation that the system is okay. If you do not give them a status page, they make up their own story, and the story is almost always worse than reality. A live page that publishes latency by model, queue depth, provider health, and the last incident lets a user verify the system the same way they verify their bank balance. The honest detail: build the status page before you start chasing tail latency. It costs a week of work and saves months of support tickets. Sistava's status page runs independently of the main cluster on purpose, so it stays online even when the platform has a real incident. That independence is what makes the signal trustworthy at the worst possible moment.
Status page runs outside the main cluster so it survives the incident it is reporting on, not next to it.
Show real first-token and full-token times per model so users see speed claims backed by data.
Expose the upstream providers so users understand which slowdowns are local versus external.
Past incidents with root cause and fix, not vague apologies, build trust faster than any uptime number.
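The signals listed above can be gathered into one small payload that a separately hosted status page reads. The shape below is an assumption for illustration, not Sistava's actual schema. The field names, `StatusSnapshot`, and `render` are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StatusSnapshot:
    first_token_ms_by_model: dict[str, float]  # median time to first token
    full_reply_ms_by_model: dict[str, float]   # median time to full reply
    queue_depth: int                           # requests waiting right now
    provider_health: dict[str, str]            # e.g. "operational", "degraded"
    last_incident: str                         # short root cause and fix


def render(snapshot: StatusSnapshot) -> str:
    """Serialize the snapshot to JSON for the status page to fetch."""
    return json.dumps(asdict(snapshot), sort_keys=True)


EXAMPLE = StatusSnapshot(
    first_token_ms_by_model={"small-fast": 210.0},
    full_reply_ms_by_model={"small-fast": 1900.0},
    queue_depth=3,
    provider_health={"provider-a": "operational"},
    last_incident="Placeholder: memory index rebuild slowed lookups; fixed by caching.",
)
```

Keeping the payload this small makes it practical to publish it to storage outside the main cluster, so the page can keep serving the last known snapshot during an incident.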
Speed is necessary but not sufficient for a conversation that feels human. The other half is rhythm: the small pauses where a thinking indicator appears, the confidence with which the first sentence streams, the way the agent acknowledges the request before doing the work. A 700ms first-token reply with a clean acknowledgement feels faster than a 400ms reply that jumps straight into a wall of text. Build the streaming UI to show that the agent is thinking, name the tool it is using, and stream the answer in shaped chunks instead of a single burst. Users forgive latency they can see being spent on something. They do not forgive a frozen UI of any duration, because a frozen screen reads as broken even when the answer is on its way. The architecture choice that supports this is to push partial state forward on the websocket continuously, even when no tokens are flowing yet.
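Pushing partial state before any tokens exist can be modeled as a single ordered event stream that the gateway forwards to the client. The sketch below is transport-agnostic and rests on assumptions: the event types (`status`, `tool`, `token`, `done`) are invented for illustration, and `fake_model_tokens` stands in for a real provider stream.

```python
import asyncio
from typing import AsyncIterator


async def fake_model_tokens() -> AsyncIterator[str]:
    """Stand-in for a model stream; a real provider client would go here."""
    for tok in ["Hello", ",", " world"]:
        await asyncio.sleep(0.01)
        yield tok


async def reply_events(tool_name: str) -> AsyncIterator[dict]:
    """Yield UI events in order: acknowledgement, tool notice, then tokens."""
    yield {"type": "status", "text": "thinking"}  # sent before any token exists
    yield {"type": "tool", "name": tool_name}     # names the tool being used
    async for tok in fake_model_tokens():
        yield {"type": "token", "text": tok}
    yield {"type": "done"}


async def collect(tool_name: str) -> list[dict]:
    """Drain the stream into a list; a gateway would send each event instead."""
    return [event async for event in reply_events(tool_name)]
```

The client can render the `status` and `tool` events immediately, so the screen changes within the first network round trip even if the model has not produced anything yet.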
Target a median time to first token under 500ms and a full reply under three seconds for short answers. Anything above 1.5 seconds for first token reads as broken to most users, even when the final answer is correct.
No. Hosted providers like OpenAI, Anthropic, and OpenRouter routinely beat self-hosted setups on first-token latency because their infrastructure is closer to the user. Self-hosting wins on cost at high volume and on privacy for regulated workloads, not on raw speed.
The status page sits outside the main service path, ideally on a different cloud and different domain, so it can report on incidents without being affected by them. Treat it as a peer service that reads metrics, not a feature of the main app.
Use server-sent events for one-way streaming if your stack supports it. Websockets give you bidirectional control and tool-call updates, but they cost more to operate. Most conversational AI products start with SSE and migrate to websockets when they need tool transparency.
Sistava ships the streaming gateway, smart model router, lean memory layer, async tool layer, and a public status page as one platform. Founders subscribe and get the architecture configured, instead of stitching five providers together and maintaining the seams themselves.
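For teams that start with the server-sent events option mentioned above, the wire format is simple enough to produce by hand. A minimal sketch follows, assuming JSON payloads. `sse_frame` is a hypothetical helper, and the event names are illustrative.

```python
import json


def sse_frame(event: str, payload: dict) -> str:
    """Format one server-sent event: an event line, a data line, a blank line."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    data = json.dumps(payload)
    return f"event: {event}\ndata: {data}\n\n"
```

For example, `sse_frame("token", {"text": "Hi"})` produces an `event: token` line followed by a `data:` line carrying the JSON and a blank line that ends the frame. The response also needs the `text/event-stream` content type, which is set in whatever web framework serves the endpoint.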
If you want to go a layer deeper on the model side of this architecture, the next read is the practical companion. It covers how to mix a private and a public LLM behind a single routing layer so you keep speed where it matters and cost where it matters more, without the user noticing which model is answering. That hybrid setup is what lets a conversational SaaS hold both a low latency promise and a sane unit economics story at the same time, and it is the missing half of most architecture writeups.
The honest summary of this whole topic is that low-latency conversational AI is not a single trick; it is a set of small decisions that compound. Streaming at the edge, routing by task, leaning out memory, parallelizing tools, and publishing a status page each take about a week to ship and pay back for years. Pick the slowest hop on a real user's session, fix it, measure, and ship. Do that five times and the median user will notice the difference without you ever announcing a speed update. The teams that win on conversational latency are not the ones with the fanciest infra. They are the ones with the shortest distance between a user complaint and a code change. Build for that distance first, and the architecture you end up with will feel inevitable in hindsight.