Guide by Mahmoud Zalt
A practical checklist for catching AI model regression, data drift, and integration errors before they reach customers, written for non-technical operators on Sistava.
Model regression is the quiet kind of breakage. The AI keeps responding, the dashboards stay green, and the customer-facing flows look fine. But the answers get a little vaguer, the tone drifts off-brand, the sales replies start missing context they used to catch, and the support agent quietly stops citing the right help-center article. Nothing throws an error. The model just gets worse at the job you hired it for. For a non-technical operator, the trap is that you only notice once a customer complains or a deal slips, and by then you have lost two or three weeks of work to outputs you cannot trust. The checklist below treats regression and drift as an operational signal, not a research problem: something you eyeball on a weekly cadence, the same way you would check email open rates or a billing dashboard, with clear thresholds for when to investigate and when to roll something back.
Five signals cover almost every real-world regression I have hit running AI Employees on my own business. First, eval set quality: a fixed list of 15 to 25 prompts that represent the work the AI does, scored on a simple rubric each week so you can compare like with like. Second, input drift: what your customers, leads, or inboxes actually look like this week versus last month, because if the inputs shift the model has to handle work it was not really tested on. Third, integration error rate: failed tool calls, broken webhooks, expired tokens, 4xx and 5xx replies from the apps the AI Employee uses. Fourth, latency: how long the answer takes end to end, since slow creep is often the first hint of a degraded provider. Fifth, human override rate: how often you or a teammate had to step in and rewrite the AI output. Each of these is a leading indicator that something has shifted before the customer notices.
Eval set quality: Run 15 to 25 fixed prompts each week, score them on a simple rubric, and compare scores over time.
Input drift: Compare this week's customer questions, lead profiles, or inbox content against last month's baseline.
Integration error rate: Track failed tool calls, expired tokens, 4xx and 5xx responses from the apps your AI Employee uses.
Latency: Watch end-to-end response time for slow drift, an early hint of degraded model providers.
Human override rate: Count how often a teammate had to rewrite AI output, a direct proxy for quality loss.
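If you (or a teammate who is comfortable with a little scripting) keep the weekly rubric scores somewhere structured, the like-for-like comparison behind the first signal can be sketched in a few lines of Python. This is a minimal sketch, not something Sistava ships: the prompt IDs, the 1-to-5 rubric scale, and the 0.5-point drop threshold are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical rubric scores (1 to 5) for the same fixed prompts, keyed by prompt ID.
last_week = {"refund-policy": 5, "pricing-reply": 4, "cold-lead-followup": 4}
this_week = {"refund-policy": 4, "pricing-reply": 2, "cold-lead-followup": 4}


def compare_eval_runs(previous, current, drop_threshold=0.5):
    """Compare two runs of the same eval set and flag a drop in the average score.

    drop_threshold is an assumed cut-off, not a recommended value.
    """
    # Only compare prompts present in both runs, so the comparison stays like with like.
    shared = sorted(set(previous) & set(current))
    if not shared:
        raise ValueError("the two runs share no prompts")
    prev_avg = mean(previous[p] for p in shared)
    curr_avg = mean(current[p] for p in shared)
    # Prompts whose individual score went down, worth reading by hand.
    worse = [p for p in shared if current[p] < previous[p]]
    return {
        "previous_avg": round(prev_avg, 2),
        "current_avg": round(curr_avg, 2),
        "dropped": prev_avg - curr_avg >= drop_threshold,
        "worse_prompts": worse,
    }


report = compare_eval_runs(last_week, this_week)
print(report)
```

The list of prompts that got worse matters more than the average: it tells you which two or three conversations to read first.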
You run it as a 30-minute Monday ritual, not as a monitoring platform. The goal is not perfect telemetry; it is a habit that catches the obvious cliffs before they cost you. Most operators try to skip straight to dashboards and Grafana panels, get overwhelmed, and stop. Five plain steps beat one fancy setup every single time. Each step takes a few minutes and gives you a yes-or-no signal. If two or more steps fail in the same week, you stop using the AI Employee in production for that task and dig in before customers notice. If only one step flickers, you note it, keep watching, and revisit on Friday. Treat the checklist as a flight check, not as a research project, and in my experience you will catch roughly 80 percent of real regressions inside the same week they happen rather than in the next billing cycle.
Inside Sistava, each AI Employee already exposes the override count, the integration error log, the latency trend, and the recent conversation history out of the box. The only piece you bring is the fixed eval set: a plain Google Doc with 15 to 25 prompts that you write once and keep replaying. That doc plus the built-in panels covers all five signals, and the whole sweep takes under half an hour on a quiet Monday morning. You are not building a monitoring stack; you are running a habit.
If the previous section made the workflow feel abstract, the practical version is one operator, one cup of coffee, one document, and one dashboard. The rest is discipline. Most regressions I have seen on real customer flows look like a slow Monday-over-Monday slide in eval scores, paired with one or two new integration errors that nobody triaged. The next section answers the question people actually ask once they start running this: which check matters most when something does go wrong, and how do you tell a real regression from a noisy week.
Single weeks lie. The output of any LLM is non-deterministic, real customer inputs swing wildly, and a holiday or a campaign can throw any one signal off by 20 percent. The way you tell a real regression from a noisy week is by looking at two things together: direction and persistence. Direction means at least two of the five signals moved in the same bad direction this week (lower eval scores plus higher override rate, or higher latency plus more integration errors). Persistence means the same pattern holds two weeks in a row, not just on a Monday after a bank holiday. If both direction and persistence are there, you are looking at a real regression and you should pause the affected workflows. If only one is there, you note it, keep the AI Employee running, and re-check next week. This single rule saves you from chasing every flicker and keeps you honest when something genuinely breaks.
Likely noise. Note it, keep running the AI Employee, and re-check next Monday before acting.
Possible regression. Investigate the related conversations and integration logs this week.
Confirmed regression. Pause the workflow, switch the AI Employee to a fallback, and dig into causes.
Almost always noise. Holidays and campaigns skew customer inputs in ways the eval set will not catch.
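The direction-plus-persistence rule is simple enough to write down as a tiny decision function. The sketch below is one possible reading of it, not an official Sistava feature: the signal names are made up for illustration, and "persistence" is interpreted here as at least one flagged signal also being flagged the previous week.

```python
# Hypothetical names for the five signals in the weekly checklist.
SIGNALS = ("eval_score", "input_drift", "integration_errors", "latency", "override_rate")


def classify_week(bad_this_week, bad_last_week):
    """Apply the direction-plus-persistence rule to two weeks of flagged signals.

    Each argument is a collection of signal names that moved the wrong way that week.
    """
    bad_now = set(bad_this_week) & set(SIGNALS)
    bad_prev = set(bad_last_week) & set(SIGNALS)

    # Direction: at least two of the five signals moved the wrong way this week.
    direction = len(bad_now) >= 2
    # Persistence (assumed reading): something flagged now was also flagged last week.
    persistence = bool(bad_now & bad_prev)

    if direction and persistence:
        return "pause"              # real regression: pause the affected workflows
    if bad_now:
        return "note and re-check"  # keep running, look again next week
    return "all clear"


print(classify_week({"eval_score", "override_rate"}, {"eval_score"}))
print(classify_week({"latency"}, set()))
```

Keeping the rule this small is deliberate: if the decision needs more than two inputs, it stops being something you can apply over coffee on a Monday.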
Integration errors are the single most common cause of what looks like model regression but is not. The model is fine: the AI Employee just lost its grip on a tool. Watch four classes of error closely. First, expired or revoked OAuth tokens for Gmail, Slack, HubSpot, or your CRM, because the AI keeps trying and silently fails. Second, schema or field changes inside connected apps (a renamed Stripe field, a moved Notion property) that quietly break a previously reliable tool call. Third, rate limit errors from the underlying LLM provider when traffic spikes, which often look like slow or empty responses rather than explicit failures. Fourth, third-party API outages, especially during weekday peak hours, which usually self-resolve but explain a bad Monday. Most AI Employee mistakes that get reported as regression turn out to be one of these four when you check the logs, which is why integration error rate is signal number three in the weekly checklist.
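If you can export the integration error log, a rough first pass at sorting entries into those four classes might look like the sketch below. The log format and the mapping from HTTP status codes to classes are assumptions: 401 and 429 have standard meanings (unauthorized, too many requests), but treating 400, 404, and 422 as "possible schema change" is only a heuristic, and real apps report these failures in different ways.

```python
from collections import Counter


def classify_error(status):
    """Map an HTTP status code to one of the four error classes (heuristic)."""
    if status == 401:
        return "expired_or_revoked_token"
    if status == 429:
        return "rate_limit"
    if status in (400, 404, 422):
        return "possible_schema_change"  # assumption: often a renamed or moved field
    if 500 <= status <= 599:
        return "third_party_outage"
    return "other"


def summarize(log):
    """Count failed calls (status 400 and above) per error class."""
    return Counter(classify_error(entry["status"]) for entry in log if entry["status"] >= 400)


# Hypothetical log entries; a real export will have more fields.
log = [
    {"app": "gmail", "status": 401},
    {"app": "crm", "status": 422},
    {"app": "llm_provider", "status": 429},
    {"app": "crm", "status": 503},
    {"app": "gmail", "status": 200},
    {"app": "gmail", "status": 401},
]

summary = summarize(log)
error_rate = sum(summary.values()) / len(log)
print(summary.most_common())
print(f"integration error rate: {error_rate:.0%}")
```

The class with the highest count is usually where to look first: a pile of token errors points at a reconnect, not at the model.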
Weekly is the right cadence for most non-technical operators. It is frequent enough to catch real regressions before customers notice and slow enough that you do not chase noise. Daily monitoring is only worth it for high-volume production workloads where a bad day costs real money.
You do not need a data science team. The five-signal checklist works without one: a non-technical operator can run it in under 30 minutes a week with a fixed eval set in a Google Doc and the built-in panels Sistava exposes on each AI Employee.
Model regression means the AI gets worse at the same task. Data drift means the inputs it sees have shifted (different customer questions, new product launch, seasonal change). Drift often causes apparent regression because the model is being asked to handle work outside its original baseline.
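One rough way to put a number on drift, assuming you tag a sample of incoming messages by topic each week, is to compare the topic mix against last month's baseline. The sketch below uses total variation distance (half the sum of the absolute differences in topic share), which is 0 when the mix is identical and 1 when there is no overlap. The topic names and counts are invented for illustration.

```python
def topic_shares(counts):
    """Turn raw topic counts into shares that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no messages counted")
    return {topic: n / total for topic, n in counts.items()}


def drift_score(baseline_counts, current_counts):
    """Total variation distance between two topic mixes (0 = same mix, 1 = no overlap)."""
    base = topic_shares(baseline_counts)
    curr = topic_shares(current_counts)
    topics = set(base) | set(curr)  # include topics that are new or have disappeared
    return 0.5 * sum(abs(base.get(t, 0.0) - curr.get(t, 0.0)) for t in topics)


# Hypothetical hand-tagged samples of customer questions.
baseline = {"billing": 40, "shipping": 40, "returns": 20}
this_week = {"billing": 20, "shipping": 30, "returns": 20, "new_product": 30}

score = drift_score(baseline, this_week)
print(f"drift score: {score:.2f}")
```

A new topic such as a product launch shows up directly in the score, which is the cue to add a few prompts for it to the eval set before judging the model on work it was never tested on.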
Start with 15 to 25 prompts that represent the work the AI Employee actually does. That is enough to surface a real regression without becoming a chore to score. Grow the set only when you find a category of failure you want to track over time.
When a regression is confirmed, pause the affected workflow, switch the AI Employee to a fallback (a simpler prompt, a different model, or a manual hand-off), and check the integration logs first, because most apparent regressions are tool failures rather than the model itself. Then re-run the eval set after the fix to confirm recovery before resuming production.
If you want to go one layer deeper on the integration side of the checklist (which apps to connect first, which OAuth flows tend to break, and how to harden tool calls so they fail loudly instead of silently), the companion guide on enterprise integration steps walks through the practical wiring. It is the piece I wish I had read before connecting my first AI Employee to a real CRM and inbox. Use it next once you have the weekly drift habit in place.
The honest framing on regression and drift: nobody catches every regression, and the operators who try to build a perfect monitoring stack almost always burn out before week three. The operators who win run a small, repeatable, 30-minute Monday checklist and treat the five signals as a flight check, not a research project. The cost of missing a regression is one bad week of outputs you have to apologize for, not a catastrophe. The cost of never checking is a slow drift away from the quality you originally hired the AI Employee for, the kind that loses you a customer six weeks later without a single error in any log. Pick the habit over the dashboard. Replay the eval set on Monday, sample real conversations, glance at integration errors and latency, count the overrides, and decide based on direction plus persistence. That is the entire checklist, and it has caught every real regression I have personally hit on Sistava in the last year.