Guide by Mahmoud Zalt
A practical guide to picking the right LLM by task (summarization, Q&A, tool-using agents) balancing cost, latency, context window, and quality.
Summarization is the cheapest task in the LLM stack, so the trap is overspending on a flagship when a mid-tier model would land the same output for a tenth of the price. The decision shape I use: input length first (does it fit in context), then cost per million tokens, then quality on long inputs, and only last does general reasoning matter. Long-context Gemini, Claude Haiku, and the smaller GPT tiers all do summarization fine for most business documents. The bigger lesson is that summarization rarely needs the smartest model in the room. What it needs is a model that does not hallucinate facts that were not in the source, follows your style guide, and handles your real input length without truncation. Test on your worst document, not your cleanest one. If the worst case survives, the average case is safe.
Q&A on private documents is the most misread task in the category, because people blame the model when the real failure is retrieval. The model only sees what your retriever hands it, so picking an LLM here is really picking the partner for your retrieval pipeline. You want a model that follows instructions tightly, refuses confidently when context is missing, and stays grounded inside the passages you injected instead of drifting into training data. Mid-tier instruct models from the top three labs do this well today. Where it gets subtle is multi-hop questions that require synthesizing across passages: that is where the smarter tier earns its price. The right default is a cheap grounded model for single-shot answers and an escalation path to a smarter model only when the retriever returned more than three passages and the question is comparative or multi-step.
Refuses or hedges when retrieved context does not actually contain the answer.
Handles 20+ retrieved passages without losing accuracy in the middle of the window.
Respects format, citation, and refusal instructions across hundreds of calls a day.
Says "I do not know" when context is missing, rather than confabulating a plausible answer.
Returns in under three seconds on your p95, because Q&A is a foreground UX moment.
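The escalation rule described above (cheap grounded model by default, smarter tier only when the retriever returned more than three passages and the question is comparative or multi-step) can be sketched roughly as follows. This is a minimal illustration, not a production classifier: the tier labels, the `RetrievalResult` type, and the keyword list are all hypothetical, and the keyword check is a crude stand-in for whatever question classifier you actually trust.

```python
from dataclasses import dataclass

# Hypothetical tier labels; substitute the model identifiers you actually use.
CHEAP_GROUNDED = "cheap-grounded"
SMARTER_TIER = "smarter-tier"

# Assumption: a keyword check stands in for a real question classifier.
MULTI_STEP_HINTS = ("compare", "versus", " vs ", "difference between", "and then")


@dataclass
class RetrievalResult:
    question: str
    passages: list[str]


def looks_comparative_or_multi_step(question: str) -> bool:
    # Pad with spaces so hints like " vs " also match at the edges of the question.
    padded = f" {question.lower()} "
    return any(hint in padded for hint in MULTI_STEP_HINTS)


def pick_tier(result: RetrievalResult, passage_threshold: int = 3) -> str:
    # Escalate only when BOTH conditions hold: many passages and a synthesis-style question.
    many_passages = len(result.passages) > passage_threshold
    if many_passages and looks_comparative_or_multi_step(result.question):
        return SMARTER_TIER
    return CHEAP_GROUNDED
```

Keeping both conditions in the rule means a comparative question answered from one or two passages still goes to the cheap model, which is where most of the savings come from.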
Tool-using agents are where most teams overspend and underperform. The agent loop calls the model many times per task, so any extra cost per call multiplies by ten or twenty by the time the task finishes. At the same time, the model needs reliable function calling, stable JSON output, and the judgement to stop when a step succeeds (rather than looping). The right pick favors a model the lab has explicitly tuned for tool use, not just chat. Honest names in the category today: Claude, GPT-4 class, and the better open weights for cost-sensitive routes. I rank tool reliability above raw IQ for this slot. A smarter model that returns malformed JSON one call in fifty wrecks the loop; a less impressive model with a 99.9% JSON success rate ships. Frameworks like CrewAI, LangChain, and n8n give you the wiring, but they do not pick the model for you.
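To see why a one-in-fifty malformed-JSON rate hurts so much, it helps to compound it over a whole task. The sketch below assumes each call fails independently and that a task makes twenty calls (the upper end of the range mentioned above); both are simplifying assumptions, and the function name is illustrative.

```python
def clean_run_probability(per_call_success: float, calls_per_task: int) -> float:
    """Probability that every call in a task returns well-formed JSON,
    assuming failures are independent across calls."""
    return per_call_success ** calls_per_task


CALLS_PER_TASK = 20  # assumption: upper end of "ten or twenty" calls per task

flaky = clean_run_probability(1 - 1 / 50, CALLS_PER_TASK)  # one malformed call in fifty
steady = clean_run_probability(0.999, CALLS_PER_TASK)      # 99.9% JSON success rate

print(f"1-in-50 failure rate: {flaky:.1%} of tasks finish with no malformed call")
print(f"99.9% success rate:   {steady:.1%} of tasks finish with no malformed call")
```

Under these assumptions, roughly two thirds of tasks run clean with the flakier model, against about 98% with the steadier one. Retries and output validation narrow that gap, but each retry is another paid call inside the loop.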
The reason this matters: most teams pick one model for everything and either overpay for summarization or under-deliver on agents. Routing per task type is the single highest-leverage decision in the stack. It is also the single most annoying thing to maintain by hand, because providers change pricing, deprecate models, and ship new tiers every few weeks. That overhead is exactly where a managed workforce platform starts to pay back: the routing decisions get made centrally and updated when the market shifts, instead of every team patching their own config files.
If you are evaluating models for a single function (say a content writer or a sales researcher), the cleanest test is to hire that role inside a workforce platform that already routes intelligently, then judge the output on real work for a week. You learn more about model fit in five real tasks than in fifty benchmark runs, because real inputs are messier than benchmark inputs and the right answer is rarely the cleverest one. The cheaper the test loop, the faster you converge.
Cost, latency, and context are three sliders that pull against each other. Push context up (long-context Gemini, large Claude window) and you pay more per token while latency drifts up too. Push cost down (Haiku, smaller GPT tiers, open weights) and you usually lose some accuracy or context room. Push latency down (streaming, smaller models, edge deployment) and you constrain which models qualify. The trick is to set the binding constraint first per task, then optimize the other two around it. For an in-app summarize button, latency binds: cap at two seconds and shop for the cheapest model that hits it. For an overnight research agent, cost binds: pick the cheapest model that produces acceptable output and let it run slow. For a long-document Q&A, context binds: the model must fit the document, then optimize cost and latency inside that constraint.
Latency binds. Cap at two seconds p95, pick the cheapest model that fits the budget.
Cost binds. Run slower if needed. Cap per-task spend instead of per-call spend.
Context binds. Document must fit in one window. Optimize cost inside that constraint.
Reliability binds. Tool-call success rate matters more than raw IQ or price per call.
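One way to express "set the binding constraint first, then optimize around it" is as a hard filter followed by a cost minimization. The sketch below is illustrative only: the candidate names and every number in the table are placeholders, not real pricing, latency, or benchmark figures, and the thresholds for context and tool reliability are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_mtok: float      # USD per million tokens (placeholder)
    p95_latency_s: float      # seconds (placeholder)
    context_tokens: int       # window size (placeholder)
    tool_success_rate: float  # share of well-formed tool calls (placeholder)


# Placeholder values for illustration only; measure your own before deciding.
CANDIDATES = [
    Candidate("model-small", 0.5, 1.2, 128_000, 0.990),
    Candidate("model-mid", 3.0, 2.5, 200_000, 0.999),
    Candidate("model-long", 5.0, 4.0, 1_000_000, 0.995),
]


def cheapest_meeting(constraint: Callable[[Candidate], bool]) -> Optional[Candidate]:
    """Apply the binding constraint as a hard filter, then minimize cost."""
    qualifying = [c for c in CANDIDATES if constraint(c)]
    return min(qualifying, key=lambda c: c.cost_per_mtok, default=None)


# Latency binds: in-app summarize button capped at two seconds p95.
summarize_pick = cheapest_meeting(lambda c: c.p95_latency_s <= 2.0)

# Context binds: the document (assumed here to be 300k tokens) must fit in one window.
long_doc_pick = cheapest_meeting(lambda c: c.context_tokens >= 300_000)

# Reliability binds: agents need a high tool-call success rate (threshold is an assumption).
agent_pick = cheapest_meeting(lambda c: c.tool_success_rate >= 0.999)
```

Returning `None` when nothing qualifies is deliberate: it surfaces the case where the binding constraint cannot be met at all, rather than silently falling back to a model that violates it.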
There is a real point where running your own model selection stops being interesting and starts being a tax. The signs: you are reading provider changelogs every week, you have a Slack channel for model deprecations, you are A/B testing prompts against three providers, or you are paying engineers to babysit token budgets instead of shipping features. At that point, a managed AI Employee platform earns its keep. The platform absorbs the routing decisions, eats the provider churn, and gives you a fixed per-month price instead of a meter that surprises you on the first of the month. Sistava does exactly this: each AI Employee gets the right model for its job, the routing updates centrally when new models ship, and your subscription covers it. Honest credit where it is due: Lindy, CrewAI, n8n, and LangChain each solve a slice of this problem with different trade-offs, and they are the right pick if you want full custody of the wiring.
No. Each task has a different binding constraint, so the right model is rarely the same. Summarization wants cheap and long-context. Q&A wants grounded and instruction-tight. Agents want reliable function calling. Routing per task type is the highest-leverage decision in the stack.
Measure the 95th percentile of your real inputs in tokens, then pick a model whose window is at least double that. The double buffer covers prompt overhead, system instructions, and retrieved passages without truncation. Most business documents fit comfortably under 200k tokens.
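As a rough sketch of that sizing rule, the function below takes measured input lengths in tokens and returns the minimum window to shop for. It assumes you have already counted tokens with your provider's tokenizer, and it uses a nearest-rank percentile for simplicity; the function name and the `buffer_factor` parameter are illustrative.

```python
import math


def required_window(token_counts: list[int], buffer_factor: float = 2.0) -> int:
    """Minimum context window: buffer_factor times the p95 input length."""
    if not token_counts:
        raise ValueError("need at least one measured input")
    ordered = sorted(token_counts)
    # Nearest-rank p95 using integer math: ceil(0.95 * n) without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return math.ceil(p95 * buffer_factor)


# Example: 100 documents ranging from 1k to 100k tokens.
sample = list(range(1_000, 101_000, 1_000))
print(required_window(sample))  # 190000
```

In this example the p95 input is 95,000 tokens, so the rule asks for a window of at least 190,000 tokens.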
No. For grounded Q&A and faithful summarization, mid-tier models often match or beat flagships because the task is constrained by the source material, not by model IQ. The flagship premium pays off on open-ended reasoning, multi-hop synthesis, and ambiguous instructions.
Run a 50-example eval on real production inputs, scored by a simple rubric (faithfulness, format, refusal). Two engineers can do this in a day. Avoid public benchmarks: they correlate weakly with how a model performs on your actual data.
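A minimal sketch of how the rubric scores from such an eval might be tallied, assuming each example is graded pass/fail (1 or 0) on the three criteria named above. The `ScoredExample` type and `summarize` function are hypothetical names, and the grading itself (by a person or a judge model) happens outside this snippet.

```python
from dataclasses import dataclass
from statistics import mean

RUBRIC = ("faithfulness", "format", "refusal")


@dataclass
class ScoredExample:
    example_id: str
    scores: dict[str, int]  # 1 = pass, 0 = fail, one entry per rubric criterion


def summarize(results: list[ScoredExample]) -> dict[str, float]:
    """Pass rate per criterion, plus the share of examples passing every criterion."""
    for r in results:
        missing = set(RUBRIC) - set(r.scores)
        if missing:
            raise ValueError(f"{r.example_id} is missing scores for {sorted(missing)}")
    report = {c: mean(r.scores[c] for r in results) for c in RUBRIC}
    report["all_pass"] = mean(int(all(r.scores[c] for c in RUBRIC)) for r in results)
    return report
```

The `all_pass` figure is usually the one to compare across models, since a response that is faithful but ignores the format or refusal instructions still fails in production.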
Because picking the right model per task, per Employee, per workflow, and keeping that map current as providers ship and deprecate models, is full-time work. Sistava routes centrally so each AI Employee gets the right model for its job and you get one flat subscription instead of a metered bill.
The shortest version of this entire piece: model selection is a routing problem disguised as a shopping problem. The teams that win are not the ones who picked the smartest model. They are the ones who picked the right model per task and stopped paying for the wrong job. If you are running this stack yourself, the next read covers how I actually staff a team of AI Employees on top of that routing, role by role, with the playbook I use on my own business.
If you are still in the picking phase and not yet building, the most useful thing you can do this week is define your task shapes in writing: what are the three jobs you actually want an LLM to do, what is the worst input each one will face, and what latency budget each one carries. Once that document exists, model selection becomes a matter of matching, not a matter of guessing. The picks change every few months as the labs ship, but the shape of the decision does not. Build the routing once, revisit it quarterly, and stop reading leaderboards. If you want the routing built for you, with each AI Employee already pointed at the right model and the bill capped at a flat monthly subscription, that is exactly the slot Sistava fills. Either path works. The expensive path is owning none of those decisions and discovering the wrong default on your worst-case input in production.