llm evaluation for middleware: beyond "gpt says it's okay" and "the team thinks it's fine"
on monday morning, your platform team finally ships a new feature: an "internal research brief generator" (an llm wrapper by any other name) that turns long reports into two tight paragraphs for product managers.
you hook it up to the model your stack already knows—gpt—because that's what everything else in your middleware talks to. the demos look great. people use it. slack reacts are positive. big green check.
two weeks later, finance drops the inevitable question:
"do we really need the expensive model for this? can we route to something cheaper or local?"
you do the sensible thing. you take a handful of real docs, run them through three options:
- the premium model you're using now,
- a cheaper hosted model,
- and a local model infra's been itching to justify.
you paste the outputs into a doc, share it around, and everyone scrolls.
"honestly, they all seem okay."
"i like b's tone more, but a is fine."
"hard to say without looking at a lot more…"
the meeting times out. nobody wants to flip a production workflow based on vibes and a few screenshots, so you default to safety:
"let's leave it on the current model for now and revisit later."
later never really comes.
you're not doing anything wrong. you're just stuck with the usual combo:
- the model you integrated first, plus
- whoever had time to skim some outputs.
this post is about a way out of that rut that's realistic for middleware and tooling teams: internal llm platforms, research agents, and apps that orchestrate multiple models behind one api.
no research lab. no giant leaderboard. just a clever council of models and the basic functionality your tools need to support that.
tl;dr
- the usual pattern: you pick a model you know, tweak prompts until outputs "look fine," and move on.
- the better pattern: for each workflow, keep a tiny set of real examples and let a small council of 3–5 models judge blind a/b comparisons between different model outputs.
- what your tools need: a place to store eval sets, run multiple models, call judge models blind, and see simple win-rates + examples, so comparing llms becomes part of your middleware, not a one-off spreadsheet.
01 where things actually go wrong: familiarity + "looks fine"
two forces tend to dominate: the "looks fine" test, and familiarity with whichever model you integrated first.
start with "looks fine." the eval loop is usually something like:
1. run 10–20 real docs through the new setup.
2. have a pm, engineer, or domain expert read the summaries.
3. fix the worst failures: hallucinations, missing big points, weird tone.
4. if there are no obvious disasters, ship it.
sometimes you'll involve an llm to rate or label outputs, but it's usually seasoning. the real gate is still "do these examples feel okay to us?"
that's fine for early prototyping. it's less fine when:
- you're running dozens of workflows through one platform,
- you're trying to justify model and vendor spend,
- or you're mixing premium, mid-tier, and local models.
you don't need a research group. you do need something a bit more structured than "we eyeballed a doc."
then there's familiarity. when a new workflow shows up, the story is almost always:
1. wire it to the model your platform already uses (gpt, claude, gemini, a local model, etc.)
2. iterate on the prompt until the outputs are "good enough"
3. run the output through another instance of the same model to validate
4. maybe try one other model. if it isn't obviously better, move on
from that point on, your mental picture of "good" is calibrated to that first model. your infrastructure gets locked into it, and every other model's output is judged relative to it, not against the actual job the workflow is supposed to do.
your middleware quietly takes on the personality of the first serious model you integrated.
02 a different mental model: a small council of models
you don't need to know which model is "best" in the abstract. you need to know, for this workflow:
"which model + prompt combo does a better job on the type of inputs we actually see?"
one useful pattern:
- keep a small eval set of real inputs for the workflow,
- run two or three candidate models over it,
- let a small council of judge models do blind a/b comparisons between the outputs,
- and review the win-rates and the most interesting disagreements yourself.
still human-in-the-loop. still shaped by your goals. just less guesswork.
the rest of this post is: what your tools need to make that easy.
03 functionality you need for a small llm council
create a small evaluation set per workflow
first capability: your platform or ide should let you capture an evaluation set per workflow and iterate on it across models, because everything begins with that initial prompt and the inputs it actually sees.
for the brief generator, that looks like:
- a set of prompts, or a logical prompt chain, aimed at a research question,
- an initial hypothesis or supporting documents ready for upload,
with a tool that lets you:
- run your prompts and criteria across any llm you have access to,
- collect the answers at each step of the research journey,
- and cross-reference and validate the outputs.
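a minimal sketch of what that could look like, assuming a generic complete(model, prompt) helper for whatever gateway your middleware already exposes; every name here is illustrative, not a real api:

```python
import json

# hypothetical helper: route a prompt to any model behind your middleware.
# replace the body with whatever client or gateway your platform actually exposes.
def complete(model: str, prompt: str) -> str:
    return f"[stub output from {model}]"

# a tiny eval set for one workflow: real inputs, stored and versioned with your code.
eval_set = [
    {"id": "doc-01", "document": "full text of a real report goes here"},
    {"id": "doc-02", "document": "another real report goes here"},
]

PROMPT = (
    "summarize the following report into two tight paragraphs "
    "for a busy product manager:\n\n{document}"
)

# models under test: your current default plus the cheaper/local options to compare.
CANDIDATES = ["premium-model", "cheap-hosted-model", "local-model"]

# one output per (input, model), so the judge council can compare them later.
outputs = []
for example in eval_set:
    for model in CANDIDATES:
        outputs.append({
            "input_id": example["id"],
            "model": model,
            "output": complete(model, PROMPT.format(document=example["document"])),
        })

with open("research_brief_outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)
```

nothing clever here; the point is that the eval set and the per-(input, model) outputs live somewhere durable, not in a one-off notebook.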
call a small council of judge models, blindly
now the "council" part. you want your tools to be able to:
- take pairs of candidate outputs,
- hide which model produced which,
- ask judge models to pick between a and b,
- and record the results.
concretely, your tool should let you:
- configure a judge council: 3–5 models that will act as critics.
- generate a/b pairs for each input: shuffle which model is "a" and which is "b," and ensure anonymity: no model names in the judging prompt.
- frame the judgment around the workflow, e.g. "which brief is better for a busy pm?
  • does it capture the main points?
  • is it faithful to the document?
  • does it follow the requested length and format?
  • is it clear and easy to read?"
- call each judge model on some set of pairs.
- store: judge, input, a vs b, and explanation.
this is where tooling pays off: doing this by hand or in notebooks gets old fast.
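a rough sketch of the pairing-and-judging step, reusing the hypothetical complete() helper and the outputs rows from the earlier sketch; the judge names and prompt are placeholders:

```python
import itertools
import json
import random

# hypothetical judge council; use whatever critic models your gateway can reach.
JUDGES = ["judge-model-1", "judge-model-2", "judge-model-3"]

JUDGE_PROMPT = """you are comparing two research briefs written from the same document.
pick the one that better serves a busy product manager.

brief a:
{a}

brief b:
{b}

answer as json: {{"winner": "a" or "b", "reason": "..."}}"""


def judge_pairs(outputs, complete):
    """outputs: rows of {"input_id", "model", "output"};
    complete: the same hypothetical complete(model, prompt) helper as above."""
    # group candidate outputs by input so different models are compared on the same doc
    by_input = {}
    for row in outputs:
        by_input.setdefault(row["input_id"], []).append(row)

    results = []
    for input_id, rows in by_input.items():
        for left, right in itertools.combinations(rows, 2):
            # shuffle which candidate is shown as "a" vs "b"; never reveal model names
            a, b = random.sample([left, right], k=2)
            prompt = JUDGE_PROMPT.format(a=a["output"], b=b["output"])
            for judge in JUDGES:
                raw = complete(judge, prompt)
                try:
                    verdict = json.loads(raw)
                except ValueError:
                    continue  # real code would log/retry unparseable judge replies
                results.append({
                    "input_id": input_id,
                    "judge": judge,
                    "model_a": a["model"],  # kept for analysis, never shown to the judge
                    "model_b": b["model"],
                    "winner": a["model"] if verdict.get("winner") == "a" else b["model"],
                    "reason": verdict.get("reason", ""),
                })
    return results
```

the two properties that matter: the shuffle (so position doesn't leak a pattern) and the fact that model names only live in the stored results, never in the judge prompt.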
show patterns and let humans sanity-check
finally, your tool needs a way to show patterns instead of raw text dumps.
for each workflow and eval run, you want:
- per-model win-rates: e.g., "model b wins 67% of blind comparisons vs model a on this eval set."
- agreement across judges: are all judge models roughly aligned, or is one consistently contrarian?
- interesting examples: inputs where one model won by a wide margin, judge explanations were strong, or judges disagreed.
in the ui, that might look like:
- a table: model → overall win-rate → notes (e.g., "strong on long docs").
- a list of "cases to inspect," each clickable to reveal: the input, all models' outputs, judge rationales.
humans still make the call. the tool just gets them to the right examples faster, with some basic stats to back the conversation.
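the aggregation itself is small; a sketch over the results rows from the judging sketch above (same illustrative field names, not a fixed schema):

```python
from collections import Counter, defaultdict


def summarize(results):
    """turn judge results into per-model win-rates plus a list of
    cases where the council disagreed and a human should look."""
    # per-model win-rates across all blind comparisons
    wins, appearances = Counter(), Counter()
    for r in results:
        wins[r["winner"]] += 1
        appearances[r["model_a"]] += 1
        appearances[r["model_b"]] += 1
    win_rates = {m: wins[m] / appearances[m] for m in appearances}

    # judge agreement: for each (input, model pair), did all judges pick the same winner?
    votes = defaultdict(list)
    for r in results:
        pair = tuple(sorted((r["model_a"], r["model_b"])))
        votes[(r["input_id"], pair)].append(r["winner"])
    disagreements = [key for key, winners in votes.items() if len(set(winners)) > 1]

    return win_rates, disagreements
```

win_rates drives the "model b wins 67% of blind comparisons" table; disagreements is the "cases to inspect" list.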
04 guardrails so this stays practical
you can bake three simple rules into your design and prompts to keep this sane.
rule 1: use a council, not a single judge. your tools should strongly encourage multiple judge models; 3–5 is usually enough.
- don't offer a "single judge" default.
- make the "council" experience first-class: a small list of critic models, not one magic referee.
this reduces the risk that your entire platform ends up inheriting one model's quirks about tone, length, or style.
rule 2: keep comparisons blind and pairwise. the tooling should:
- strip model names from a/b pairs.
- fix the judging frame to pairwise ("a vs b for this input"), not isolated scoring.
- make it hard to accidentally leak model identity into the judge prompt.
that's how you end up with usable stats like:
"on this workflow, the mid-tier model's answers win about 60% of blind comparisons against our default."
…instead of "the judge liked the one from model x more," which is basically brand bias.
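one cheap way to lean into the "don't leak identity" rule is to scrub obvious self-references from candidate outputs before pairing them; a heuristic sketch, not a guarantee:

```python
import re

# names of the models under test plus common self-references; extend for your own stack.
IDENTITY_PATTERNS = [
    r"\bgpt[-\w.]*",
    r"\bclaude[-\w.]*",
    r"\bgemini[-\w.]*",
    r"\bas an ai (language )?model\b",
]


def scrub_identity(text: str) -> str:
    """heuristically remove hints about which vendor/model wrote the text
    before it goes into an a/b judging prompt."""
    for pattern in IDENTITY_PATTERNS:
        text = re.sub(pattern, "[model]", text, flags=re.IGNORECASE)
    return text
```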
rule 3: define judging criteria per workflow. your tool should let you:
- define judging prompts per workflow, not globally.
- capture the things that matter for that workflow: for summaries, task success, faithfulness, and clarity; for code, correctness, safety, and style; for support, tone, accuracy, and actionability.
a reasonable pattern is: a shared base judging template, plus workflow-specific guidance ("busy pm", "non-technical user", "internal sre audience"), stored alongside the eval set.
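a sketch of that pattern, with made-up workflow names and wording you'd obviously tune per team:

```python
# shared base template; {workflow_guidance}, {a}, {b} get filled in per run
BASE_JUDGE_TEMPLATE = """you are comparing two candidate outputs produced from the same input.
{workflow_guidance}

output a:
{a}

output b:
{b}

pick the better one and explain briefly. answer as json: {{"winner": "a" or "b", "reason": "..."}}"""

# workflow-specific guidance, stored alongside each eval set
WORKFLOW_GUIDANCE = {
    "research-brief": "judge as a busy pm: task success, faithfulness to the source, clarity, two-paragraph format.",
    "code-review": "judge as an engineer: correctness, safety, and style of the suggested change.",
    "support-summary": "judge as a non-technical user: tone, accuracy, actionability.",
}


def judge_prompt(workflow: str, a: str, b: str) -> str:
    return BASE_JUDGE_TEMPLATE.format(
        workflow_guidance=WORKFLOW_GUIDANCE[workflow], a=a, b=b
    )
```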
05 why this functionality is worth building in
all of this boils down to a few very practical payoffs.
cleaner model and vendor choices
with per-workflow councils, you can say things like: "for research-brief, the cheaper hosted model wins ~65% of blind a/bs vs our default across 30 real docs. the default still wins on extremely long inputs, so we'll keep it as a fallback." that's much easier to defend than "we tried it and it looked okay."
real cost–quality trade-offs
instead of arguing in the abstract about "cheaper vs better," you can:
- move some workflows to cheaper or local models when the council says they're effectively equivalent,
- and keep premium models where they really earn their cost, with examples to show.
your tools don't need to make the decision, but they should make the trade-offs visible.
fewer 'something feels off' regressions
when you upgrade a model or tweak a prompt without a test bench, regressions usually show up as vague complaints. with eval sets and councils baked into your middleware tooling, you can:
- re-run evals before rollout,
- see where new behavior changes things, for better or worse,
- and roll back or adjust with eyes open.
a story for internal users and customers
if your platform is multi-tenant, serving multiple teams or external customers, this functionality gives you a solid answer to:
- "how do you choose which models to support?"
- "how do you test before you switch?"
- "how do you avoid hard vendor lock-in?"
you can point to the eval sets and council runs per workflow, instead of hand-waving about "experiments we did a while ago."
06 how this fits into an ide or internal console
putting this together, a minimal "evaluation" experience in your tool might have:
eval sets
per workflow (e.g., research-brief, support-summary, code-review). 20–50 real examples each, stored and versioned.
models under test
a simple way to select which llms to compare for that workflow. for each eval run, the tool stores outputs per (input, model).
judge council
3–5 models that act as critics. configurable judging prompts per workflow. automatic a/b pairing and anonymization.
run evaluation
one button / api call to: generate outputs, run blind a/b judging, aggregate results.
results view
win-rates per model. interesting examples with side-by-side outputs. optional breakdowns (e.g., long vs short inputs).
that's enough functionality to move you from "we eyeballed it" to "we have a repeatable way to compare llms for each workflow we own."
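if you squint, the whole section fits in one per-workflow config; a hypothetical shape, not any real product's schema:

```python
# roughly what a per-workflow evaluation config could capture; every key here is illustrative.
EVALUATION_CONFIG = {
    "workflow": "research-brief",
    "eval_set": "eval_sets/research_brief.json",   # 20–50 real, versioned examples
    "models_under_test": ["premium-model", "cheap-hosted-model", "local-model"],
    "judge_council": {
        "models": ["judge-model-1", "judge-model-2", "judge-model-3"],
        "judging_prompt": "prompts/research_brief_judge.txt",  # per-workflow criteria
        "anonymize": True,                          # blind a/b pairing, no model names
    },
    "results": {
        "win_rates": True,
        "breakdowns": ["input_length"],             # e.g., long vs short inputs
        "flag_disagreements": True,
    },
}
```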
07 if you want to start this week
you can prototype the pattern even before your tooling fully supports it:
1. pick one workflow: the one that gets the most "do we really need this expensive model?" questions.
2. save 20–30 real inputs: throw them in a repo or a simple json file and treat it as the eval set.
3. run 2–3 models: your default, a cheaper alternative, and (if relevant) a local one.
4. use 3 judge models: call them manually or via a script with blind a/b prompts (see the sketch after this list).
5. review high-disagreement cases: look at where the judges strongly prefer one model, and decide what you'd do if this were hooked into your platform.
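for step 4 you don't even need api access to the judges yet; a tiny script (assuming the json layout from steps 2–3) can print blind a/b prompts you paste into whatever chat ui you already have, while keeping the key to yourself:

```python
import json
import random

# assumes the file from the earlier sketch: rows of {"input_id", "model", "output"}
with open("research_brief_outputs.json") as f:
    outputs = json.load(f)

by_input = {}
for row in outputs:
    by_input.setdefault(row["input_id"], []).append(row)

key = {}  # keep the a/b-to-model mapping yourself; the judges never see it
for input_id, rows in by_input.items():
    a, b = random.sample(rows, k=2)
    key[input_id] = {"a": a["model"], "b": b["model"]}
    print(f"--- {input_id} ---")
    print("which summary is better for a busy pm? answer 'a' or 'b' and say why.\n")
    print("summary a:\n" + a["output"] + "\n")
    print("summary b:\n" + b["output"] + "\n")

with open("ab_key.json", "w") as f:
    json.dump(key, f, indent=2)
```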
once that feels useful, the next step is obvious: bake these steps into your middleware or ide, so evaluation isn't a side project—it's just part of how your llm workflows work.