how to compare llms in one click with a prompt ide and an llm council
if you've ever tried to compare llms, you've probably done some version of this:
- • paste the same prompt into three or four models
- • flip between tabs
- • squint at the answers and think, "i guess this one?"
it works for a quick gut check. as an llm evaluation strategy, it's not great: slow, hard to repeat, and heavily influenced by whichever model or person is acting as the judge.
a cleaner option is to build comparison into the tool you already use to write and test prompts, and let a small council of llms help you decide which answer is better.
for any prompt or prompt chain: pick models → collect answers → council votes → rank results
that turns "compare llms" from a messy manual chore into a small experiment you can run every time you hit "run."
> why playground-style llm comparison gets old fast
most prompt engineers start in playgrounds or console uis. that's fine until you hit a few friction points:
- • you end up with 4–5 tabs open (openai, anthropic, google, local playground…) and you're alt-tabbing like it's 2009.
- • you can't easily explain why you liked one answer more than another.
- • two people on your team run the same test and come back with different favorites.
for research prompts—summaries, analyses, "think with me" questions—this gets worse. there isn't a single canonical answer. small differences in grounding, clarity, or instruction-following matter, and a single "judge" (human or model) quietly shapes the whole stack.
we don't change how you write prompts; we change what happens after.
you start with whatever you'd normally test: a single prompt, or a multi-step prompt chain (retrieve → outline → draft → refine).
in an ide, each candidate is something like:
- • model name
- • system prompt
- • a few settings (temperature, max tokens, etc.)
- • optional: chain definition
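as a rough sketch, a candidate could be represented like this in python (field names are illustrative, not any specific ide's schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    # illustrative fields only: not any specific ide's schema
    name: str                   # label you'll see in the results, e.g. "frontier-a / strict system prompt"
    model: str                  # provider model id
    system_prompt: str
    temperature: float = 0.2
    max_tokens: int = 1024
    chain: Optional[list[str]] = None   # optional multi-step chain, e.g. ["retrieve", "outline", "draft", "refine"]

candidates = [
    Candidate("frontier-a", "frontier-model-a", "you are a careful research assistant."),
    Candidate("frontier-b", "frontier-model-b", "you are a careful research assistant."),
    Candidate("local-open", "local-open-weights", "answer concisely and cite the provided context."),
]
```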
you tick the configs you want to test: maybe two frontier models and a local open-weights model, each with its own system prompt.
when you click "run across models" in a prompt ide:
- • the same input goes to each candidate configuration
- • each model returns an answer: r1, r2, r3, …
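the fan-out itself can be a few lines; here `call_model` is a placeholder for whatever provider sdk or local runtime you actually use:

```python
def call_model(candidate: Candidate, prompt: str) -> str:
    """placeholder: wire this to your provider sdk, gateway, or local runtime."""
    raise NotImplementedError

def run_across_models(candidates: list[Candidate], prompt: str) -> dict[str, str]:
    # the same input goes to every candidate configuration;
    # each returns one answer, keyed by the candidate's name
    return {c.name: call_model(c, prompt) for c in candidates}

# responses = run_across_models(candidates, "summarize the attached transcript in 5 bullets")
```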
before anything gets judged, the system does a light cleanup:
- • normalizes obvious formatting differences (extra headers, boilerplate disclaimers)
- • tries to strip self-identification ("as gpt-4, i…") when possible
then it relabels the answers internally as answer a, answer b, answer c. judges won't see model names or provider hints; they just see the prompt and a pair of answers.
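a minimal sketch of that cleanup and relabeling step; the self-identification regex is illustrative and won't catch every pattern:

```python
import random
import re
import string

# illustrative: strips sentences like "as gpt-4, i ..." but won't catch everything
SELF_ID = re.compile(r"\bas (an? )?(gpt|claude|gemini|llama)[-\w.]*\b[^.\n]*[.\n]?", re.IGNORECASE)

def normalize(answer: str) -> str:
    # light cleanup: drop self-identification, collapse extra blank lines
    answer = SELF_ID.sub("", answer)
    answer = re.sub(r"\n{3,}", "\n\n", answer)
    return answer.strip()

def blind(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """relabel answers as a, b, c, ... and keep a private mapping back to model names."""
    names = list(responses)
    random.shuffle(names)
    blinded = {string.ascii_lowercase[i]: normalize(responses[n]) for i, n in enumerate(names)}
    mapping = {string.ascii_lowercase[i]: n for i, n in enumerate(names)}
    return blinded, mapping
```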
now you bring in a few judge models. these might be different from the candidates, and ideally from different families so you're not locked into one vendor's taste.
for each pair of answers—a vs b, a vs c, b vs c…—a judge sees:
- • the original prompt (and context, if you're using rag)
- • two answers in random order, labeled a and b
the judging prompt is narrow and repeatable. for example:
"compare answer a and answer b for the prompt above. pick an overall winner.
also say which is better at:
– actually answering the question
– staying faithful to the context
– following the requested format / tone
– being clear and easy to read."
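sketched as a reusable template, assuming you build the judge prompt as a plain string (the wording matches the example above; nothing about it is standardized):

```python
JUDGE_TEMPLATE = """you are comparing two answers to the same prompt.

prompt:
{prompt}

{context_block}answer a:
{answer_a}

answer b:
{answer_b}

compare answer a and answer b for the prompt above. pick an overall winner.
also say which is better at:
- actually answering the question
- staying faithful to the context
- following the requested format / tone
- being clear and easy to read

reply with json only, using the keys: overall_winner, better_task_success,
better_factuality, better_instruction_following, better_clarity, rationale.
"""

def build_judge_prompt(prompt: str, answer_a: str, answer_b: str, context: str = "") -> str:
    context_block = f"context:\n{context}\n\n" if context else ""
    return JUDGE_TEMPLATE.format(
        prompt=prompt,
        context_block=context_block,
        answer_a=answer_a,
        answer_b=answer_b,
    )
```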
you don't need a long essay back. a small json blob is enough:
{
"overall_winner": "a",
"better_task_success": "a",
"better_factuality": "a",
"better_instruction_following": "b",
"better_clarity": "a",
"rationale": "a directly answers the question and cites figures from the transcript; b is more verbose and adds speculation."
}

to cut down on order bias, you can run each pair twice:
- • once with (answer a, answer b)
- • once with (answer b, answer a)
if the judge flips its decision when the order flips, you treat that comparison as a tie or low-confidence. if you have three judges, you can also take a majority vote.
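putting those two ideas together, a pairwise judging loop might look like this sketch; `call_judge` is a placeholder for calling one judge model and parsing its json reply, and `build_judge_prompt` is the template from earlier:

```python
from collections import Counter
from typing import Optional

def call_judge(judge: str, judge_prompt: str) -> dict:
    """placeholder: send the judging prompt to one judge model and parse its json reply."""
    raise NotImplementedError

def judge_pair(judges: list[str], prompt: str,
               label_x: str, ans_x: str, label_y: str, ans_y: str) -> Optional[str]:
    """return the winning label, or None for a tie / low-confidence comparison."""
    votes = []
    for judge in judges:
        # run the pair twice with the answers in opposite order
        first = call_judge(judge, build_judge_prompt(prompt, ans_x, ans_y))["overall_winner"]
        second = call_judge(judge, build_judge_prompt(prompt, ans_y, ans_x))["overall_winner"]
        # map the positional "a"/"b" verdicts back to our labels
        first_winner = label_x if first == "a" else label_y
        second_winner = label_y if second == "a" else label_x
        if first_winner != second_winner:
            continue  # the judge flipped with the order flip: treat as low confidence
        votes.append(first_winner)
    if not votes:
        return None  # nothing confident: call it a tie
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count > len(votes) / 2 else None  # require a strict majority
```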
for a single live prompt, you don't need full-blown stats. a simple scheme works well:
- • for each answer, count how many pairwise comparisons it wins
- • treat ties and low-confidence cases as neutral
- • if two answers are close, look at factuality or task success votes as tie-breakers
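as a sketch, the ranking step is little more than counting wins across all pairs (reusing `judge_pair` from above):

```python
from itertools import combinations

def rank_answers(judges: list[str], prompt: str, blinded: dict[str, str]) -> list[tuple[str, int]]:
    # count how many pairwise comparisons each answer wins;
    # ties and low-confidence comparisons (None) count for nobody
    wins = {label: 0 for label in blinded}
    for x, y in combinations(blinded, 2):
        winner = judge_pair(judges, prompt, x, blinded[x], y, blinded[y])
        if winner is not None:
            wins[winner] += 1
    return sorted(wins.items(), key=lambda item: item[1], reverse=True)

# ranking = rank_answers(["judge-1", "judge-2", "judge-3"], prompt, blinded)
# the mapping returned by blind() turns the winning label back into a model name
```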
you end up with exactly what you want in an ide:
- • a clear winner for this prompt
- • one or two close alternatives
- • a few sentences explaining the council's reasoning
at that point, you can re-attach model names and show the ranked answers, winner first, with the council's reasoning alongside.
some teams just use this as "better autocomplete for their own judgment"; others are comfortable letting the council's top choice become the default config for that workflow.
> why this feels better than flipping between tabs
compared to the usual tab-flipping:
- • you save time: you can still inspect every answer, but you don't have to start from a blank wall of text. the ranking gives you a sensible order to read in.
- • consistent definition of 'better': judges apply the same evaluation criteria every time, and they don't see model identities. you're less likely to overfit to one model's quirks.
- • you see trade-offs explicitly: one model may win on factuality, another on style. the council makes that visible instead of burying it in your memory of a dozen prompts.
and because the evaluation is phrased in terms of task success, factuality, instruction-following, and clarity, it aligns better with how people actually talk about llm quality than a single "score out of 10."
> from one prompt to longer-term llm evaluation
so far this has all been about a single run: one prompt, a handful of models, one llm council.
if you log those council decisions over time—prompt (redacted), candidates, winners, dimension-level votes—you slowly build a dataset that reflects how your models behave on real work:
- • model x tends to win on factual accuracy in research-heavy flows
- • model y is better at following strict formatting for reports
- • a particular system prompt template holds up unusually well across topics
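one way to sketch such a log record in python; the fields mirror the description above and are illustrative rather than a required schema:

```python
import json
import time

def log_decision(path: str, workflow: str, mapping: dict[str, str],
                 ranking: list[tuple[str, int]], dimension_votes: dict[str, str]) -> None:
    record = {
        "timestamp": time.time(),
        "workflow": workflow,                # e.g. "research-summary"
        "prompt": "<redacted>",              # keep the raw prompt out of the log if it's sensitive
        "candidates": list(mapping.values()),
        "ranking": [[mapping[label], score] for label, score in ranking],
        "dimension_votes": dimension_votes,  # e.g. {"factuality": "model-x", "formatting": "model-y"}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")   # append-only jsonl, one council decision per line
```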
you can then use that for higher-level llm evaluation:
- • choosing default models per workflow
- • deciding when it's worth switching to a new provider
- • designing routing rules that aren't just "use the latest model everywhere"
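as a small example of that last point, a routing rule built on the log can be a per-workflow win-rate lookup; this is a sketch, and the minimum-runs threshold is arbitrary:

```python
import json
from collections import Counter, defaultdict

def default_model_per_workflow(log_path: str, min_runs: int = 20) -> dict[str, str]:
    # pick the model that most often tops the council ranking in each workflow,
    # and only commit to a default once there's enough data
    wins: dict[str, Counter] = defaultdict(Counter)
    runs: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            runs[record["workflow"]] += 1
            top_model = record["ranking"][0][0]
            wins[record["workflow"]][top_model] += 1
    return {
        wf: counter.most_common(1)[0][0]
        for wf, counter in wins.items()
        if runs[wf] >= min_runs
    }
```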