// practical_guide

how to compare llms in one click with a prompt ide and an llm council

10 min read · december 2024

if you've ever tried to compare llms, you've probably done some version of this:

  • paste the same prompt into three or four models
  • flip between tabs
  • squint at the answers and think, "i guess this one?"

it works for a quick gut check. as an llm evaluation strategy, it's not great: slow, hard to repeat, and heavily influenced by whichever model or person is acting as the judge.

a cleaner option is to build comparison into the tool you already use to write and test prompts, and let a small council of llms help you decide which answer is better.

this post walks through one concrete pattern for doing that in a prompt ide, with x47.ai as an occasional example of how it can look in practice.

for any prompt or prompt chain:

1. pick models: pick a few llms (and configs) you want to compare.
2. collect answers: send the same input to each and collect their answers.
3. council votes: ask a small llm council to compare answers in blind a/b pairs on things like task success and factuality.
4. rank results: rank the answers based on the council's votes and show the winner, with a short "why."

that turns "compare llms" from a messy manual chore into a small experiment you can run every time you hit "run."
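
in code, the whole loop is only a few lines. here's a rough python sketch; collect_answers, judge_pairs, and rank_by_wins are made-up helper names for this post rather than a real api, and each one gets its own sketch further down.

// council_pipeline_sketch
def compare_llms(prompt, candidates, judges):
    # step 1 is the `candidates` list itself: the model configs you ticked for this run
    answers = collect_answers(prompt, candidates)    # step 2: same input to every model
    verdicts = judge_pairs(prompt, answers, judges)  # step 3: blind a/b votes from the council
    return rank_by_wins(answers, verdicts)           # step 4: winner first, runners-up after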

> why playground-style llm comparison gets old fast

most prompt engineers start in playgrounds or console uis. that's fine until you hit a few friction points:

  • you end up with 4–5 tabs open (openai, anthropic, google, local playground…) and you're alt-tabbing like it's 2009.
  • you can't easily explain why you liked one answer more than another.
  • two people on your team run the same test and come back with different favorites.

for research prompts—summaries, analyses, "think with me" questions—this gets worse. there isn't a single canonical answer. small differences in grounding, clarity, or instruction-following matter, and a single "judge" (human or model) quietly shapes the whole stack.

a more robust pattern is: multiple models answer → multiple models judge → you get a structured recommendation, not just a pile of text. that's the llm council idea.

the pattern doesn't change how you write prompts; it changes what happens after.

you start with whatever you'd normally test:

// single_prompt_example
"summarize this earnings call transcript for a product manager and highlight three product risks."

or a multi-step prompt chain: retrieve → outline → draft → refine.

in an ide, each candidate is something like:

  • model name
  • system prompt
  • a few settings (temperature, max tokens, etc.)
  • optional: chain definition

you tick the configs you want to test: maybe two frontier models and a local open-weights model, each with its own system prompt.

in x47, for example, each row in the sidebar is one of these configs—you can flip them on and off for a run.
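
as a rough picture of what one of those rows holds, here's an illustrative python shape. the field names are made up for this post, not x47's actual schema:

// candidate_config_sketch
from dataclasses import dataclass, field

@dataclass
class CandidateConfig:
    # illustrative fields only, not a real product schema
    model: str                     # e.g. a frontier model or a local open-weights one
    system_prompt: str = ""
    temperature: float = 0.2
    max_tokens: int = 1024
    chain: list[str] = field(default_factory=list)  # optional: retrieve, outline, draft, refine
    enabled: bool = True           # the "ticked on for this run" flag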

when you click "run across models" in a prompt ide:

  • the same input goes to each candidate configuration
  • each model returns an answer: r1, r2, r3, …

before anything gets judged, the system does a light cleanup:

  • normalizes obvious formatting differences (extra headers, boilerplate disclaimers)
  • tries to strip self-identification ("as gpt-4, i…") when possible

then it relabels the answers internally as answer a, answer b, answer c. judges won't see model names or provider hints; they just see the prompt and a pair of answers.
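
a minimal sketch of that fan-out and cleanup step, assuming a generic call_model(config, prompt) stand-in rather than any particular provider sdk:

// collect_and_anonymize_sketch
import random
import re
import string

def collect_answers(prompt: str, candidates: list) -> dict:
    # fan the same prompt out to every enabled candidate;
    # call_model(config, prompt) is a hypothetical stand-in for whatever client you use
    raw = {c.model: call_model(c, prompt) for c in candidates if c.enabled}

    cleaned = {}
    for model, text in raw.items():
        # light cleanup: strip obvious self-identification like "as gpt-4, i ..."
        cleaned[model] = re.sub(r"(?i)^as [\w .\-]+, i\b", "i", text.strip())

    # shuffle and relabel as answer a, b, c ... so judges never see model names
    models = list(cleaned)
    random.shuffle(models)
    return {label: {"model": m, "text": cleaned[m]}
            for label, m in zip(string.ascii_lowercase, models)}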

now you bring in a few judge models. these might be different from the candidates, and ideally from different families so you're not locked into one vendor's taste.

for each pair of answers—a vs b, a vs c, b vs c…—a judge sees:

  • the original prompt (and context, if you're using rag)
  • two answers in random order, labeled a and b

the judging prompt is narrow and repeatable. for example:

// judging_prompt
"given the user's request and the context, which answer is better overall?
also say which is better at:
– actually answering the question
– staying faithful to the context
– following the requested format / tone
– being clear and easy to read."

you don't need a long essay back. a small json blob is enough:

// json_response
{
  "overall_winner": "a",
  "better_task_success": "a",
  "better_factuality": "a",
  "better_instruction_following": "b",
  "better_clarity": "a",
  "rationale": "a directly answers the question and cites figures from the transcript; b is more verbose and adds speculation."
}
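
wiring one blind comparison up is short. this sketch assumes the same hypothetical call_model helper as before, and bakes the judging prompt above into a constant with an added "reply as json" instruction:

// pairwise_judge_sketch
import json

# the judging prompt from above, plus an instruction to answer as json
JUDGING_PROMPT = (
    "given the user's request and the context, which answer is better overall? "
    "also say which is better at: answering the question, staying faithful to the "
    "context, following the requested format / tone, and being clear and easy to read. "
    "reply as json with keys overall_winner, better_task_success, better_factuality, "
    "better_instruction_following, better_clarity, rationale (winners are 'a' or 'b')."
)

def judge_pair(judge, prompt: str, first: str, second: str) -> str:
    # the judge sees only the request and two anonymized answers;
    # call_model is the same hypothetical stand-in as in the earlier sketch
    reply = call_model(judge, f"{JUDGING_PROMPT}\n\nrequest:\n{prompt}\n\n"
                              f"answer a:\n{first}\n\nanswer b:\n{second}")
    try:
        return json.loads(reply)["overall_winner"]  # "a" or "b"
    except (json.JSONDecodeError, KeyError):
        return "tie"  # a judge that rambles instead of returning json counts as a tie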

to cut down on order bias, you can run each pair twice:

  • once with (answer a, answer b)
  • once with (answer b, answer a)

if the judge flips its decision when the order flips, you treat that comparison as a tie or low-confidence. if you have three judges, you can also take a majority vote.
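
a sketch of that swap check, building on the judge_pair helper above; a verdict that doesn't survive the swap is downgraded to a tie:

// order_swap_check_sketch
def judge_pair_with_swap(judge, prompt: str, answers: dict, a: str, b: str) -> str:
    # ask once as (a, b) and once as (b, a); keep the verdict only if it's stable
    first = judge_pair(judge, prompt, answers[a]["text"], answers[b]["text"])
    second = judge_pair(judge, prompt, answers[b]["text"], answers[a]["text"])

    # map the positional "a"/"b" verdicts back to the real answer labels
    winner_first = {"a": a, "b": b}.get(first, "tie")
    winner_second = {"a": b, "b": a}.get(second, "tie")

    # a flipped decision means low confidence, so treat it as a tie
    return winner_first if winner_first == winner_second else "tie"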

in x47's internal experiments, this double-check is where a lot of "weird" comparisons show up—useful for debugging your judging prompts.

for a single live prompt, you don't need full-blown stats. a simple scheme works well:

  • for each answer, count how many pairwise comparisons it wins
  • treat ties and low-confidence cases as neutral
  • if two answers are close, look at factuality or task success votes as tie-breakers
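
in code, that tally is just a counter over the swap-checked verdicts. this sketch leaves out the dimension-level tie-breakers and sorts by pairwise wins only:

// win_count_ranking_sketch
from collections import Counter
from itertools import combinations

def judge_pairs(prompt: str, answers: dict, judges: list) -> list:
    # one swap-checked verdict per (pair, judge); each verdict is a label or "tie",
    # so three judges on a pair is effectively a small majority vote
    return [judge_pair_with_swap(j, prompt, answers, a, b)
            for a, b in combinations(sorted(answers), 2)
            for j in judges]

def rank_by_wins(answers: dict, verdicts: list) -> list:
    # count pairwise wins; ties and low-confidence verdicts stay neutral
    wins = Counter(v for v in verdicts if v != "tie")
    return sorted(answers, key=lambda label: wins[label], reverse=True)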

you end up with exactly what you want in an ide:

  • a clear winner for this prompt
  • one or two close alternatives
  • a few sentences explaining the council's reasoning

at that point, you can re-attach model names and show something like:

1. model b – best overall: council liked its task success and factual grounding
2. model a – tied on clarity, more verbose
3. model c – shorter, missed some key details from the transcript

some teams just use this as "better autocomplete for their own judgment"; others are comfortable letting the council's top choice become the default config for that workflow.

in x47, this shows up as a "council panel": a ranked list of answers with small badges ("best factuality", "best clarity") and a summarized rationale lifted from the judges.

> why this feels better than flipping between tabs

compared to the usual tab-flipping:

you save time

you can still inspect every answer, but you don't have to start from a blank wall of text. the ranking gives you a sensible order to read in.

consistent definition of 'better'

judges apply the same evaluation criteria every time, and they don't see model identities. you're less likely to overfit to one model's quirks.

you see trade-offs explicitly

one model may win on factuality, another on style. the council makes that visible instead of burying it in your memory of a dozen prompts.

and because the evaluation is phrased in terms of task success, factuality, instruction-following, and clarity, it aligns better with how people actually talk about llm quality than a single "score out of 10."

> from one prompt to longer-term llm evaluation

so far this has all been about a single run: one prompt, a handful of models, one llm council.

if you log those council decisions over time—prompt (redacted), candidates, winners, dimension-level votes—you slowly build a dataset that reflects how your models behave on real work:

  • model x tends to win on factual accuracy in research-heavy flows
  • model y is better at following strict formatting for reports
  • a particular system prompt template holds up unusually well across topics
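
if you want that log, one json line per run is plenty. this sketch uses illustrative field names, not x47's actual format:

// council_log_sketch
import json
import time

def log_council_run(path: str, prompt_id: str, answers: dict, verdicts: list, ranking: list) -> None:
    # append one json line per run; store a redacted prompt id rather than the raw text,
    # and keep whatever dimension-level votes you have alongside the overall verdicts
    record = {
        "ts": time.time(),
        "prompt_id": prompt_id,
        "candidates": [a["model"] for a in answers.values()],
        "winner": answers[ranking[0]]["model"],
        "verdicts": verdicts,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")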

you can then use that for higher-level llm evaluation:

  • choosing default models per workflow
  • deciding when it's worth switching to a new provider
  • designing routing rules that aren't just "use the latest model everywhere"

the important bit is that you didn't have to run a giant offline benchmark first. you got the data as a side effect of making "compare llms" a one-click operation inside your prompt ide, backed by a small, opinionated, and blind llm council.