Why Use More Than One LLM
Why Use More Than One LLM
I started with one model. It was good. I assumed the right answer was "find the best model and stick with it" — like picking a code editor or a phone OS. Switching is overhead. Lock-in is fine if the lock is worth it.
I was wrong. Not because any one model was bad, but because using only one model hides things that only show up when you can compare.
This is a piece about what those things are.
What single-model use hides
Three categories of structural insight you can't get from inside one ecosystem.
1. Model-specific failure modes look like AI failure modes
When your only frame of reference is one model, every weird behavior reads as "that's how AI works." You start writing workarounds for failures that aren't universal. You design your workflow around quirks that the next model over doesn't have.
A concrete example. Model A is consistently overconfident on factual claims — it volunteers numbers and dates and citations with the same tone whether it actually knows them or is filling in plausible blanks. So you build elaborate verification habits. You write a brain-file rule that forces it to label confidence on every claim. You learn to second-guess anything specific. Then you try Model B and find it volunteers uncertainty unprompted. The habit you built wasn't a universal AI hygiene practice. It was a coping mechanism for one model's training distribution.
This happens at every level — not just factual confidence but verbosity, structure, willingness to disagree, eagerness to refactor. The corrections you've been making are partly to AI and partly to that AI. You can't tell which until you have something to compare against.
2. The "obvious right answer" is often a model preference
Ask three frontier models how to structure a system, and you'll get three different "obvious" answers. Each will be confident. Each will sound reasonable. None of them is necessarily correct — they're reflecting their training distributions and house preferences, the dominant patterns in the codebases or arguments or essays they were trained on.
When you only see one answer, you can't tell whether it's right or that model's opinion. With multiple models, the disagreements become legible. Where they agree, you're probably on solid ground. Where they diverge, you've found a real decision to make — and the divergence itself is information about what's actually contested in the field versus what's just one community's convention.
This shows up most clearly in structural decisions. Architecture choices, naming conventions, how to organize a directory tree, how to break work into functions. Single-model routing makes these feel obvious; multi-model exposure reveals they're choices.
3. Your prompts are co-evolved with one model's quirks
The prompts you've refined over months for Model A are partly good prompts and partly Model-A-specific tuning. When you move work to Model B, the prompts that worked beautifully might fall flat — not because B is worse, but because the prompts were over-fit to A's expectations.
Multi-model use forces you to write prompts that are robust, not just effective for one model. The result is better thinking about what you actually want from the agent, independent of any one model's preferences. Specifying outcomes more clearly. Stating constraints more explicitly. Naming the format you want instead of relying on a model's default.
The prompts that survive a model switch are the ones that were doing real work. The ones that broke were doing magic that looked like real work.
The practical cases
Beyond the structural insights, there are tactical reasons to run more than one.
Different models are stronger at different things. This is unstable — leadership rotates with releases — but at any moment, one model is better at code reasoning, another at long-context synthesis, another at OCR or vision. Routing work to the right model matters.
Cross-model verification. When two models independently arrive at the same conclusion, your confidence should rise. When they diverge, you've found something worth examining. This is generator-critic at its simplest. You don't need a fancy multi-agent framework — you need to run the same hard question through two different models and read the gap.
Failure mode isolation. When one model gets weird — a bad release, a transient API issue, a guardrail change that suddenly refuses work it used to handle — you have somewhere to go. You're not locked in. The day a model regression breaks your workflow, having a second tool isn't redundant. It's continuity.
Cost structure variation. API pricing, subscription tiers, usage caps differ. Heavy work on one platform may be cheaper than the same work on another. You don't have to commit to a single vendor's economics.
The costs are real
This isn't free.
Setup compounds. Each model is a separate install, login, configuration. Brain-file conventions differ — Claude reads CLAUDE.md, Codex reads AGENTS.md. Getting two models running well takes more than twice the effort of one, because you're also building the routing instinct, not just two parallel installs.
Context fragmentation. Conversation history doesn't transfer between models. Files do — which is one reason file-based context (the topic of the previous tutorial) becomes the bridge. The agent's working memory has to live in your filesystem, not in any single model's session, or moving between models means restarting from zero.
Decision fatigue. Every task now has a "which model" question on top of "how should I do this?" The decision becomes lighter with practice but never fully disappears. For routine work it becomes automatic. For ambiguous work, the routing question itself becomes part of the cognitive load.
Mental model overhead. You have to keep multiple models' tendencies in your head and route accordingly. Where one is overconfident, where another hedges. Where one over-refactors, where another stays in scope. This is real working memory, not just trivia.
How to start without drowning
The realistic starter pattern, not the maximalist one.
Pick a primary. One model handles eighty percent of work — your default. Subscription if heavy use. Match wallet to usage as in the previous article on tool selection.
Add one secondary for cross-checking. When something matters — a critical document, a structural decision, a piece of work you can't easily verify — run it through the secondary independently. Compare the outputs. Where they agree, ship. Where they disagree, dig in.
Notice when the secondary outperforms. That's data about which work belongs on which model. Over time, you'll route certain task types automatically — without thinking about it, your hand goes to one tool for one kind of work and another tool for another. That's the routing instinct earning its keep.
Don't try to use three models in parallel for the same task. That's not multi-model use; that's a performance review of language models with you as the grader. Tedious, slow, and the value is small. Two models comparing on important work captures most of the cross-validation benefit. A third active in your rotation is fine for specialized capability access (live web, OCR, etc.) — but not for parallel comparison on the same task.
This is sustainable. Two models, one primary one secondary, used for different things, gradually finding their lanes.
The insight that changed my mind
The thing that finally convinced me wasn't any one feature. It was this: the parts of my workflow that survived the move from one model to another were the parts that were actually well-designed. The parts that broke were the parts that had been tuned to a specific model's quirks I'd mistaken for universal patterns.
Cross-model use is a forcing function for better workflow design. The models are good. The discipline of using them in the open, not in lock-in, is what makes you better at using them.
If you're running one model and thinking "this works fine," you're probably right — for now. The cost of having only one tool isn't visible from inside that one tool. It only shows up when you switch, when a model regresses, when a release rotates the leaderboard, or when you realize you've been working around a problem that was never supposed to be your problem.
The next article covers how I actually route between Claude, ChatGPT, and Gemini for different kinds of work — the practical heuristics, not just the case for diversification.