How I Choose Between Claude, ChatGPT, and Gemini

There's no "best" model. There's "right model for this kind of work, given the current state of the field." This is a practical guide to how I actually route tasks across the three frontier LLMs — not a feature comparison, not a benchmark report, just the heuristics that survive contact with real work.

A caveat up front: everything specific in this piece will be wrong in three months. Frontier rotates. Treat the routing below as a worked example of the shape of a routing strategy, not a permanent prescription. The principles will hold longer than the model assignments.

Why "just pick the best" doesn't work

Most people land on "best model" through one of four paths:

One benchmark moment looks definitive
A trusted creator endorses one
The first one they tried was good enough
Lock-in inertia

All of these are real reasons. None of them produce a routing strategy that holds up over six months of frontier rotation. The leaderboard moves; your task mix doesn't. The model that wins on coding benchmarks today may not be the one that handles long-context synthesis next quarter, and your work isn't all coding or all synthesis.

The trap is thinking "best" is a property of the model. It isn't. "Best" is a property of work-meeting-model — the right tool for what you're actually doing right now.

The three dimensions that actually matter

Forget benchmarks. Three things decide routing for me.

1. Cognitive shape of the work

Different work needs different model behaviors. Roughly:

Synthesis-heavy work — long context, distill many sources into a coherent argument. Fits whichever model has the strongest long-context behavior right now without losing the thread.
Code reasoning and structured editing — fits whichever model has the most disciplined reading-before-writing behavior. Models that hallucinate file contents are dangerous here.
Conversational drafting and tone work — fits whichever model has the writing voice you trust for that kind of output. This is highly subjective. Read three drafts of the same essay across three models and you'll feel the differences immediately.
Adversarial / critical review — fits whichever model is willing to disagree with you cleanly. Some models hedge into uselessness when you ask them to push back. Others push back too hard and miss the point. You want the one that holds the line on substance without performing skepticism.

These aren't fixed assignments. They're patterns the model has to fit for this kind of work.

2. Trust posture

How much can I trust the output without independent verification?

Some models trend confidently wrong on edge cases. They're useful where verification is cheap — code that you'll run and see the result, calculations you can spot-check, drafts you'll read before sending. Avoid them for outputs you'd ship without reading carefully.

Some models hedge more, often unnecessarily, but rarely produce confident hallucinations. They're slower in conversation but safer for one-shot work where you won't have time to verify.

Knowing each model's failure mode is a form of routing intelligence. The model with the highest peak performance isn't always the right one. The model with the failure mode that fits your verification capacity is.

3. Capability access

Some work needs a specific capability — and only one model has it, or has it well enough.

Web access and live retrieval
OCR and vision quality good enough for actual scanned documents
File integration with services you use (calendar, email, docs)
Tool ecosystems (skills, hooks, MCP servers, sub-agents)

These shift with every release. At any given moment, certain work has to go to a specific model because nothing else can actually do it.

My current routing (as of this writing)

Important: this will be wrong in three months. Use it as a worked example of the shape of routing, not a stable map.

Default work — Claude

Drafting essays, reading and analyzing documents, code structure work, decision-document creation. Default home for ambiguous work because:

Strong reading-before-writing discipline (matters for code and for structured documents I want it to actually follow)
Reasonable willingness to disagree without being contrarian
Long context that holds up under load — doesn't get noticeably worse at turn 40 the way some models do
The CLI tool (Claude Code) is where most of my actual production work happens, and the CLI integration changes the cost-benefit math for everything else

When I'm not sure where a task should go, it goes here.

Email, Calendar, Live Web — Gemini

Anything that needs to talk to Gmail, Google Calendar, or the live web with retrieval. Not because Gemini is better at reasoning — it isn't, in my experience — but because it has access I can use and the others don't, or don't reliably.

Gmail integration, native
Calendar reading and analysis at scale
OCR-heavy PDFs (scanned legal documents, scanned contracts) — substantially better than the alternatives in my testing

This is the clearest case of capability-access routing. Gemini gets the work because nothing else can do it well. The reasoning quality is fine.

Architecture critique, system engineering — Codex / GPT

When I need a second opinion on system design, architecture decisions, or scripting work, GPT through Codex CLI gets the call. Different training, different defaults, useful as a counterweight to Claude.

Strong at proposing alternatives I wouldn't have considered
Different "obvious" answers than Claude — useful precisely because of the divergence
When I want adversarial review of something I built, GPT is the second seat. Asking the same model that wrote a plan to critique the plan rarely produces the friction you actually need

This is where the multi-model insight from the previous article pays off most. The disagreement between Claude and GPT on the same architectural question is information you can't get from either alone.

When the routing is unclear

When I genuinely don't know which model fits, I run the same task through two and compare. Cheap (15 minutes), informative (almost always one is clearly better for this kind of task), and over time it builds a routing intuition that doesn't need explicit checking.

The ten minutes I spend running both today saves an hour of "this output isn't quite right" later.

Anti-heuristics — things I stopped doing

Traps that wasted time before I found the routing above.

"Pick the model with the best benchmark." Benchmarks measure benchmark performance. The work I do isn't on the benchmark. A model can win a coding benchmark and still hallucinate file contents on real code. A model can lose a long-context benchmark and still hold a 50-page conversation more usefully than the winner.

"Pick the model AI Twitter is excited about this week." AI Twitter cycles faster than your task mix. The model that's the toast of Friday's discourse may not be the one you want for Monday's work. Lag the hype by a quarter.

"Pick the cheapest." Cost matters at scale, not at the level of individual tasks. The cost of choosing the wrong model for important work dwarfs the cost of API calls. Optimize for output quality on hard work; optimize for cost only on routine work where the quality difference is negligible.

"Pick the one with the most features." More features means more attack surface for the model to misuse. I want the model that's good at the few things I'm asking for, not one that does everything middlingly. Feature count is a marketing metric, not a routing metric.

The subscription question

Realistically, you don't run all three. You pay for one or two.

One subscription, occasional API access to others. Most realistic for individual users. Pay $20/mo for your primary, use API credits for the secondary when verification matters. Total monthly: $20 plus $5–20 in occasional API.

Two subscriptions, no API. What I do — Claude Max for production, ChatGPT Plus for cross-checking and Codex CLI access. Around $120/month for both. Pays for itself if your work is heavy enough that you're using both substantively most weeks.

Three subscriptions. Diminishing returns unless you have an unusually broad workload across all three providers. For most people, the third subscription is buying capability access (Gemini for Gmail/Calendar/OCR), not capacity — and that's often a "use the API or free tier" situation rather than a full subscription.

The decision isn't really "which model." It's "which two ecosystems are worth paying into."

The honest limitation

I am one person doing one kind of work — heavy on writing, document reasoning, system architecture, governance, and household-coordination. Light on multimedia, real-time interaction, or production code at scale. The routing above is what works for that mix. Your task mix may push the routing in different directions.

The takeaway isn't "use Claude for X." It's:

Figure out the shape of your work, not the model categories.
Stop chasing "best model."
Accept that the routing will need to change every few months.
Build the habit of comparing on important work, not just defaulting.

The routing I described here will evolve. The four habits won't.

How I Choose Between Claude, ChatGPT, and Gemini

How I Choose Between Claude, ChatGPT, and Gemini

Why "just pick the best" doesn't work

The three dimensions that actually matter

1. Cognitive shape of the work

2. Trust posture

3. Capability access

My current routing (as of this writing)

Default work — Claude

Email, Calendar, Live Web — Gemini

Architecture critique, system engineering — Codex / GPT

When the routing is unclear

Anti-heuristics — things I stopped doing

The subscription question

The honest limitation

Read more

Tutorial Part 5: When Configuration Becomes Architecture

Tutorial Part 4: When the Agent Doesn't Know Enough

Tutorial Part 3: When the Agent Goes Off the Rails

Tutorial Part 2: When the Agent Forgets