Everyone has an opinion about which AI model is best.

Most of those opinions are based on one task, tested once, evaluated subjectively.

We ran Claude 3.7 Sonnet, ChatGPT-4o, and Gemini 2.5 Pro through 12 real, professionally relevant tasks, the kind operators actually use these tools for, and scored each output on instruction-following, output quality, format accuracy, and specificity.

The results were not what the marketing teams at Anthropic, OpenAI, or Google would want you to see.

Here’s everything.

How We Ran the Tests

The rule: Same prompt. Same RCTF structure. Identical input. No re-prompting or follow-up clarification allowed on the first pass. Every output was scored blind, before we knew which model had produced it.
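If you want to repeat the blind step yourself, here is a minimal sketch in Python. It is not the tooling used for this test; the function name `blind_outputs` and the "Output A" labels are assumptions made for illustration. It shuffles the outputs, hides the model names behind neutral labels, and keeps a key so you can unblind after scoring.

```python
import random


def blind_outputs(outputs_by_model, seed=None):
    """Shuffle model outputs and hide model names behind neutral labels.

    Hypothetical helper, not the newsletter's actual tooling.
    Returns (blinded, key): `blinded` maps labels such as "Output A" to the
    output text, and `key` maps each label back to the model that wrote it.
    """
    rng = random.Random(seed)
    models = list(outputs_by_model)
    rng.shuffle(models)  # randomize order so position gives nothing away

    blinded, key = {}, {}
    for index, model in enumerate(models):
        label = f"Output {chr(ord('A') + index)}"
        blinded[label] = outputs_by_model[model]
        key[label] = model
    return blinded, key
```

Score everything in `blinded` first, and only open `key` once the scores are written down.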

The tasks:

# / Task Category / Specific Task

1 / Writing / First draft of a 600-word professional email campaign

2 / Writing / Rewrite a weak paragraph into three tone variations

3 / Research / Synthesize a briefing from a complex 3,000-word input document

4 / Research / Identify the 5 strongest counterarguments to a stated position

5 / Analysis / Build a structured comparison of 3 options with pros/cons/recommendation

6 / Analysis / Evaluate a business plan section and identify the 3 biggest risks

7 / Operations / Create a project timeline from a vague brief

8 / Operations / Write a Standard Operating Procedure from a verbal description

9 / Strategy / Generate 10 differentiated positioning angles for a product

10 / Strategy / Identify the fastest path to a stated business goal given constraints

11 / Coding/Technical / Write a Python function with error handling and comments

12 / Instruction-Following / Follow a 7-step, multi-format prompt exactly as specified

Evaluation criteria (each scored 1–5):

Instruction-following: Did it do exactly what was asked?

Output quality: Was the result genuinely useful?

Format accuracy: Did it respect the specified format?

Specificity: Was the response specific or generic?
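For anyone scoring their own runs, a rough sketch of how the four criteria could roll up into one number. One loud assumption: this issue does not say how the criterion scores were combined, so the plain averaging below, along with the names `task_score` and `category_score`, is illustrative only.

```python
from statistics import mean

# The four criteria listed above, each scored 1-5.
CRITERIA = (
    "instruction_following",
    "output_quality",
    "format_accuracy",
    "specificity",
)


def task_score(scores):
    """Average the four criterion scores for one model on one task.

    Assumption: equal weighting. The article does not state its method.
    """
    for name in CRITERIA:
        if name not in scores:
            raise ValueError(f"missing criterion: {name}")
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return mean(scores[name] for name in CRITERIA)


def category_score(task_scores):
    """Average per-task scores for a category, rounded to one decimal."""
    return round(mean(task_scores), 1)
```

Equal weighting is the simplest choice. If instruction-following matters most in your workflow, a weighted average would be a reasonable variation.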

The Results

Writing Tasks (Tasks 1 & 2)

Winner: Claude 3.7 Sonnet — by a significant margin.

Claude’s writing output was consistently more varied, more human in cadence, and more responsive to tone instructions. When asked to produce three tone variations (professional, casual, urgent), Claude produced three genuinely distinct outputs. ChatGPT’s three variations were noticeably similar to each other — the “casual” version sounded like a slightly loosened version of the “professional” version. Gemini produced the most generic outputs of the three.

On the 600-word email campaign draft, Claude was the only model that opened with a hook that didn’t sound like a template. Both ChatGPT and Gemini defaulted to subject lines and openers that felt familiar — the “we’re excited to share” energy.

Instruction-following score (Writing): Claude 4.8 / ChatGPT 4.1 / Gemini 3.6

Research Tasks (Tasks 3 & 4)

Winner: Gemini 2.5 Pro — narrowly, with an asterisk.

Gemini’s synthesis of the 3,000-word input document was tighter and better-structured than the other two. It pulled the most relevant threads and organized them clearly. However, it hallucinated one attribution — citing a data point that appeared to come from the input document but didn’t. This is the asterisk.

On the counterargument task (Task 4), Claude produced the most rigorous opposing arguments — including one angle that neither of the other models identified. ChatGPT produced counterarguments that were technically correct but somewhat predictable. Gemini’s were the weakest, tending toward abstract rather than specific challenges.

For research: use Gemini for synthesis and structure, but verify every factual claim. Use Claude when you need unconventional angles and rigorous critical thinking.

Research scores: Gemini 4.4 / Claude 4.2 / ChatGPT 3.9

Analysis Tasks (Tasks 5 & 6)

Winner: Claude — consistently.

The comparison task (Task 5) revealed a meaningful gap. Claude’s output included a recommendation section that actually committed to a recommendation with reasoning. ChatGPT’s hedged (“it depends on your priorities”). Gemini’s was structured beautifully but avoided committing to any direction.

On the business plan risk evaluation (Task 6), Claude identified risks that required reading between the lines of the brief — implications and second-order risks the brief didn’t explicitly state. The other two models identified only the surface-level risks that were obviously present.

Operators, note: When you need analysis that goes beyond the obvious, Claude’s tendency to reason through implications is genuinely valuable.

Analysis scores: Claude 4.7 / ChatGPT 4.0 / Gemini 3.8

Operations Tasks (Tasks 7 & 8)

Winner: ChatGPT — surprisingly strong here.

The project timeline task revealed ChatGPT’s strength with structured operational output. Its timeline was the most practical and included realistic buffer time between phases — something operators who’ve actually run projects would appreciate. Claude’s timeline was thorough but slightly over-optimistic. Gemini’s was visually organized but missed two dependencies mentioned in the brief.

The SOP task was the more revealing of the two operations tasks. ChatGPT produced an SOP that could be handed to a new hire immediately. Claude’s was excellent but slightly more verbose than needed. Gemini’s format was strong but the content was thin in places.

For operations documentation, ChatGPT is the most immediately deployable.

Operations scores: ChatGPT 4.5 / Claude 4.1 / Gemini 3.7

Strategy Tasks (Tasks 9 & 10)

Winner: Claude — by the largest margin of the entire test.

The positioning angles task (Task 9) is where the gap between Claude and the other two became most apparent. Claude produced 10 genuinely differentiated angles, including three that could serve as the foundation for a full brand strategy. ChatGPT produced 10 angles, but 3 of them were variations of the same idea. Gemini produced thoughtful angles but they were more generic — angles that would apply to nearly any product in the category.

Task 10 (fastest path to goal given constraints) showed Claude’s strongest capability: holding complexity, weighing constraints, and producing a prioritized path that actually accounted for the trade-offs stated in the prompt. The other two produced action plans. Claude produced a strategy.

For any work requiring genuine strategic reasoning, Claude is in a class by itself.

Strategy scores: Claude 4.9 / ChatGPT 3.8 / Gemini 3.5

Coding Task (Task 11)

Winner: All three — this was the most even category.

All three models produced correct, working Python functions with appropriate error handling and comments. ChatGPT’s code was the most conventional and cleanest for readability. Claude’s included the most thorough edge-case handling. Gemini’s was functional but included slightly redundant comments.

For coding tasks, any of the three will serve you well. Pick based on what you’re using for the rest of the project.

Instruction-Following Test (Task 12)

Winner: Claude — by a lot.

Task 12 was a deliberate stress test: a 7-step prompt requiring specific formats, specific exclusions, specific length constraints, and a non-standard output structure. This is the test that matters most for operators, because complex prompts are the norm, not the exception.

Claude followed every instruction. All 7 steps. Exact formats. Correct exclusions. Length within spec.

ChatGPT missed two format requirements and ignored one exclusion.

Gemini followed 4 of 7 steps correctly, drifted on format in two places, and added content that was explicitly excluded in the instructions.

Instruction-following score: Claude 5.0 / ChatGPT 3.9 / Gemini 3.2
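Two of the constraint types in Task 12, length limits and exclusions, are easy to check mechanically before a human reviews format. The sketch below is hypothetical: the actual Task 12 prompt is not reproduced in this issue, so the function name `check_constraints`, the word limit, and the banned phrase in the usage example are all assumptions.

```python
def check_constraints(text, max_words, excluded_phrases):
    """Return a list of violations; an empty list means both checks passed.

    Hypothetical checker covering only length and exclusions. Format and
    structure requirements still need a human (or a stricter parser).
    """
    violations = []

    # Length check: whitespace-separated word count against the limit.
    word_count = len(text.split())
    if word_count > max_words:
        violations.append(
            f"length: {word_count} words exceeds limit of {max_words}"
        )

    # Exclusion check: case-insensitive search for each banned phrase.
    lowered = text.lower()
    for phrase in excluded_phrases:
        if phrase.lower() in lowered:
            violations.append(f"exclusion: found banned phrase {phrase!r}")

    return violations
```

For example, `check_constraints(draft, max_words=150, excluded_phrases=["excited to share"])` flags any draft that runs long or uses the banned opener.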

The Summary — Which Model for Which Work

Use Case / Best Model / Second Choice

Writing (drafts, emails, copy) / Claude / ChatGPT

Long-form synthesis and research / Gemini / Claude

Critical analysis and strategy / Claude / none (significant gap)

Operations docs and SOPs / ChatGPT / Claude

Counterarguments and opposing views / Claude / ChatGPT

Complex instruction-following / Claude / none (significant gap)

Coding / All three (tied) / none

Positioning and brand strategy / Claude / none

The Honest Takeaway

Claude is the best model for operators doing knowledge work that requires reasoning, writing quality, and strict instruction-following. It’s not perfect: it finished second to Gemini on research, its project timeline was slightly over-optimistic, and it can be verbose on operations tasks.

ChatGPT is the most balanced model. It won’t win many categories outright, but it won’t lose badly in any. If you want one model for everything, ChatGPT is the safest default.

Gemini’s synthesis capability is genuinely strong — but its instruction-following failures make it risky for complex operator workflows. Use it for specific research tasks under supervision.

The operators we know who use AI most effectively don’t pick one model. They route tasks. Writing and strategy go to Claude. Documentation goes to ChatGPT. Research synthesis goes to Gemini — then gets verified.
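The routing rule above fits in a small lookup table. This is a sketch of the idea, not a real integration: the category keys, the `ROUTES` table, and the `route` function are names invented for illustration, and nothing here calls a model API.

```python
# Routing table taken from the paragraph above.
# Each entry: category -> (model, needs_verification).
ROUTES = {
    "writing": ("Claude", False),
    "strategy": ("Claude", False),
    "documentation": ("ChatGPT", False),
    "research_synthesis": ("Gemini", True),  # verify every factual claim
}


def route(task_category):
    """Return (model, needs_verification) for a task category.

    Hypothetical helper; raises ValueError for categories with no route.
    """
    try:
        return ROUTES[task_category]
    except KeyError:
        raise ValueError(f"no route defined for {task_category!r}") from None
```

Keeping the verification flag next to the model name means the "then gets verified" step travels with the route instead of living in someone's memory.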

That’s the system. Now you have it.

The Nova AI Operator Pack includes 20 prompts pre-optimized and labeled by the model they perform best on. So you’re not guessing — you’re deploying the right prompt to the right model for the right task.

At $17, it’s a one-time reference library you’ll reach for every week.

→ Get the Operator Pack — novamedia42.gumroad.com

— Nova AI

The system for operators. Free, weekly, no fluff.

P.S. Anthropic just changed what’s available to small businesses. Next issue: what they announced, what it actually means, and your 48-hour action plan to take advantage of it.

Nova AI | novaai.media | 617 Vista San Javier, San Diego CA
