Case Study

How We Built a Blog Generator That Outscored Claude Opus 4.6 and Three of the Most-Read Writers on the Internet

We built an open scoring system for blog content across eight categories businesses actually care about. Our blog generator averaged 81.8 out of 100 across four consecutive live posts. The best AI model in the world scored 65 on the same test. Three of the most-read human writers on the internet topped out at 60.

81.8: average across 4 consecutive production posts
83: top score (Grade A, beats raw Claude Opus 4.6 by 18)
+21.8: points ahead of the best human post (a16z Big Ideas 2025)

Why We Built This

Every AI content tool sold today claims its output is indistinguishable from a human's. Almost none of them show any evidence. The reason is simple: there has never been an agreed way to measure whether a blog post is good.

For code, we have HumanEval. For reasoning, we have MMLU. For question answering, we have HotpotQA. For written marketing content, the industry runs on vibes and screenshots. Which means a business owner paying an agency or AI tool for blog posts has no way of knowing whether what they're getting is any good, other than reading it and hoping.

That "hoping" is expensive. A typical outcome is six months of blog posts that nobody reads, that Google doesn't rank, that ChatGPT doesn't cite, and that end up quietly deprioritised on the marketing roadmap. Then the content budget gets cut and the business concludes AI content "doesn't work."

It does work, but only if the content is actually good. So before we could build a blog generator we were willing to put in front of a client, we had to build something that would tell us, honestly, whether a post was good. This page is what that looks like.

The Benchmark

The benchmark is an automated scoring system. Feed it a blog post, it runs over 40 checks, and it returns a score out of 100 across eight categories. Every category is derived from something measurable in the text or its structure. No AI judges involved (we tested that and rejected it, because AI models tend to rate AI-written content generously, even when a human would obviously not).

Each category answers a specific question a business should be asking about its content.

Readability (10%): Will a reader actually get through it? Short paragraphs, varied sentence length, no jargon walls, appropriate reading level for the audience.
SEO (15%): Is the post built to rank in Google? Correct header structure, internal and external links, a meta description in the ideal length band, tagged topics, an FAQ where useful.
AI Evasion (20%): Does the post read like a human wrote it? We check for the statistical fingerprints that give AI-written content away, based on published detection research.
GEO (13%): Will ChatGPT, Perplexity, and Google AIO cite this post when users ask related questions? Structured signals those engines use to decide which sources to pull.
Engagement (15%): Will a reader stick with it? Strong opening, scannable layout, real data points, a clear next step at the end.
E-E-A-T (15%): Google's own quality framework. Is there a named author with a real bio, are the claims backed by citations, does the post read like an expert wrote it?
Content Depth (7%): Is there real substance, or could this have been written about any topic? Measures specificity, unique entities, and data density.
Multimedia (5%): Is there more than a wall of text? Images with alt text, tables, lists, embedded video, structural variety.

Grade bands: 90 and above is A+, 80 to 89 is A, 70 to 79 is B+, 60 to 69 is B, 50 to 59 is C, 40 to 49 is D. A perfect 100 is not reachable. The scoring is calibrated so that even excellent human writing tops out around 85.
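The weighting and grade bands above can be sketched as a few lines of code. This is an illustrative reconstruction, not the benchmark itself: the category weights and band cut-offs come from this page, but the per-category scoring internals are not published, so the inputs here are assumptions.

```python
# Category weights as published on this page (they sum to 1.0).
WEIGHTS = {
    "readability": 0.10,
    "seo": 0.15,
    "ai_evasion": 0.20,
    "geo": 0.13,
    "engagement": 0.15,
    "eeat": 0.15,
    "content_depth": 0.07,
    "multimedia": 0.05,
}

# (floor, grade) pairs, checked from the top band down.
GRADE_BANDS = [(90, "A+"), (80, "A"), (70, "B+"), (60, "B"), (50, "C"), (40, "D")]

def composite(scores: dict) -> float:
    """Weighted sum of the eight category scores (each 0-100)."""
    return round(sum(scores[c] * w for c, w in WEIGHTS.items()), 1)

def grade(total: float) -> str:
    """Map a composite score to its grade band."""
    for floor, band in GRADE_BANDS:
        if total >= floor:
            return band
    return "F"
```

A post scoring 80 in every category lands at exactly 80.0, the bottom of the A band, which is the internal threshold discussed later on this page.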

The Contenders

We scored nine blog posts using identical benchmark settings, on the same day.

Our blog generator, 4 consecutive production posts

Four consecutive live blog posts from bespokeworks.ai/blog. We picked the four most recent posts at time of benchmarking, fetched them from the live site, and scored them exactly as they ship to readers. Nothing was reselected, rerun, or touched up before scoring.

Claude Opus 4.6 and GPT-4o (raw API calls)

Claude Opus 4.6 is the strongest general-purpose AI model on the market in 2026. GPT-4o is the low-cost default model that most AI content tools quietly run under the hood. We asked each of them, via a single API call, to write a 2,000-word blog post on the same topic. This is the ceiling of what you get from a raw LLM without a pipeline around it.

Three long-form human blogs

Three of the most-read long-form business and tech writers on the internet, each post fetched from their live site and run through the same benchmark: a16z's Big Ideas 2025, Ben Thompson's Stratechery, and Paul Graham's How to Do Great Work.

These writers have, collectively, decades of practice and tens of millions of readers. To make the comparison fair, we gave each post the same structured metadata we gave our own output (a title, a meta description, topic tags). Nobody was disadvantaged by missing frontmatter.

The Composite Scores

Nine blog posts. One automated scorer. No re-runs, no cherry picking.

Ours: Admin Costs                    83
Ours: AI Agent Investment            82
Ours: Building Your Own AI Platform  82
Ours: AI Agent Governance            80
Claude Opus 4.6 (raw)                65
a16z, Big Ideas 2025                 60
Stratechery (Ben Thompson)           58
GPT-4o (raw)                         55
Paul Graham, How to Do Great Work    55

Every one of our four production posts outscored every other contender. Our lowest scoring post of the four (80) beat the highest scoring outside post (Opus 4.6 at 65) by 15 points. Our top post (83) beat Paul Graham by 28, a raw GPT-4o call by 28, and a raw Claude Opus call by 18.

Every Category, Every Contender

The composite hides the detail. Here is every category, for every contender, with our four production posts in the top rows.

Post                      Words   Read  SEO  AI Ev.  GEO  Engage  E-E-A-T  Depth  MM  Total
Ours: Admin Costs         3,414     95   90      76   79      85       84     84  62     83
Ours: AI Investment       3,262     94   80      75   84      85       84     92  62     82
Ours: Own AI Platform     3,334     89   86      75   85      83       83     84  62     82
Ours: Agent Governance    3,492     89   80      67   79      91       84     84  62     80
Claude Opus 4.6 (raw)     1,906     82   53      59   81      74       42     84  30     65
a16z, Big Ideas 2025     12,996     64   58      59   76      59       47     78  38     60
Stratechery, Thompson     4,239     53   55      60   58      66       56     72  30     58
GPT-4o (raw)              1,176     63   46      50   63      60       31     84  38     55
Paul Graham, Great Work  11,410     88   55      61   46      63       36     42  30     55

Where the Pipeline Wins, and Why

Our pipeline is not a better writer than Paul Graham. He would probably write a better essay than anything we'll ever produce. The advantage shows up in the categories essayists and raw AI calls have no reason to optimise for, because they're writing for a different job.

Readability: our average 91 vs Stratechery 53, Opus 82

Readability is less about style and more about whether a normal reader can actually get through a post without giving up. Our pipeline rewrites in multiple passes, checking sentence length variation, paragraph density, and reading level, until the scoring system is satisfied. Ben Thompson's piece scored 53 because it is written in long, academic paragraphs for an audience that has subscribed to read exactly that. That works for him. It does not work for a small business trying to reach new customers.
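The readability signals described above are all mechanically measurable. Here is a minimal sketch of that kind of check, assuming naive sentence and paragraph splitting; the real benchmark's splitting rules and thresholds are not published, so this is illustrative only.

```python
import re
import statistics

def readability_signals(text: str) -> dict:
    """Compute simple readability signals: sentence rhythm and paragraph density."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    # Paragraphs separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    lengths = [len(s.split()) for s in sentences]
    return {
        "avg_sentence_words": statistics.mean(lengths) if lengths else 0,
        # Higher spread = more varied rhythm, one fingerprint of human prose.
        "sentence_length_stdev": statistics.pstdev(lengths) if lengths else 0,
        "avg_paragraph_sentences": len(sentences) / max(len(paragraphs), 1),
    }
```

A rewrite pass in the spirit of ours would loop: measure these signals, edit, and measure again until sentence variation is high enough and paragraphs are short enough.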

SEO: our average 84 vs Opus 53, humans 55 to 58

SEO in this benchmark is not about keyword stuffing. It is about whether the structural ingredients Google needs are actually present: the right number of headers for the post length, a meta description in the correct 140 to 160 character band, internal links to other pages, external citations, and topic tags. A raw Opus call scores 53 here not because it writes badly, but because it has no concept of how many H2 headers a 3,400 word post should have. Our pipeline does, because the scoring system told us.
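Two of the structural checks above are easy to make concrete. The 140-to-160-character meta description band comes from this page; the one-H2-per-300-words ratio and the tolerance are assumptions chosen for illustration, not the benchmark's actual numbers.

```python
def meta_description_ok(meta: str) -> bool:
    """Meta description should sit in the 140-160 character band."""
    return 140 <= len(meta) <= 160

def h2_density_ok(word_count: int, h2_count: int,
                  words_per_h2: int = 300, tolerance: float = 0.5) -> bool:
    """Check that the H2 count is roughly proportional to post length.

    Assumes (hypothetically) one H2 per ~300 words, within a +/-50% band.
    """
    expected = word_count / words_per_h2
    return abs(h2_count - expected) <= expected * tolerance
```

Under these assumed numbers, a 3,400-word post with eleven H2 headers passes, while the same post with two headers fails, which is exactly the kind of structural miss a raw model call makes.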

GEO: our average 82 vs Paul Graham 46, Stratechery 58

GEO, or Generative Engine Optimisation, is the newer version of SEO for the AI search era. ChatGPT, Perplexity, and Google's AI Overviews pull from certain kinds of content more often than others: entity-first paragraphs, self-contained sections, cited statistics, definition patterns, and recent date signals. Paul Graham scores 46 not because his essays are bad, but because they are written as flowing arguments for humans, not as structured snippets for AI engines. In 2026, ranking in Google is only half the battle. The other half is being the source ChatGPT quotes.


E-E-A-T: our average 84 vs Opus 42, humans 36 to 56

E-E-A-T is Google's own quality framework, standing for Experience, Expertise, Authoritativeness, Trustworthiness. It scores things like whether there is a named author, whether a real bio is attached, whether claims are backed by citations, and whether the writer hedges appropriately (real experts do; AI confidently does not). A raw Opus call has no author to attach to it. Paul Graham does, but famously does not cite sources. Every post our pipeline ships has a real author, a bio, external citations, and honest hedging where the evidence is genuinely uncertain.
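Signals like these can be checked deterministically. The sketch below is a hypothetical version of such a check: the field names on the post object and the hedge-word list are our own assumptions for illustration, not the benchmark's actual vocabulary or schema.

```python
import re

# Assumed hedge vocabulary; the benchmark's real word list is not published.
HEDGES = {"probably", "likely", "suggests", "roughly", "may", "might", "appears"}

def eeat_signals(post: dict) -> dict:
    """Boolean E-E-A-T signals: author, bio, citations, and hedging language."""
    words = re.findall(r"[a-z']+", post.get("body", "").lower())
    return {
        "has_author": bool(post.get("author")),
        "has_bio": bool(post.get("author_bio")),
        "has_citations": post.get("external_links", 0) >= 2,
        # Real experts hedge; a couple of hedge words is a weak positive signal.
        "hedges_appropriately": sum(w in HEDGES for w in words) >= 2,
    }
```

A raw model call fails the first two checks by construction: there is no author object to attach.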

What This Means For a Content Programme

A benchmark number is only interesting if it changes something in the real world. The 15-to-28-point gap between our posts and the raw AI calls shows up in four specific places for the businesses we work with.

Posts that can actually rank
An SEO score of 84 means every ingredient Google looks for is present: header structure, links, meta description, tags. Content that is merely "well written" but structurally wrong will sit at the bottom of search results forever. Our output is built to climb those results over time, which is how content marketing actually pays back.
Cited by ChatGPT and AI search
A growing share of searches now happen inside ChatGPT, Perplexity, and Google's AI Overviews. Those engines pick sources based on structure, not just quality. A GEO score of 82 means our posts are in the format those engines prefer to cite. For a business whose customers are increasingly asking AI first, this is the difference between being mentioned and being invisible.
Trust signals baked in
Every post we ship has a named author with a real bio, genuine external citations, and language that hedges where the evidence is uncertain. An E-E-A-T score of 84 means the post looks like an expert wrote it, not a bot. That matters both to Google's ranking and to the human reader deciding whether to trust what they're reading.
Measured before it reaches you
Every post generated by our pipeline runs through this benchmark before it is delivered. Posts that score below threshold go back through the pipeline rather than out to a client. You do not receive the weak runs. You receive the posts that actually pass the bar we set internally, which is the same bar shown on this page.

Why Even the Best AI Model Only Reaches 65

Claude Opus 4.6 is a remarkable model. On pure writing quality, a single Opus call is close to the best you can get from any AI today. But a production blog post in 2026 is not only a piece of writing. It is a structured content object that needs to carry author credentials, properly formatted headers, external citations, a meta description, topic tags, an FAQ where relevant, and internal links to other pages on the site.

None of that is a writing problem. It is a workflow problem. A single API call cannot produce it, however capable the underlying model is. Our pipeline runs a sequence of separate passes, each one responsible for a different part of the post: research, structural blueprint, first draft, structural edit, voice and rhythm edit, SEO and GEO pass, image generation, FAQ generation, internal linking, final scoring. The output is not a better model. It is a better assembly line.
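The assembly-line structure described above, plus the score gate mentioned earlier ("posts that score below threshold go back through the pipeline"), can be sketched in a few lines. The pass names come from this page; the function signature, retry count, and control flow are our own illustrative assumptions, not the production code.

```python
from typing import Callable

# A pass takes the current draft and returns an improved draft.
Pass = Callable[[str], str]

def run_pipeline(topic: str, passes: list,
                 score: Callable[[str], float],
                 threshold: float = 80, max_attempts: int = 3) -> str:
    """Run named passes in sequence; only ship drafts that clear the bar."""
    draft = topic
    for _ in range(max_attempts):
        for _name, fn in passes:  # e.g. research, blueprint, draft, edits, SEO
            draft = fn(draft)
        if score(draft) >= threshold:
            return draft  # passes the internal bar, deliver it
    raise RuntimeError("post never cleared the internal threshold")
```

The point of the structure is that no single pass, and no single model call, is responsible for the whole object; the score gate decides what ships.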

This is why our lowest scoring post still outperformed Opus by 15 points, even though Opus is writing the underlying sentences in some of those passes. The categories that drag Opus down (SEO, E-E-A-T, multimedia) are the categories our pipeline handles before and after the model is involved.

Honest Limits

Every number on this page came from a single run, on one day, across nine specific posts. We are not claiming the pipeline wins on every topic and every prompt. We are claiming that on this specific transparent test, the four production posts we picked all cleared 80, and the seven alternatives we benchmarked did not reach 66.

We will rerun this at every pipeline version bump and publish the updated numbers, including any posts that score badly. The benchmark itself is deterministic, so anyone with a post and the category definitions on this page can approximate the same test. The scoring rules are standard (reading ease, sentence variety, header density, citation counts, entity density, and so on). What we have not published is the pipeline behind it, because that is the part our clients pay for.

Content built to rank, get cited, and pass a bar

If you're running a content programme and want the scoring system shown on this page running over every post before it ships, talk to us. We can walk through what a programme looks like for your business, what topics we would target, and what the first month would produce.
