We built an open scoring system for blog content across eight things businesses actually care about. Our blog generator averaged 81.8 out of 100 across four consecutive live posts. The best AI model in the world scored 65 on the same test. Three of the most-read human writers on the internet topped out at 60.
Every AI content tool sold today claims its output is indistinguishable from a human's. Almost none of them show any evidence. The reason is simple: there has never been an agreed way to measure whether a blog post is good.
For code, we have HumanEval. For reasoning, we have MMLU. For question answering, we have HotpotQA. For written marketing content, the industry runs on vibes and screenshots. Which means a business owner paying an agency or AI tool for blog posts has no way of knowing whether what they're getting is any good, other than reading it and hoping.
That "hoping" is expensive. A typical outcome is six months of blog posts that nobody reads, that Google doesn't rank, that ChatGPT doesn't cite, and that end up quietly deprioritised on the marketing roadmap. Then the content budget gets cut and the business concludes AI content "doesn't work."
It does work, but only if the content is actually good. So before we could build a blog generator we were willing to put in front of a client, we had to build something that would tell us, honestly, whether a post was good. This page is what that looks like.
The benchmark is an automated scoring system. Feed it a blog post and it runs more than 40 checks, returning a score out of 100 across eight categories. Every category is derived from something measurable in the text or its structure. No AI judges are involved (we tested that and rejected it, because AI models tend to rate AI-written content generously, even when a human reviewer clearly would not).
Each category answers a specific question a business should be asking about its content.
Grade bands: 90 and above is A+, 80 to 89 is A, 70 to 79 is B+, 60 to 69 is B, 50 to 59 is C, and 40 to 49 is D. A perfect 100 is not reachable. The scoring is calibrated so that even excellent human writing tops out around 85.
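For illustration, the band mapping itself is just a threshold lookup. Here is a minimal sketch in Python; the function name and the catch-all band below 40 are our own additions for the example, not part of the published rules:

```python
def grade(score: float) -> str:
    """Map a 0-100 benchmark score to the grade bands described above."""
    bands = [(90, "A+"), (80, "A"), (70, "B+"), (60, "B"), (50, "C"), (40, "D")]
    for threshold, letter in bands:
        if score >= threshold:
            return letter
    return "F"  # assumption: the page does not name a band below 40

print(grade(81.8))  # -> "A", the average across our four production posts
```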
We scored nine blog posts using identical benchmark settings, on the same day.
Four consecutive live blog posts from bespokeworks.ai/blog. We picked the four most recent posts at time of benchmarking, fetched them from the live site, and scored them exactly as they ship to readers. Nothing was reselected, rerun, or touched up before scoring.
Claude Opus 4.6 is the strongest general-purpose AI model on the market in 2026. GPT-4o is the cheap default model that most AI content tools quietly run under the hood. We asked each of them, via a single API call, to write a 2,000-word blog post on the same topic. This is the ceiling of what you get from a raw LLM without a pipeline around it.
Three of the most-read long-form business and tech writers on the internet. Each post was fetched from their live site and run through the same benchmark; all three appear in the results table below.
These writers have, collectively, decades of practice and tens of millions of readers. To make the comparison fair, we gave each post the same structured metadata we gave our own output (a title, a meta description, topic tags). Nobody was disadvantaged by missing frontmatter.
Nine blog posts. One automated scorer. No re-runs, no cherry-picking.
Every one of our four production posts outscored every other contender. Our lowest-scoring post of the four (80) beat the highest-scoring outside post (Opus 4.6 at 65) by 15 points. Our top post (83) beat Paul Graham by 28, a raw GPT-4o call by 28, and a raw Claude Opus call by 18.
The composite hides the detail. Here is every category, for every contender. Our production posts are the four rows marked 'Ours'.
| Post | Words | Readability | SEO | AI Evasion | GEO | Engagement | E-E-A-T | Depth | Multimedia | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Ours: Admin Costs | 3,414 | 95 | 90 | 76 | 79 | 85 | 84 | 84 | 62 | 83 |
| Ours: AI Investment | 3,262 | 94 | 80 | 75 | 84 | 85 | 84 | 92 | 62 | 82 |
| Ours: Own AI Platform | 3,334 | 89 | 86 | 75 | 85 | 83 | 83 | 84 | 62 | 82 |
| Ours: Agent Governance | 3,492 | 89 | 80 | 67 | 79 | 91 | 84 | 84 | 62 | 80 |
| Claude Opus 4.6 (raw) | 1,906 | 82 | 53 | 59 | 81 | 74 | 42 | 84 | 30 | 65 |
| a16z, Big Ideas 2025 | 12,996 | 64 | 58 | 59 | 76 | 59 | 47 | 78 | 38 | 60 |
| Stratechery, Thompson | 4,239 | 53 | 55 | 60 | 58 | 66 | 56 | 72 | 30 | 58 |
| GPT-4o (raw) | 1,176 | 63 | 46 | 50 | 63 | 60 | 31 | 84 | 38 | 55 |
| Paul Graham, Great Work | 11,410 | 88 | 55 | 61 | 46 | 63 | 36 | 42 | 30 | 55 |
Our pipeline is not a better writer than Paul Graham. He would probably write a better essay than anything we'll ever produce. The advantage shows up in the categories essayists and raw AI calls have no reason to optimise for, because they're writing for a different job.
Readability is less about style and more about whether a normal reader can actually get through a post without giving up. Our pipeline rewrites in multiple passes, checking sentence length variation, paragraph density, and reading level, until the scoring system is satisfied. Ben Thompson's piece scored 53 because it is written in long, academic paragraphs for an audience that has subscribed to read exactly that. That works for him. It does not work for a small business trying to reach new customers.
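To make that concrete, here is a rough sketch of the kind of signals involved. The formulas, the naive syllable counter, and the signal names are illustrative stand-ins, not the benchmark's actual checks:

```python
import re
from statistics import mean, pstdev

def readability_signals(text: str) -> dict:
    """Crude versions of the readability signals named above: sentence length
    variation, paragraph density, and reading level."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()

    def syllables(word: str) -> int:
        # Naive stand-in: count vowel groups, minimum one per word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    sentence_lengths = [len(s.split()) for s in sentences]
    return {
        "avg_sentence_length": mean(sentence_lengths),
        "sentence_length_variation": pstdev(sentence_lengths),  # higher = more rhythm
        "avg_words_per_paragraph": len(words) / len(paragraphs),
        # Flesch reading ease: higher scores mean easier reading.
        "flesch_reading_ease": 206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (sum(syllables(w) for w in words) / len(words)),
    }
```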
SEO in this benchmark is not about keyword stuffing. It is about whether the structural ingredients Google needs are actually present: the right number of headers for the post length, a meta description in the correct 140 to 160 character band, internal links to other pages, external citations, and topic tags. A raw Opus call scores 53 here not because it writes badly, but because it has no concept of how many H2 headers a 3,400 word post should have. Our pipeline does, because the scoring system told us.
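A simplified version of those structural checks might look like the sketch below. The one-H2-per-300-words target and the `site_domain` parameter are assumptions for illustration; the 140 to 160 character meta description band is the one described above:

```python
import re

def seo_checks(markdown: str, meta_description: str, tags: list[str],
               site_domain: str = "example.com") -> dict:
    """Structural SEO checks in the spirit of the category described above."""
    words = len(markdown.split())
    h2_count = len(re.findall(r"^## ", markdown, flags=re.MULTILINE))
    links = re.findall(r"\]\((https?://[^)]+)\)", markdown)
    internal = [u for u in links if site_domain in u]
    external = [u for u in links if site_domain not in u]
    return {
        "meta_description_in_band": 140 <= len(meta_description) <= 160,
        "enough_headers": h2_count >= words // 300,  # assumption: ~1 H2 per 300 words
        "has_internal_links": len(internal) > 0,
        "has_external_citations": len(external) > 0,
        "has_topic_tags": len(tags) > 0,
    }
```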
GEO, or Generative Engine Optimisation, is the newer version of SEO for the AI search era. ChatGPT, Perplexity, and Google's AI Overviews pull from certain kinds of content more often than others: entity-first paragraphs, self-contained sections, cited statistics, definition patterns, and recent date signals. Paul Graham scores 46 not because his essays are bad, but because they are written as flowing arguments for humans, not as structured snippets for AI engines. In 2026, ranking in Google is only half the battle. The other half is being the source ChatGPT quotes.
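The sketch below shows rough proxies for those signals. The regexes, the filler-word list, and the "recent" cutoff year are illustrative assumptions rather than the benchmark's real rules:

```python
import re

def geo_signals(markdown: str) -> dict:
    """Rough proxies for the GEO signals named above."""
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    return {
        # Statistics with a nearby citation-style link, e.g. "42% of firms ... [source](...)"
        "cited_statistics": len(re.findall(r"\d+(\.\d+)?%[^.\n]*\]\(", markdown)),
        # Definition patterns such as "X is a ..." or "X refers to ..."
        "definition_patterns": len(re.findall(r"\b(is a|refers to|means)\b", markdown)),
        # Recent date signals: years from 2024 onwards (assumed cutoff)
        "recent_date_signals": len(re.findall(r"\b20(2[4-9]|3\d)\b", markdown)),
        # Entity-first paragraphs: opening word capitalised and not a pronoun or filler
        "entity_first_paragraphs": sum(
            1 for p in paragraphs
            if p.split()[0][0].isupper()
            and p.split()[0].lower() not in {"it", "this", "that", "there", "and", "but"}
        ),
    }
```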
E-E-A-T is Google's own quality framework, standing for Experience, Expertise, Authoritativeness, Trustworthiness. It scores things like whether there is a named author, whether a real bio is attached, whether claims are backed by citations, and whether the writer hedges appropriately (real experts do; AI confidently does not). A raw Opus call has no author to attach to it. Paul Graham does, but famously does not cite sources. Every post our pipeline ships has a real author, a bio, external citations, and honest hedging where the evidence is genuinely uncertain.
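As a sketch, checks of this kind reduce to presence tests over post metadata plus a scan for hedging language. The `post` dict shape and the hedging word list here are assumptions for illustration, not the benchmark's real inputs:

```python
def eeat_checks(post: dict) -> dict:
    """E-E-A-T style checks: named author, bio, citations, and honest hedging."""
    hedges = {"likely", "probably", "roughly", "in our experience",
              "it depends", "may", "might"}
    body = post.get("body", "").lower()
    return {
        "has_named_author": bool(post.get("author")),
        "has_author_bio": bool(post.get("author_bio")),
        "has_citations": post.get("external_citation_count", 0) > 0,
        "hedging_present": any(h in body for h in hedges),
    }
```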
A benchmark number is only interesting if it changes something in the real world. The 20-point gap between our pipeline and a raw AI call shows up in four specific places for the businesses we work with.
Claude Opus 4.6 is a remarkable model. On pure writing quality, a single Opus call is close to the best you can get from any AI today. But a production blog post in 2026 is not only a piece of writing. It is a structured content object that needs to carry author credentials, properly formatted headers, external citations, a meta description, topic tags, an FAQ where relevant, and internal links to other pages on the site.
None of that is a writing problem. It is a workflow problem. A single API call cannot produce it, however capable the underlying model is. Our pipeline runs a sequence of separate passes, each one responsible for a different part of the post: research, structural blueprint, first draft, structural edit, voice and rhythm edit, SEO and GEO pass, image generation, FAQ generation, internal linking, final scoring. The output is not a better model. It is a better assembly line.
This is why our lowest-scoring post still outperformed Opus by 15 points, even though Opus is writing the underlying sentences in some of those passes. The categories that drag Opus down (SEO, E-E-A-T, multimedia) are the categories our pipeline handles before and after the model is involved.
Every number on this page came from a single run, on one day, across nine specific posts. We are not claiming the pipeline wins on every topic and every prompt. We are claiming that on this specific transparent test, the four production posts we picked all cleared 80, and the seven alternatives we benchmarked did not reach 66.
We will rerun this at every pipeline version bump and publish the updated numbers, including any posts that score badly. The benchmark itself is deterministic, so anyone with a post and the category definitions on this page can approximate the same test. The scoring rules are standard (reading ease, sentence variety, header density, citation counts, entity density, and so on). What we have not published is the pipeline behind it, because that is the part our clients pay for.
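To show what "deterministic" means in practice, here is an illustrative composite: eight category scores combined into a single total with fixed weights. The equal weighting is an assumption for the example; the benchmark's actual weights are not listed on this page:

```python
# Eight category scores (0-100 each) combined into the 0-100 total.
CATEGORIES = ["readability", "seo", "ai_evasion", "geo",
              "engagement", "eeat", "depth", "multimedia"]

def composite(scores: dict[str, float]) -> float:
    """Deterministic weighted average of per-category scores."""
    weights = {c: 1 / len(CATEGORIES) for c in CATEGORIES}  # assumption: equal weights
    return round(sum(scores[c] * weights[c] for c in CATEGORIES), 1)
```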
If you're running a content programme and want the scoring system shown on this page running over every post before it ships, talk to us. We can walk through what a programme looks like for your business, what topics we would target, and what the first month would produce.