We built an open scoring system for blog content across eight things businesses actually care about. Our blog generator averaged 81.8 out of 100 across four consecutive live posts. The best AI model in the world scored 65 on the same test. Three of the most-read human writers on the internet topped out at 60.
Every AI content tool sold today claims its output is indistinguishable from a human's. Almost none of them show any evidence. The reason is simple: there has never been an agreed way to measure whether a blog post is good.
For code, we have HumanEval. For reasoning, we have MMLU. For question answering, we have HotpotQA. For written marketing content, the industry runs on vibes and screenshots. Which means a business owner paying an agency or AI tool for blog posts has no way of knowing whether what they're getting is any good, other than reading it and hoping.
That "hoping" is expensive. A typical outcome is six months of blog posts that nobody reads, that Google doesn't rank, that ChatGPT doesn't cite, and that end up quietly deprioritised on the marketing roadmap. Then the content budget gets cut and the business concludes AI content "doesn't work."
It does work, but only if the content is actually good. So before we could build a blog generator we were willing to put in front of a client, we had to build something that would tell us, honestly, whether a post was good. This page is what that looks like.
The benchmark is an automated scoring system. Feed it a blog post and it runs more than 40 checks, returning a score out of 100 across eight categories. Every category is derived from something measurable in the text or its structure. No AI judges are involved (we tested that and rejected it, because AI models tend to rate AI-written content generously, even when a human reviewer clearly would not).
Each category answers a specific question a business should be asking about its content.
Grade bands: 90 and above is A+, 80 to 89 is A, 70 to 79 is B+, 60 to 69 is B, 50 to 59 is C, and 40 to 49 is D. A perfect 100 is not reachable. The scoring is calibrated so that even excellent human writing tops out around 85.
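For illustration, the band mapping itself is just a threshold lookup. Here is a minimal sketch in Python; the function name and the catch-all band below 40 are our own additions for the example, not part of the published rules:

```python
def grade(score: float) -> str:
    """Map a 0-100 benchmark score to the grade bands described above."""
    bands = [(90, "A+"), (80, "A"), (70, "B+"), (60, "B"), (50, "C"), (40, "D")]
    for threshold, letter in bands:
        if score >= threshold:
            return letter
    return "F"  # assumption: the page does not name a band below 40

print(grade(81.8))  # -> "A", the average across our four production posts
```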
We scored nine blog posts using identical benchmark settings, on the same day.
Four consecutive live blog posts from bespokeworks.ai/blog. We picked the four most recent posts at time of benchmarking, fetched them from the live site, and scored them exactly as they ship to readers. Nothing was reselected, rerun, or touched up before scoring.
Claude Opus 4.6 is the strongest general-purpose AI model on the market in 2026. GPT-4o is the cheap default model that most AI content tools quietly run under the hood. We asked each of them, via a single API call, to write a 2,000-word blog post on the same topic. This is the ceiling of what you get from a raw LLM without a pipeline around it.
Three of the most-read long-form business and tech writers on the internet. Each post was fetched from their live site and run through the same benchmark; all three appear in the results table below.
These writers have, collectively, decades of practice and tens of millions of readers. To make the comparison fair, we gave each post the same structured metadata we gave our own output (a title, a meta description, topic tags). Nobody was disadvantaged by missing frontmatter.
Nine blog posts. One automated scorer. No re-runs, no cherry-picking.
Every one of our four production posts outscored every other contender. Our lowest-scoring post of the four (80) beat the highest-scoring outside post (Opus 4.6 at 65) by 15 points. Our top post (83) beat Paul Graham by 28, a raw GPT-4o call by 28, and a raw Claude Opus call by 18.
The composite hides the detail. Here is every category, for every contender. Our production posts are the four rows marked 'Ours'.
| Post | Words | Readability | SEO | AI Evasion | GEO | Engagement | E-E-A-T | Depth | Multimedia | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Ours: Admin Costs | 3,414 | 95 | 90 | 76 | 79 | 85 | 84 | 84 | 62 | 83 |
| Ours: AI Investment | 3,262 | 94 | 80 | 75 | 84 | 85 | 84 | 92 | 62 | 82 |
| Ours: Own AI Platform | 3,334 | 89 | 86 | 75 | 85 | 83 | 83 | 84 | 62 | 82 |
| Ours: Agent Governance | 3,492 | 89 | 80 | 67 | 79 | 91 | 84 | 84 | 62 | 80 |
| Claude Opus 4.6 (raw) | 1,906 | 82 | 53 | 59 | 81 | 74 | 42 | 84 | 30 | 65 |
| a16z, Big Ideas 2025 | 12,996 | 64 | 58 | 59 | 76 | 59 | 47 | 78 | 38 | 60 |
| Stratechery, Thompson | 4,239 | 53 | 55 | 60 | 58 | 66 | 56 | 72 | 30 | 58 |
| GPT-4o (raw) | 1,176 | 63 | 46 | 50 | 63 | 60 | 31 | 84 | 38 | 55 |
| Paul Graham, Great Work | 11,410 | 88 | 55 | 61 | 46 | 63 | 36 | 42 | 30 | 55 |
Our pipeline is not a better writer than Paul Graham. He would probably write a better essay than anything we'll ever produce. The advantage shows up in the categories essayists and raw AI calls have no reason to optimise for, because they're writing for a different job.
Readability is less about style and more about whether a normal reader can actually get through a post without giving up. Our pipeline rewrites in multiple passes, checking sentence length variation, paragraph density, and reading level, until the scoring system is satisfied. Ben Thompson's piece scored 53 because it is written in long, academic paragraphs for an audience that has subscribed to read exactly that. That works for him. It does not work for a small business trying to reach new customers.
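To make that concrete, here is a rough sketch of the kind of signals involved. The formulas, the naive syllable counter, and the signal names are illustrative stand-ins, not the benchmark's actual checks:

```python
import re
from statistics import mean, pstdev

def readability_signals(text: str) -> dict:
    """Crude versions of the readability signals named above: sentence length
    variation, paragraph density, and reading level."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()

    def syllables(word: str) -> int:
        # Naive stand-in: count vowel groups, minimum one per word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    sentence_lengths = [len(s.split()) for s in sentences]
    return {
        "avg_sentence_length": mean(sentence_lengths),
        "sentence_length_variation": pstdev(sentence_lengths),  # higher = more rhythm
        "avg_words_per_paragraph": len(words) / len(paragraphs),
        # Flesch reading ease: higher scores mean easier reading.
        "flesch_reading_ease": 206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (sum(syllables(w) for w in words) / len(words)),
    }
```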
SEO in this benchmark is not about keyword stuffing. It is about whether the structural ingredients Google needs are actually present: the right number of headers for the post length, a meta description in the correct 140 to 160 character band, internal links to other pages, external citations, and topic tags. A raw Opus call scores 53 here not because it writes badly, but because it has no concept of how many H2 headers a 3,400 word post should have. Our pipeline does, because the scoring system told us.
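A simplified version of those structural checks might look like the sketch below. The one-H2-per-300-words target and the `site_domain` parameter are assumptions for illustration; the 140 to 160 character meta description band is the one described above:

```python
import re

def seo_checks(markdown: str, meta_description: str, tags: list[str],
               site_domain: str = "example.com") -> dict:
    """Structural SEO checks in the spirit of the category described above."""
    words = len(markdown.split())
    h2_count = len(re.findall(r"^## ", markdown, flags=re.MULTILINE))
    links = re.findall(r"\]\((https?://[^)]+)\)", markdown)
    internal = [u for u in links if site_domain in u]
    external = [u for u in links if site_domain not in u]
    return {
        "meta_description_in_band": 140 <= len(meta_description) <= 160,
        "enough_headers": h2_count >= words // 300,  # assumption: ~1 H2 per 300 words
        "has_internal_links": len(internal) > 0,
        "has_external_citations": len(external) > 0,
        "has_topic_tags": len(tags) > 0,
    }
```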
GEO, or Generative Engine Optimisation, is the newer version of SEO for the AI search era. ChatGPT, Perplexity, and Google's AI Overviews pull from certain kinds of content more often than others: entity-first paragraphs, self-contained sections, cited statistics, definition patterns, and recent date signals. Paul Graham scores 46 not because his essays are bad, but because they are written as flowing arguments for humans, not as structured snippets for AI engines. In 2026, ranking in Google is only half the battle. The other half is being the source ChatGPT quotes.
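The sketch below shows rough proxies for those signals. The regexes, the filler-word list, and the "recent" cutoff year are illustrative assumptions rather than the benchmark's real rules:

```python
import re

def geo_signals(markdown: str) -> dict:
    """Rough proxies for the GEO signals named above."""
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    return {
        # Statistics with a nearby citation-style link, e.g. "42% of firms ... [source](...)"
        "cited_statistics": len(re.findall(r"\d+(\.\d+)?%[^.\n]*\]\(", markdown)),
        # Definition patterns such as "X is a ..." or "X refers to ..."
        "definition_patterns": len(re.findall(r"\b(is a|refers to|means)\b", markdown)),
        # Recent date signals: years from 2024 onwards (assumed cutoff)
        "recent_date_signals": len(re.findall(r"\b20(2[4-9]|3\d)\b", markdown)),
        # Entity-first paragraphs: opening word capitalised and not a pronoun or filler
        "entity_first_paragraphs": sum(
            1 for p in paragraphs
            if p.split()[0][0].isupper()
            and p.split()[0].lower() not in {"it", "this", "that", "there", "and", "but"}
        ),
    }
```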
E-E-A-T is Google's own quality framework, standing for Experience, Expertise, Authoritativeness, Trustworthiness. It scores things like whether there is a named author, whether a real bio is attached, whether claims are backed by citations, and whether the writer hedges appropriately (real experts do; AI confidently does not). A raw Opus call has no author to attach to it. Paul Graham does, but famously does not cite sources. Every post our pipeline ships has a real author, a bio, external citations, and honest hedging where the evidence is genuinely uncertain.
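As a sketch, checks of this kind reduce to presence tests over post metadata plus a scan for hedging language. The `post` dict shape and the hedging word list here are assumptions for illustration, not the benchmark's real inputs:

```python
def eeat_checks(post: dict) -> dict:
    """E-E-A-T style checks: named author, bio, citations, and honest hedging."""
    hedges = {"likely", "probably", "roughly", "in our experience",
              "it depends", "may", "might"}
    body = post.get("body", "").lower()
    return {
        "has_named_author": bool(post.get("author")),
        "has_author_bio": bool(post.get("author_bio")),
        "has_citations": post.get("external_citation_count", 0) > 0,
        "hedging_present": any(h in body for h in hedges),
    }
```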
A benchmark number is only interesting if it changes something in the real world. The 20-point gap between our pipeline and a raw AI call shows up in four specific places for the businesses we work with.
Claude Opus 4.6 is a remarkable model. On pure writing quality, a single Opus call is close to the best you can get from any AI today. But a production blog post in 2026 is not only a piece of writing. It is a structured content object that needs to carry author credentials, properly formatted headers, external citations, a meta description, topic tags, an FAQ where relevant, and internal links to other pages on the site.
None of that is a writing problem. It is a workflow problem. A single API call cannot produce it, however capable the underlying model is. Our pipeline runs a sequence of separate passes, each one responsible for a different part of the post: research, structural blueprint, first draft, structural edit, voice and rhythm edit, SEO and GEO pass, image generation, FAQ generation, internal linking, final scoring. The output is not a better model. It is a better assembly line.
This is why our lowest-scoring post still outperformed Opus by 15 points, even though Opus is writing the underlying sentences in some of those passes. The categories that drag Opus down (SEO, E-E-A-T, multimedia) are the categories our pipeline handles before and after the model is involved.
Every number on this page came from a single run, on one day, across nine specific posts. We are not claiming the pipeline wins on every topic and every prompt. We are claiming that on this specific transparent test, the four production posts we picked all cleared 80, and the seven alternatives we benchmarked did not reach 66.
We will rerun this at every pipeline version bump and publish the updated numbers, including any posts that score badly. The benchmark itself is deterministic, so anyone with a post and the category definitions on this page can approximate the same test. The scoring rules are standard (reading ease, sentence variety, header density, citation counts, entity density, and so on). What we have not published is the pipeline behind it, because that is the part our clients pay for.
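To show what "deterministic" means in practice, here is an illustrative composite: eight category scores combined into a single total with fixed weights. The equal weighting is an assumption for the example; the benchmark's actual weights are not listed on this page:

```python
# Eight category scores (0-100 each) combined into the 0-100 total.
CATEGORIES = ["readability", "seo", "ai_evasion", "geo",
              "engagement", "eeat", "depth", "multimedia"]

def composite(scores: dict[str, float]) -> float:
    """Deterministic weighted average of per-category scores."""
    weights = {c: 1 / len(CATEGORIES) for c in CATEGORIES}  # assumption: equal weights
    return round(sum(scores[c] * weights[c] for c in CATEGORIES), 1)
```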
If you're running a content programme and want the scoring system shown on this page running over every post before it ships, talk to us. We can walk through what a programme looks like for your business, what topics we would target, and what the first month would produce.