The Foundry · Case Study

A blog generator that outscored Opus 4.6 and the internet’s top writers

Four consecutive production posts averaged 81.8 on an open 8-category benchmark. The best AI model in the world managed 65. The most-read writers topped out at 60.

01

Why we built this

Every AI content tool sold today claims its output is indistinguishable from a human’s. Almost none of them show any evidence. The reason is simple: there has never been an agreed way to measure whether a blog post is good.

For code, we have HumanEval. For reasoning, we have MMLU. For question answering, we have HotpotQA. For written marketing content, the industry runs on vibes and screenshots. Which means a business owner paying an agency or AI tool for blog posts has no way of knowing whether what they’re getting is any good, other than reading it and hoping.

That “hoping” is expensive. A typical outcome is six months of blog posts that nobody reads, that Google doesn’t rank, that ChatGPT doesn’t cite, and that end up quietly deprioritised on the marketing roadmap. Then the content budget gets cut and the business concludes AI content “doesn’t work.”

It does work, but only if the content is actually good. So before we could build a blog generator we were willing to put in front of a client, we had to build something that would tell us, honestly, whether a post was good. This page is what that looks like.

02

The benchmark

The benchmark is an automated scoring system. Feed it a blog post, and it runs over 40 checks and returns a score out of 100 across eight categories. Every category is derived from something measurable in the text or its structure. No AI judges are involved. We tested that approach and rejected it, because AI models tend to rate AI-written content generously, even when a human reader obviously would not.

Each category answers a specific question a business should be asking about its content.

Readability · 10%

Will a reader actually get through it? Short paragraphs, varied sentence length, no jargon walls, appropriate reading level for the audience.

SEO · 15%

Is the post built to rank in Google? Correct header structure, internal and external links, a meta description in the ideal length band, tagged topics, an FAQ where useful.

AI Evasion · 20%

Does the post read like a human wrote it? We check for the statistical fingerprints that give AI-written content away, based on published detection research.

GEO · 13%

Will ChatGPT, Perplexity, and Google’s AI Overviews cite this post when users ask related questions? This category checks for the structured signals those engines use to decide which sources to pull.

Engagement · 15%

Will a reader stick with it? Strong opening, scannable layout, real data points, a clear next step at the end.

E-E-A-T · 15%

Google's own quality framework. Is there a named author with a real bio, are the claims backed by citations, does the post read like an expert wrote it?

Content Depth · 7%

Is there real substance, or could this have been written about any topic? Measures specificity, unique entities, and data density.

Multimedia · 5%

Is there more than a wall of text? Images with alt text, tables, lists, embedded video, structural variety.

Grade bands: 90 and above is A+, 80 to 89 is A, 70 to 79 is B+, 60 to 69 is B, 50 to 59 is C, 40 to 49 is D. A perfect 100 is not reachable in practice: the scoring is calibrated so that even excellent human writing tops out around 85.
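
The page gives the category weights but not the formula that combines them. As a minimal sketch, assuming the composite is a plain weighted average of the eight category scores and that each grade band starts at its listed number, the calculation might look like this in Python. The names `WEIGHTS`, `composite`, and `grade` are illustrative and not part of the actual scorer.

```python
# Category weights as listed on this page (they sum to 100%).
WEIGHTS = {
    "readability": 0.10,
    "seo": 0.15,
    "ai_evasion": 0.20,
    "geo": 0.13,
    "engagement": 0.15,
    "eeat": 0.15,
    "depth": 0.07,
    "multimedia": 0.05,
}

# Assumed lower bound of each grade band, highest first.
GRADE_BANDS = [(90, "A+"), (80, "A"), (70, "B+"), (60, "B"), (50, "C"), (40, "D")]


def composite(scores: dict[str, float]) -> float:
    """Weighted average of the eight category scores (assumed formula)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def grade(score: float) -> str:
    """Map a 0-100 score to a grade band; the page defines nothing below 40."""
    for floor, label in GRADE_BANDS:
        if score >= floor:
            return label
    return "below D"


# Category scores published for the "Ours: Admin Costs" post.
admin_costs = {
    "readability": 95, "seo": 90, "ai_evasion": 76, "geo": 79,
    "engagement": 85, "eeat": 84, "depth": 84, "multimedia": 62,
}
total = composite(admin_costs)
print(round(total), grade(total))  # prints: 83 A
```

Under that assumption, the Admin Costs category scores come out at roughly 82.8, which rounds to the 83 published for that post. The real scorer may apply adjustments this sketch does not capture, so treat it as an approximation rather than the scoring rule itself.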

No AI judges. No screenshots. Just measurable structure. The benchmark is reproducible by anyone with a post and the category definitions on this page.

03

The contenders

We scored nine blog posts using identical benchmark settings, on the same day.

Our blog generator, 4 consecutive production posts

Four consecutive live blog posts from bespokeworks.ai/blog. We picked the four most recent posts at time of benchmarking, fetched them from the live site, and scored them exactly as they ship to readers. Nothing was reselected, rerun, or touched up before scoring.

Claude Opus 4.6 and GPT-4o (raw API calls)

Claude Opus 4.6 is the strongest general-purpose AI model on the market in 2026. GPT-4o is the default low-cost model that most AI content tools quietly run under the hood. We asked each of them, via a single API call, to write a 2,000-word blog post on the same topic. This is the ceiling of what you get from a raw LLM without a pipeline around it.

Three long-form human blogs

Three of the most-read long-form business and tech writers on the internet. Each post was fetched from its live site and run through the same benchmark: a16z’s Big Ideas 2025, a Stratechery piece by Ben Thompson, and Paul Graham’s Great Work.

These writers have, collectively, decades of practice and tens of millions of readers. To make the comparison fair, we gave each post the same structured metadata we gave our own output (a title, a meta description, topic tags). Nobody was disadvantaged by missing frontmatter.

04

The composite scores

Nine blog posts. One automated scorer. No re-runs, no cherry-picking. The bar chart below puts every contender on the same scale.

Composite Score · 0–100

Ours: Admin Costs · Production post · 83
Ours: AI Agent Investment · Production post · 82
Ours: Own AI Platform · Production post · 82
Ours: Agent Governance · Production post · 80
Claude Opus 4.6 · Raw API call · 65
a16z, Big Ideas 2025 · Human · 12,996 words · 60
Stratechery (Ben Thompson) · Human · 4,239 words · 58
GPT-4o · Raw API call · 55
Paul Graham, Great Work · Human · 11,410 words · 55

Every one of our four production posts outscored every other contender. Our lowest scoring post of the four (80) beat the highest scoring outside post (Opus 4.6 at 65) by 15 points. Our top post (83) beat Paul Graham by 28, a raw GPT-4o call by 28, and a raw Claude Opus call by 18.
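
The margins quoted here follow directly from the composite scores in the chart. A short Python check, using only those published totals (the variable names are illustrative):

```python
from statistics import mean

# Composite scores exactly as published in the chart.
ours = {
    "Admin Costs": 83,
    "AI Agent Investment": 82,
    "Own AI Platform": 82,
    "Agent Governance": 80,
}
others = {
    "Claude Opus 4.6": 65,
    "a16z, Big Ideas 2025": 60,
    "Stratechery (Ben Thompson)": 58,
    "GPT-4o": 55,
    "Paul Graham, Great Work": 55,
}

our_average = mean(ours.values())                    # 81.75, reported as 81.8
best_other = max(others.values())                    # 65 (Claude Opus 4.6)
worst_case_margin = min(ours.values()) - best_other  # 15
average_margin = our_average - best_other            # 16.75, about 17 points

print(our_average, worst_case_margin, average_margin)
```

The worst-case margin compares our weakest post with the strongest outside post, so it is the most conservative of the three figures.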

The model isn’t the bottleneck. The pipeline around it is. A 17-point margin doesn’t come from better prompts. It comes from everything that happens before and after the LLM is asked to write a sentence.

05

Every category, every contender

The composite hides the detail. Here is every category, for every contender. Our production posts are highlighted.

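| Post | Words | Read | SEO | AI Evasion | GEO | Engage | E-E-A-T | Depth | MM | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ours: Admin Costs | 3,414 | 95 | 90 | 76 | 79 | 85 | 84 | 84 | 62 | 83 |
| Ours: AI Investment | 3,262 | 94 | 80 | 75 | 84 | 85 | 84 | 92 | 62 | 82 |
| Ours: Own AI Platform | 3,334 | 89 | 86 | 75 | 85 | 83 | 83 | 84 | 62 | 82 |
| Ours: Agent Governance | 3,492 | 89 | 80 | 67 | 79 | 91 | 84 | 84 | 62 | 80 |
| Claude Opus 4.6 (raw) | 1,906 | 82 | 53 | 59 | 81 | 74 | 42 | 84 | 30 | 65 |
| a16z, Big Ideas 2025 | 12,996 | 64 | 58 | 59 | 76 | 59 | 47 | 78 | 38 | 60 |
| Stratechery, Thompson | 4,239 | 53 | 55 | 60 | 58 | 66 | 56 | 72 | 30 | 58 |
| GPT-4o (raw) | 1,176 | 63 | 46 | 50 | 63 | 60 | 31 | 84 | 38 | 55 |
| Paul Graham, Great Work | 11,410 | 88 | 55 | 61 | 46 | 63 | 36 | 42 | 30 | 55 |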

06

Where the pipeline wins, and why

Our pipeline is not a better writer than Paul Graham. He would probably write a better essay than anything we’ll ever produce. The advantage shows up in the categories essayists and raw AI calls have no reason to optimise for, because they’re writing for a different job.
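
The category averages quoted in the headings that follow can be recomputed from the table. A small sketch in Python, using the four production rows and illustrative names:

```python
from statistics import mean

# Readability, SEO, GEO, and E-E-A-T scores for the four production posts,
# copied from the category table.
PRODUCTION_ROWS = {
    "Admin Costs": {"readability": 95, "seo": 90, "geo": 79, "eeat": 84},
    "AI Investment": {"readability": 94, "seo": 80, "geo": 84, "eeat": 84},
    "Own AI Platform": {"readability": 89, "seo": 86, "geo": 85, "eeat": 83},
    "Agent Governance": {"readability": 89, "seo": 80, "geo": 79, "eeat": 84},
}


def category_average(category: str) -> float:
    """Mean score for one category across the four production posts."""
    return mean(row[category] for row in PRODUCTION_ROWS.values())


for category in ("readability", "seo", "geo", "eeat"):
    print(category, category_average(category))
```

The unrounded averages are 91.75 for readability, 84 for SEO, 81.75 for GEO, and 83.75 for E-E-A-T.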

Readability: our average 92 vs Stratechery 53, Opus 82

Readability is less about style and more about whether a normal reader can actually get through a post without giving up. Our pipeline rewrites in multiple passes, checking sentence length variation, paragraph density, and reading level, until the scoring system is satisfied. Ben Thompson’s piece scored 53 because it is written in long, academic paragraphs for an audience that has subscribed to read exactly that. That works for him. It does not work for a small business trying to reach new customers.
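
The exact readability checks and thresholds are not published, so the following is only a rough sketch of the kind of signals described above: sentence length variation and paragraph density. The function name, the regex-based sentence splitter, and the returned field names are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def readability_signals(text: str) -> dict[str, float]:
    """Rough structural signals for non-empty text (not the real benchmark)."""
    # Paragraphs are assumed to be separated by a blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    sentence_lengths = [len(s.split()) for s in sentences]
    return {
        "avg_sentence_words": mean(sentence_lengths),
        # A higher spread means more variety in sentence length.
        "sentence_length_spread": pstdev(sentence_lengths),
        "avg_paragraph_words": mean(len(p.split()) for p in paragraphs),
    }


sample = (
    "Short one. This sentence is a little bit longer than that.\n\n"
    "New paragraph here."
)
print(readability_signals(sample))
```

A multi-pass rewrite of the kind described here would presumably recompute signals like these after each pass and stop once they fall inside whatever target ranges the scorer uses.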

SEO: our average 84 vs Opus 53, humans 55 to 58

SEO in this benchmark is not about keyword stuffing. It is about whether the structural ingredients Google needs are actually present: the right number of headers for the post length, a meta description in the correct 140 to 160 character band, internal links to other pages, external citations, and topic tags. A raw Opus call scores 53 here not because it writes badly, but because it has no concept of how many H2 headers a 3,400 word post should have. Our pipeline does, because the scoring system told us.
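
As an illustration of how structural checks like these could be automated, here is a minimal Python sketch. It assumes the post body is Markdown, and the function name, parameters, and result keys are hypothetical. The only threshold taken from this page is the 140 to 160 character meta description band. The target header count per post length is not published, so the sketch only reports the H2 count.

```python
import re
from urllib.parse import urlparse

META_MIN, META_MAX = 140, 160  # character band quoted on this page


def seo_checks(body_md: str, meta_description: str,
               tags: list[str], site_host: str) -> dict:
    """Structural SEO signals for a Markdown post (illustrative only)."""
    # Collect link targets from Markdown links of the form [text](url).
    links = re.findall(r"\[[^\]]*\]\(([^)\s]+)\)", body_md)
    hosts = [urlparse(url).netloc for url in links]
    return {
        "meta_in_band": META_MIN <= len(meta_description) <= META_MAX,
        "h2_count": len(re.findall(r"^## ", body_md, flags=re.MULTILINE)),
        # Relative links have an empty host and are treated as internal.
        "has_internal_link": any(h in ("", site_host) for h in hosts),
        "has_external_link": any(h not in ("", site_host) for h in hosts),
        "has_tags": bool(tags),
    }


body = (
    "## One\n\nSee [pricing](/pricing) and "
    "[a study](https://example.org/report).\n\n## Two\n"
)
print(seo_checks(body, "x" * 150, ["ai"], "bespokeworks.ai"))
```

Each check returns a plain boolean or count, which keeps the result deterministic and easy to audit.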

GEO: our average 82 vs Paul Graham 46, Stratechery 58

GEO, or Generative Engine Optimisation, is the newer version of SEO for the AI search era. ChatGPT, Perplexity, and Google’s AI Overviews pull from certain kinds of content more often than others: entity-first paragraphs, self-contained sections, cited statistics, definition patterns, and recent date signals. Paul Graham scores 46 not because his essays are bad, but because they are written as flowing arguments for humans, not as structured snippets for AI engines. In 2026, ranking in Google is only half the battle. The other half is being the source ChatGPT quotes.

E-E-A-T: our average 84 vs Opus 42, humans 36 to 56

E-E-A-T is Google’s own quality framework, standing for Experience, Expertise, Authoritativeness, and Trustworthiness. It scores things like whether there is a named author, whether a real bio is attached, whether claims are backed by citations, and whether the writer hedges appropriately (real experts do; AI confidently does not). A raw Opus call has no author attached to it. Paul Graham does, but famously does not cite sources. Every post our pipeline ships has a real author, a bio, external citations, and honest hedging where the evidence is genuinely uncertain.

07

What this means for a content programme

A benchmark number is only interesting if it changes something in the real world. The roughly 17-point gap between our pipeline’s average and the best raw AI call (about 27 points against raw GPT-4o) shows up in four specific places for the businesses we work with.

01

Posts that can actually rank

An SEO score of 84 means most of the structural ingredients Google looks for are present: header structure, links, meta description, tags. Content that is merely "well written" but structurally wrong tends to sit at the bottom of search results. Our output is built to climb those results over time, which is how content marketing actually pays back.

02

Cited by ChatGPT and AI search

A growing share of searches now happen inside ChatGPT, Perplexity, and Google’s AI Overviews. Those engines pick sources based on structure, not just quality. A GEO score of 82 means our posts are in the format those engines prefer to cite. For a business whose customers are increasingly asking AI first, this is the difference between being mentioned and being invisible.

03

Trust signals baked in

Every post we ship has a named author with a real bio, genuine external citations, and language that hedges where the evidence is uncertain. An E-E-A-T score of 84 means the post looks like an expert wrote it, not a bot. That matters both to Google’s ranking and to the human reader deciding whether to trust what they’re reading.

04

Measured before it reaches you

Every post generated by our pipeline runs through this benchmark before it is delivered. Posts that score below threshold go back through the pipeline rather than out to a client. You do not receive the weak runs. You receive the posts that actually pass the bar we set internally, which is the same bar shown on this page.
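
The gate described above can be sketched as a simple score-and-retry loop. This is a minimal Python illustration, not the production pipeline: `score` and `revise` are hypothetical callables, and the threshold of 80 and the attempt limit are placeholder values, since the page does not state them.

```python
from typing import Callable, Optional


def deliver_or_retry(
    draft: str,
    score: Callable[[str], float],
    revise: Callable[[str], str],
    threshold: float = 80.0,  # placeholder; the real threshold is not published
    max_attempts: int = 3,    # placeholder retry limit
) -> Optional[str]:
    """Return the first draft that meets the threshold, or None if none does."""
    for _ in range(max_attempts):
        if score(draft) >= threshold:
            return draft          # passes the bar, safe to deliver
        draft = revise(draft)     # below the bar, send it back through
    return None                   # never delivered


# Toy stand-ins: the score is the draft length, and a revision adds a character.
result = deliver_or_retry("ab", lambda d: float(len(d)), lambda d: d + "x",
                          threshold=4, max_attempts=3)
print(result)  # prints: abxx
```

Returning `None` instead of the last draft makes the "you do not receive the weak runs" rule explicit: a post that never reaches the bar is never handed over.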

08

Why even the best AI model only reaches 65

Claude Opus 4.6 is a remarkable model. On pure writing quality, a single Opus call is close to the best you can get from any AI today. But a production blog post in 2026 is not only a piece of writing. It is a structured content object that needs to carry author credentials, properly formatted headers, external citations, a meta description, topic tags, an FAQ where relevant, and internal links to other pages on the site.

None of that is a writing problem. It is a workflow problem. A single API call cannot produce it, however capable the underlying model is. Our pipeline runs a sequence of separate passes, each one responsible for a different part of the post: research, structural blueprint, first draft, structural edit, voice and rhythm edit, SEO and GEO pass, image generation, FAQ generation, internal linking, final scoring. The output is not a better model. It is a better assembly line.
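
To make the assembly-line idea concrete, here is a minimal Python sketch of a pipeline that runs the named passes in order. The pass names come from the paragraph above. Everything else is assumed for illustration, including the dictionary-based post object, the `run_pipeline` function, and the idea that each pass is a function from post to post. The real implementation is not published.

```python
from typing import Callable

# A pass takes the post-in-progress and returns an updated version of it.
Pass = Callable[[dict], dict]

# Pass names in the order listed above.
PASS_ORDER = [
    "research",
    "structural blueprint",
    "first draft",
    "structural edit",
    "voice and rhythm edit",
    "SEO and GEO pass",
    "image generation",
    "FAQ generation",
    "internal linking",
    "final scoring",
]


def run_pipeline(topic: str, passes: dict[str, Pass]) -> dict:
    """Run every pass in order, recording which ones have been applied."""
    post = {"topic": topic, "history": []}
    for name in PASS_ORDER:
        post = passes[name](post)
        post["history"].append(name)
    return post


# Stub passes that change nothing, just to show the control flow.
stub_passes = {name: (lambda post: post) for name in PASS_ORDER}
result = run_pipeline("admin costs", stub_passes)
print(len(result["history"]))  # prints: 10
```

Keeping each pass as a separate function means any one of them can be swapped, tested, or rerun without touching the others, which is the practical difference between a pipeline and a single long prompt.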

This is why our lowest scoring post still outperformed Opus by 15 points, even though Opus is writing the underlying sentences in some of those passes. The categories that drag Opus down (SEO, E-E-A-T, multimedia) are the categories our pipeline handles before and after the model is involved.

The output is not a better model. It is a better assembly line. Prompt engineering has a ceiling. Pipeline engineering doesn’t.

09

Honest limits

Every number on this page came from a single run, on one day, across nine specific posts. We are not claiming the pipeline wins on every topic and every prompt. We are claiming that on this specific, transparent test, the four production posts we picked all scored 80 or above, and the five alternatives we benchmarked did not reach 66.

We will rerun this at every pipeline version bump and publish the updated numbers, including any posts that score badly. The benchmark itself is deterministic, so anyone with a post and the category definitions on this page can approximate the same test. The scoring rules are standard (reading ease, sentence variety, header density, citation counts, entity density, and so on). What we have not published is the pipeline behind it, because that is the part our clients pay for.

Run the same scoring on your content

Content built to rank, get cited, and pass a bar.

If you’re running a content programme and want the scoring system shown on this page running over every post before it ships, talk to us. We can walk through what a programme looks like for your business, what topics we would target, and what the first month would produce.

