Case Study

How We Beat Every Published Benchmark for Multi-Hop Question Answering

A training-free RAG system that outperforms graph-based and fine-tuned alternatives on HotpotQA.

The Problem: When AI Hallucinates

Large Language Models (LLMs) are powerful, but they have a fundamental flaw: they hallucinate. They generate plausible-sounding but factually incorrect information. This is a major problem for businesses that need accurate, verifiable answers from their data.

To combat this, a technique called Retrieval Augmented Generation (RAG) was developed. Instead of relying solely on the LLM's internal knowledge, RAG systems first retrieve relevant information from a trusted knowledge base (like your company documents) and then provide that information to the LLM as context for generating an answer.

A typical RAG pipeline looks like this:

  1. Retrieve: Given a user's question, search a database of documents to find relevant passages.
  2. Augment: Combine the original question with the retrieved passages, forming a prompt for the LLM.
  3. Generate: The LLM uses this augmented prompt to generate an answer, grounded in the provided context.
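The three steps above can be sketched in a few lines of code. This is a minimal illustration, not our production system: the retriever, the relevance score, and the `llm` callable are all hypothetical stand-ins (real pipelines use dense embeddings and a hosted model).

```python
def overlap(question, passage):
    """Toy relevance score: count shared lowercase tokens.
    Real systems use dense embeddings, not keyword overlap."""
    q, p = set(question.lower().split()), set(passage.lower().split())
    return len(q & p)

def retrieve(question, index, k=5):
    """Step 1 - Retrieve: return the k passages most relevant to the question."""
    ranked = sorted(index, key=lambda passage: overlap(question, passage), reverse=True)
    return ranked[:k]

def augment(question, passages):
    """Step 2 - Augment: combine retrieved context and the question into one prompt."""
    context = "\n\n".join(passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def rag_answer(question, index, llm):
    """Step 3 - Generate: the LLM answers from the augmented prompt."""
    return llm(augment(question, retrieve(question, index)))
```

The key property is that the model only sees the question together with trusted passages, so its answer can be checked against the retrieved text.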

RAG significantly reduces hallucinations and improves factual accuracy. However, not all RAG systems are created equal.

HotpotQA: The Adversarial Benchmark for Multi-Hop Reasoning

When you ask an AI a question that requires connecting information from multiple sources, most RAG systems fail. Not obviously — they give you an answer that sounds right but isn't. This is especially true for complex questions that require "multi-hop reasoning."

HotpotQA is an academic benchmark specifically designed to test this: can your system answer questions that require reading two or more passages and connecting the dots? It's an adversarial dataset, meaning questions are crafted to trick models that rely on simple keyword matching or single-document retrieval.

For example, a question might be: "Who was the director of the movie starring the actor who played Gandalf in The Lord of the Rings?"
This requires:

  1. Identifying "Gandalf" and "The Lord of the Rings" to find the actor (Ian McKellen).
  2. Using "Ian McKellen" to find a movie he starred in (e.g., X-Men).
  3. Finding the director of that movie (Bryan Singer).

A simple RAG system might retrieve documents about Gandalf and The Lord of the Rings, but fail to connect that information to another movie and its director.

Most standard RAG systems score around 72% on HotpotQA. The best published result — a method called StepChain GraphRAG, which uses knowledge graphs and multi-step chain-of-thought reasoning — reaches 79.5%.

We wanted to do better.

The Result: Outperforming the State-of-the-Art

Our system scored an F1 of 86.8% on HotpotQA. That's 7.3 points above the best published result and 14.8 points above a standard RAG pipeline.

  • BespokeWorks Foundry (ours): 86.8%
  • StepChain GraphRAG (published SOTA): 79.5%
  • Standard RAG pipeline: 72.0%

What does F1 mean? F1 is a common metric in question answering, combining precision (how much of the system's answer was correct) and recall (how much of the correct answer the system captured). An F1 of 86.8% means our answers overlap the correct answers by about 87% on average, and even when an answer is only partially right, it still captures most of the correct information.
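For readers who want the exact definition, token-level F1 compares the predicted and gold answers word by word. A minimal sketch (the official HotpotQA scorer additionally normalises punctuation and articles before comparing, which this version omits):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-level F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # how much of the prediction is correct
    recall = overlap / len(gold_tokens)     # how much of the gold answer is captured
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "Sir Ian McKellen" against the gold answer "Ian McKellen" gives precision 2/3 and recall 1, so F1 = 0.8: partially right still earns partial credit.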

Beyond F1, we track two additional quality signals:

  • Correctness: 88.1% (factually accurate answers)
  • Faithfulness: 97.3% (grounded in source documents; near-zero hallucination)

Why Training-Free Matters

Most systems that score well on HotpotQA have been fine-tuned on HotpotQA data. They've seen the questions before — or questions very similar to them.

Our system has never seen HotpotQA data. It uses a general-purpose pipeline that works on any domain.

The analogy: This is the difference between a student who memorised the exam answers and one who actually understands the subject. Our system understands.

For businesses, this means: the same system that scores 86.8% on academic questions works on your finance documents, your legal contracts, your medical records — without retraining.

Fine-tuned models break when you move them to a new domain. They need new training data, new compute, new evaluation. Our approach doesn't. Deploy it on Monday, and it works on whatever documents you point it at.

How It Works

The core insight: the bottleneck in RAG isn't retrieval — it's extraction. Our retrieval already captures 97% of relevant passages. The problem is what happens after. The AI finds the right documents but then extracts the wrong answer from them.

We solved this with four techniques.

  1. Multi-Prompt Extraction with Evidence-Weighted Voting: We ask the same question 5 different ways, each time with a slightly different emphasis. Each prompt extracts a candidate answer. Then we vote. Not a simple majority vote, but evidence-weighted voting that considers how much supporting evidence each candidate has across all retrieved passages.
  2. Bridge Entity Detection: Multi-hop questions have a hidden structure: you need to find Entity A in Document 1, then use Entity A to find the answer in Document 2. We detect these bridge entities automatically and use them to guide extraction, ensuring the system follows the reasoning chain rather than jumping to surface-level matches.
  3. Adversarial Candidate Deliberation: When the 5 prompts disagree and produce 3 or more unique candidates, we run a deliberation step: the LLM explicitly reasons about which candidate actually answers the question that was asked. This catches "wrong hop" errors where the system finds a related entity but not the one the question is about.
  4. Precision Post-Processing: 36% of errors in standard RAG systems come from formatting. The AI knows the right answer but wraps it in unnecessary context. We strip parentheticals, truncate verbose explanations, normalise name variants, and ground extracted spans against the source text. The answer gets shorter and more precise.
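The post-processing idea (technique 4) can be illustrated with a toy normaliser. The rules below are illustrative stand-ins, not our production rule set:

```python
import re

def tidy_answer(answer, source_text):
    """Illustrative post-processing: strip parentheticals, truncate trailing
    explanation, and keep the shortened span only if it is grounded in the
    source document."""
    span = re.sub(r"\s*\([^)]*\)", "", answer)      # strip parentheticals
    span = span.split(". ")[0].strip().rstrip(".")  # truncate verbose explanations
    # Grounding check: only keep the shortened span if the source contains it.
    if span.lower() in source_text.lower():
        return span
    return answer.strip()
```

Given the source sentence "The film was directed by Bryan Singer in 2000.", the answer "Bryan Singer (the director)" is shortened to "Bryan Singer": same information, higher token-level precision.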

Each technique contributes. But the biggest single improvement came from multi-prompt voting — asking the same question in different ways and letting the evidence decide. It's a simple idea, and it works because interpretation diversity surfaces answers that any single prompt might miss.
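The voting step described above can be sketched as follows. The scoring here (counting how often a candidate string appears across the retrieved passages) is a simplified stand-in for the real evidence model:

```python
from collections import defaultdict

def evidence_weighted_vote(candidates, passages):
    """Pick the candidate best supported across ALL retrieved passages,
    rather than the one proposed by the most prompts (plain majority vote)."""
    scores = defaultdict(float)
    for cand in candidates:  # one candidate answer per prompt variant
        support = sum(p.lower().count(cand.lower()) for p in passages)
        scores[cand] += 1 + support  # 1 for the vote itself, plus evidence weight
    return max(scores, key=scores.get)
```

Because evidence counts toward the score, a candidate proposed by fewer prompts can still win if the passages strongly support it, which is what lets interpretation diversity surface answers a single prompt would miss.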

What We Tested

We didn't cherry-pick results. We tested on 35+ cases with random seeds to avoid overfitting to a specific sample.

This matters more than most people realise. A 10-case sample has a confidence interval of roughly ±16% — you can get an F1 of 0.92 one day and 0.78 the next with the same code. We only trust results from 35+ case evaluations with randomised selection.

We tested 16 pipeline variations to find what actually works. Most ideas that sound good on paper made things worse in practice:

  • Sentence isolation — extracting individual sentences instead of spans. Sounded precise. Lost context. Regressed.
  • Comparison decomposition — breaking comparison questions into sub-questions. Added complexity without accuracy. Regressed.
  • Contrastive verification — asking the model to verify its answer against alternatives. Over-corrected. Changed right answers to wrong ones.
  • High-temperature prompt diversity — using temperature 0.7-0.9 for voting prompts. Introduced noise instead of diversity. Regressed.

More complexity doesn't equal more accuracy. The final system uses techniques that each proved their value in isolation on held-out data.

What This Means For Your Business

If you're building a chatbot, a knowledge system, or any AI that answers questions from your documents, the accuracy of the underlying pipeline determines how much human oversight you need.

At 72% F1, someone needs to check roughly every third answer. At 86.8% F1 and 88.1% correctness, the error rate drops by more than half. That's not an incremental improvement: it's the difference between a system that creates work and one that eliminates it.

Our pipeline — the same one that scored 86.8% on the hardest academic benchmark for multi-hop reasoning — is what powers every knowledge system we deploy for clients. It works on finance documents, legal contracts, medical records, technical documentation. No retraining needed.

For context: human annotators score F1 91.4% on this benchmark (Yang et al., 2018) — but each question requires reading 500–1,500 words across multiple passages, identifying which pieces connect, and writing a precise answer. That takes a skilled researcher 2–5 minutes per question on the benchmark alone. In practice, with real business documents — 30-page contracts, dense financial reports, multi-section medical records — the reading time per query is significantly higher. Our system handles the same reasoning in seconds, and gets within 5 points of human accuracy. It's not fully autonomous — high-stakes answers still benefit from a human check — but it eliminates the bulk of the review work that doesn't need one.

And here's what that means in practice: your company's data will very likely produce even higher scores than the benchmark numbers suggest. HotpotQA was engineered with hand-crafted adversarial distractors — passages that share the same named entities and topics as the correct answer but deliberately lead the model in the wrong direction. That kind of adversarial noise doesn't exist in your document store. Your data might be extensive, unstructured, or inconsistently formatted — but it isn't trying to trick the AI. The result: the gap in accuracy between HotpotQA and real enterprise deployments has consistently favoured the production environment. F1, correctness, and faithfulness all tend to be meaningfully higher when the system runs on your actual data.

Talk to us about your knowledge retrieval needs

Whether you need a chatbot that actually gets things right, or an internal knowledge system your team can trust — the technology is already built and tested.