A training-free RAG system that outperforms graph-based and fine-tuned alternatives on HotpotQA.
Large Language Models (LLMs) are powerful, but they have a fundamental flaw: they hallucinate. They generate plausible-sounding but factually incorrect information. This is a major problem for businesses that need accurate, verifiable answers from their data.
To combat this, a technique called Retrieval-Augmented Generation (RAG) was developed. Instead of relying solely on the LLM's internal knowledge, RAG systems first retrieve relevant information from a trusted knowledge base (like your company documents) and then provide that information to the LLM as context for generating an answer.
A typical RAG pipeline has three stages: index your documents, retrieve the passages most relevant to each query, and generate an answer grounded in those passages.
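In code, the loop can be sketched like this. A toy keyword-overlap retriever and a hypothetical `call_llm` function stand in for a real vector store and model API:

```python
# Toy RAG loop: retrieve top-k passages, build a grounded prompt, generate.
def retrieve(query, documents, k=3):
    """Rank documents by keyword overlap with the query (toy retriever)."""
    q_tokens = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_tokens & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, passages):
    """Instruct the model to answer only from the retrieved context."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def rag_answer(query, documents, call_llm):
    """End-to-end: retrieve, then generate with the context in the prompt."""
    return call_llm(build_prompt(query, retrieve(query, documents)))
```

In production the keyword retriever is replaced by a vector index and `call_llm` by your model API, but the shape of the loop stays the same.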
RAG significantly reduces hallucinations and improves factual accuracy. However, not all RAG systems are created equal.
When you ask an AI a question that requires connecting information from multiple sources, most RAG systems fail. Not obviously — they give you an answer that sounds right but isn't. This is especially true for complex questions that require "multi-hop reasoning."
HotpotQA is an academic benchmark specifically designed to test this: can your system answer questions that require reading two or more passages and connecting the dots? It's an adversarial dataset, meaning questions are crafted to trick models that rely on simple keyword matching or single-document retrieval.
For example, a question might be: "Who was the director of the movie starring the actor who played Gandalf in The Lord of the Rings?"
This requires chaining steps: identify the actor who played Gandalf (Ian McKellen), find a movie he starred in, then find that movie's director. Each step depends on the answer to the one before it, so no single retrieval can surface the final answer directly.
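The hop chain can be sketched with a toy fact store. The sub-questions and lookup structure here are purely illustrative; in a real system each lookup is a retrieval-plus-extraction step:

```python
# Illustrative hop chain for the Gandalf question.
facts = {
    "actor who played Gandalf": "Ian McKellen",
    "a movie starring Ian McKellen": "Richard III",
    "director of Richard III": "Richard Loncraine",
}

def answer_hop(sub_question):
    """Stand-in for one retrieve-and-extract round trip."""
    return facts[sub_question]

actor = answer_hop("actor who played Gandalf")        # hop 1
movie = answer_hop(f"a movie starring {actor}")       # hop 2: built from hop 1
director = answer_hop(f"director of {movie}")         # hop 3: built from hop 2
```

Keyword matching on the original question surfaces Gandalf and Lord of the Rings passages, but the director's name never co-occurs with those keywords; only the chained lookups reach it.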
Most standard RAG systems score around 72% on HotpotQA. The best published result — a method called StepChain GraphRAG, which uses knowledge graphs and multi-step chain-of-thought reasoning — reaches 79.5%.
We wanted to do better.
Our system scored F1 86.8% on HotpotQA. That's 7.3 points above the best published result and 14.8 points above a standard RAG pipeline.
What does F1 mean? F1 is a standard metric in question answering that combines precision (how much of the system's answer was correct?) and recall (how much of the correct answer did it capture?). An F1 of 86.8% doesn't mean the system is right exactly 87 times out of 100; it means that, averaged across questions, its answers overlap almost entirely with the reference answers, and even partially correct answers capture most of the right information.
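For intuition, the token-overlap F1 used in SQuAD/HotpotQA-style evaluation can be sketched like this (the official scripts also normalise punctuation and articles, which is omitted here):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A partially correct answer like "Loncraine" against the gold "Richard Loncraine" scores 0.67, which is how a system can be "mostly right" without being exactly right.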
Beyond F1, we track two additional quality signals: answer correctness (does the answer actually match the reference?) and faithfulness (is every claim in the answer grounded in the retrieved passages?).
Most systems that score well on HotpotQA have been fine-tuned on HotpotQA data. They've seen the questions before — or questions very similar to them.
Our system has never seen HotpotQA data. It uses a general-purpose pipeline that works on any domain.
The analogy: This is the difference between a student who memorised the exam answers and one who actually understands the subject. Our system understands.
For businesses, this means: the same system that scores 86.8% on academic questions works on your finance documents, your legal contracts, your medical records — without retraining.
Fine-tuned models break when you move them to a new domain. They need new training data, new compute, new evaluation. Our approach doesn't. Deploy it on Monday, and it works on whatever documents you point it at.
The core insight: the bottleneck in RAG isn't retrieval — it's extraction. Our retrieval already captures 97% of relevant passages. The problem is what happens after. The AI finds the right documents but then extracts the wrong answer from them.
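One way to locate the bottleneck is to measure retrieval recall separately from end-to-end accuracy: if the gold supporting passages are almost always in the retrieved set but the final answer is still wrong, extraction, not retrieval, is the problem. A minimal sketch (the passage IDs are illustrative):

```python
def retrieval_recall(retrieved_ids, gold_ids):
    """Fraction of the gold supporting passages present in the retrieved set."""
    gold = set(gold_ids)
    return len(gold & set(retrieved_ids)) / len(gold) if gold else 1.0

def mean_retrieval_recall(examples):
    """Average recall over (retrieved_ids, gold_ids) pairs."""
    scores = [retrieval_recall(r, g) for r, g in examples]
    return sum(scores) / len(scores)
```

A mean recall near 0.97 alongside a much lower answer F1 is exactly the signature described above: the right documents are found, the wrong answer is extracted.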
We solved this with three techniques.
Each technique contributes. But the biggest single improvement came from multi-prompt voting — asking the same question in different ways and letting the evidence decide. It's a simple idea, and it works because interpretation diversity surfaces answers that any single prompt might miss.
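A minimal sketch of the voting idea, with hypothetical paraphrase templates and a plain majority vote standing in for the full evidence-weighting logic:

```python
from collections import Counter

def vote_answer(question, context, paraphrases, call_llm):
    """Ask the same question via several phrasings; return the majority answer
    together with the fraction of prompts that agreed with it."""
    answers = [
        call_llm(t.format(question=question, context=context)).strip().lower()
        for t in paraphrases
    ]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

The agreement ratio is a useful by-product: low agreement flags answers that deserve a human look.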
We didn't cherry-pick results. We tested on 35+ cases with random seeds to avoid overfitting to a specific sample.
This matters more than most people realise. A 10-case sample has a confidence interval of roughly ±16% — you can get an F1 of 0.92 one day and 0.78 the next with the same code. We only trust results from 35+ case evaluations with randomised selection.
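The exact interval depends on the metric's variance, but the normal-approximation binomial interval gives the flavour of why small samples are untrustworthy:

```python
import math

def ci_half_width(p, n, z=1.96):
    """95% normal-approximation confidence half-width for a proportion
    p measured over n independent cases."""
    return z * math.sqrt(p * (1 - p) / n)
```

For a per-question pass rate near 85%, this gives roughly ±22 points at n = 10 and roughly ±12 at n = 35. F1 is continuous rather than pass/fail, so the real interval differs, but the 1/√n shrinkage is the same.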
We tested 16 pipeline variations to find what actually works. Most ideas that sound good on paper made things worse in practice.
More complexity doesn't equal more accuracy. The final system uses techniques that each proved their value in isolation on held-out data.
If you're building a chatbot, a knowledge system, or any AI that answers questions from your documents, the accuracy of the underlying pipeline determines how much human oversight you need.
At 72% accuracy, more than one answer in four is wrong, so in practice someone has to review nearly everything. At 86.8% F1 and 88.1% correctness, the error rate drops by more than half. That's not an incremental improvement: it's the difference between a system that creates work and one that eliminates it.
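The arithmetic behind that claim, treating the correctness score as the share of answers that need no human fix (a simplification, since reviewers also check correct answers):

```python
def error_rate(accuracy):
    """Share of answers a human reviewer would need to catch and fix."""
    return 1.0 - accuracy

baseline = error_rate(0.72)        # standard RAG pipeline
ours = error_rate(0.881)           # our correctness score
reduction = 1 - ours / baseline    # fraction of review load eliminated
```

That works out to an error rate falling from 28% to roughly 12%, a reduction of about 57%.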
Our pipeline — the same one that scored 86.8% on the hardest academic benchmark for multi-hop reasoning — is what powers every knowledge system we deploy for clients. It works on finance documents, legal contracts, medical records, technical documentation. No retraining needed.
For context: human annotators score F1 91.4% on this benchmark (Yang et al., 2018) — but each question requires reading 500–1,500 words across multiple passages, identifying which pieces connect, and writing a precise answer. That takes a skilled researcher 2–5 minutes per question on the benchmark alone. In practice, with real business documents — 30-page contracts, dense financial reports, multi-section medical records — the reading time per query is significantly higher. Our system handles the same reasoning in seconds, and gets within 5 points of human accuracy. It's not fully autonomous — high-stakes answers still benefit from a human check — but it eliminates the bulk of the review work that doesn't need one.
And here's what that means in practice: your company's data will very likely produce even higher scores than the benchmark numbers suggest. HotpotQA was engineered with hand-crafted adversarial distractors — passages that share the same named entities and topics as the correct answer but deliberately lead the model in the wrong direction. That kind of adversarial noise doesn't exist in your document store. Your data might be extensive, unstructured, or inconsistently formatted — but it isn't trying to trick the AI. The result: the gap in accuracy between HotpotQA and real enterprise deployments has consistently favoured the production environment. F1, correctness, and faithfulness all tend to be meaningfully higher when the system runs on your actual data.
Whether you need a chatbot that actually gets things right, or an internal knowledge system your team can trust — the technology is already built and tested.