My System Beats Claude Sonnet 4.6 on Accuracy. Here's the Story.
I asked Claude Opus to cross-check my property article numbers. I found 16 discrepancies. I then verified each one against the Land Registry data manually with Python. Every single time, the result sided with my pipeline.
2026-04-19 · 7 min read

Please don't misread that title.
I am not claiming I built something smarter than Claude Sonnet 4.6. That would be absurd. In fact, Sonnet 4.6 is one of my two go-to models for almost everything I do; the other is Grok. Claude 4.7 ranks at or near the top of almost every benchmark, not to mention Anthropic's latest, Mythos. These are extraordinary models. I vibe-code almost everything with Opus and Grok, and I am a huge fan of both.
What I am saying is narrower, and I think more interesting: in one specific, analysis-heavy task, my structured pipeline consistently produces more accurate answers than asking the raw model to do the same job. I have the audit to prove it. And the irony of how I found out is the part worth telling.
The Setup
OpenProp is an AI Analyst that lets users query the London residential property market in plain English, backed by real transaction data from the official HM Land Registry. The database holds every sale in London since 2010. The core idea is simple: instead of asking an AI to search through 5GB of raw data, every factual claim is extracted live from a database before the model touches it. The model interprets, triages, and communicates. The pipeline retrieves.
I recently used OpenProp to write a series of property analysis articles — guides for first-time buyers, people relocating, renters researching boroughs, and property investors. The workflow was straightforward:
1. I queried OpenProp for the statistics I needed, query by query.
2. I wrote the articles based on those answers.
3. In Claude Code, I asked Opus 4.6 to run a final check on exactly the same queries before publishing. I passed it all the necessary information and the database locations, and let it run in one go.
Step 3 is where things got interesting.
What Happened When Claude Fact-Checked
Claude came back with corrections. Several of them. Confident, well-reasoned-sounding corrections.
My usual practice, whenever two sources disagree, is to check against the original official dataset using traditional methods: plain Python queries. So I went back and verified each figure against the official Land Registry data.
The most interesting thing happened here: OpenProp's original numbers were right. Claude's corrections were wrong. This is something I would never have expected.
Across three articles, I found 16 discrepancies. In every single one, the Land Registry data sided with OpenProp.



3 Articles · 30 Queries · 16 Discrepancies — OpenProp got them all correct. Images by the author.
Here are the ones that tell the story most clearly.
Error 1: The Peak Year That Disappeared
OpenProp's answer: Tower Hamlets flat prices peaked in 2021 at £539,000. By 2025 the median had fallen to £455,000 — a decline of −15.6% from the peak.
Claude's correction: Prices fell −12.5% since 2020 (£520k → £455k).
What the Python verification says: OpenProp was right. 2021 was the peak. 2020 was not.
Claude picked 2020 as a convenient five-year anchor without checking which year was actually the highest. By doing so, it understated the market correction by about 3 percentage points (−12.5% instead of −15.6%), and its suggested five-year chart omitted 2021 entirely. The most important year in the dataset, the one that defines the entire narrative of the market cooling, was simply not there.
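The arithmetic behind the two figures is easy to reproduce. The snippet below is a minimal sketch that uses only the three medians quoted in this section; the variable and function names are my own, and the real check ran against the full Land Registry file rather than a hand-typed dictionary.

```python
def pct_change(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return (end - start) / start * 100


# Tower Hamlets flat medians quoted above (only these three years are given).
medians = {2020: 520_000, 2021: 539_000, 2025: 455_000}

latest_year = max(medians)                      # most recent year available
peak_year = max(medians, key=medians.get)       # year with the highest median

from_peak = pct_change(medians[peak_year], medians[latest_year])
from_2020 = pct_change(medians[2020], medians[latest_year])

print(f"Peak year: {peak_year}")
print(f"Change from peak: {from_peak:.1f}%")
print(f"Change from 2020: {from_2020:.1f}%")
print(f"Gap: {from_2020 - from_peak:.1f} percentage points")
```

Finding the peak first, and only then measuring the decline, is what keeps 2021 in the picture.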
Error 2: A Postcode Median Off by £55,000
OpenProp's answer: E3 postcode median flat price in 2025: £420,000
Claude's correction: £365,000
What the Python verification says: OpenProp was right. E3 median is £420,000.
Claude confused two different statistics. £365,000 is the Tower Hamlets borough-wide lower quartile (Q25 across all sales). Claude reported that figure as the E3 postcode-specific median. This is a classic LLM failure mode: when many similar-looking labels and values appear together, the model can attach a real number to the wrong label, much as it might carry a repeated value forward from a pattern in structured data even when the underlying facts differ.
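To make the distinction concrete, here is a small sketch of the two statistics computed side by side. The prices are invented toy values, not Land Registry figures, and the district grouping is only an assumption for illustration. The point is that the two numbers differ in both the population (one postcode district versus the whole borough) and the measure (median versus lower quartile).

```python
import statistics

# Toy prices, invented for illustration only (not Land Registry figures).
# Keys are postcode districts inside one hypothetical borough.
flat_prices = {
    "E3": [300_000, 320_000, 350_000],
    "E1": [250_000, 260_000, 400_000],
    "E14": [280_000, 500_000],
}

# Statistic A: the median for a single postcode district.
e3_median = statistics.median(flat_prices["E3"])

# Statistic B: the lower quartile (Q25) across every sale in the borough.
borough_prices = [p for prices in flat_prices.values() for p in prices]
borough_q25 = statistics.quantiles(borough_prices, n=4)[0]

print(f"E3 median:   £{e3_median:,.0f}")
print(f"Borough Q25: £{borough_q25:,.0f}")
```

Both values are legitimate outputs of the same dataset, which is exactly why swapping one for the other is hard to spot without rerunning the query.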
Error 3: A Transaction Count That Was Fabricated
OpenProp's answer: E3 had 501 flat transactions in 2025.
Claude's correction: 132 transactions.
What the Python verification says: 501. The number Claude gave was not in the data at all.
Error 4: A Percentage Rounded in the Wrong Direction
OpenProp's answer: 59.1% of Tower Hamlets flats sold below £500k in 2025 (1,401 of 2,370 transactions, counted exactly).
Claude's correction: ~55%
What the Python verification says: 59.1%. Claude's estimate was a round number written without querying anything.
Error 5: A Trend That Flipped Sign
OpenProp's answer: Lewisham flat prices rose +1.7% over five years (2021: £361,750 → 2025: £368,000).
Claude's correction: Prices fell −1.9% over five years (from 2020: £375,000).
What the Python verification says: Both figures are internally consistent — they just use different start years. OpenProp consistently uses 2021 as the five-year window start. Claude chose 2020. That one-year difference flipped the trend from positive to negative. The sign on the headline number changed entirely depending on which anchor year you chose.
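The sign flip can be reproduced from the three Lewisham medians quoted above. This is a sketch under one assumption: that a "five-year window" means five calendar years inclusive of the end year, which is how I read the 2021 to 2025 convention. The helper names are hypothetical.

```python
def window_start(end_year: int, n_years: int = 5) -> int:
    """First year of an inclusive n-year window ending at end_year (assumed rule)."""
    return end_year - n_years + 1


def pct_change(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return (end - start) / start * 100


# Lewisham flat medians quoted above.
lewisham = {2020: 375_000, 2021: 361_750, 2025: 368_000}

end_year = 2025
fixed_start = window_start(end_year)  # always 2021 for a 2025 end year

fixed_window = pct_change(lewisham[fixed_start], lewisham[end_year])
ad_hoc_window = pct_change(lewisham[2020], lewisham[end_year])

print(f"{fixed_start} to {end_year}: {fixed_window:+.1f}%")
print(f"2020 to {end_year}: {ad_hoc_window:+.1f}%")
```

Fixing the window rule in code means the same question always gets the same anchor year, whichever way the trend happens to point.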
The Pattern
When I summarised all 16 discrepancies across the three articles, four root causes explained every single one:
1. Retrieval from memory instead of live data. Transaction figures, percentages, and counts were sometimes recalled or estimated rather than queried, despite explicit instructions to take every figure from the database.
2. Conflating similar-sounding statistics. A borough-wide lower quartile was presented as a postcode-specific median: a real number, attached to a completely different meaning.
3. Inconsistent reference windows. Claude's five-year comparisons started from 2020 where OpenProp's start from 2021, and nothing in a free-form prompt pins that choice down. I see this as a consequence of the stochastic nature of LLMs, and it is one of the key reasons I believe a system with deterministic, definitive answers is necessary for this use case.
4. Round-number estimation instead of computation. Figures such as "~55%" and "~70%" are sometimes acceptable, but they were often rounded too aggressively where the exact figure matters.
This Is Not the Model's Fault
I want to be clear about this, because it matters.
Claude Sonnet 4.6's coding ability, reasoning, and language are still top-notch. The errors above are not evidence that the model is unreliable. They are evidence of what happens when you ask any general-purpose language model to do something it was not designed for: live, precise, domain-specific data retrieval.
Claude was trained to be broadly knowledgeable. When asked "what is the Tower Hamlets flat median in 2025?", querying a database is not its first instinct. It tends to draw first on patterns learned in training and on whatever is already in its context, make reasonable-sounding inferences, and produce a confident answer. Very often that answer is logically right. But sometimes it is wrong in exactly the ways that matter most: precise numbers, correct reference years, exact counts.
OpenProp's pipeline does something different. Every statistic is extracted from the database before the response is drafted. The model extracts the query parameters and hands the analysis to the agents; it never supplies a number from memory. The model then synthesises the retrieved figures into the final response. That separation is the whole trick: it is what makes the output accurate and reliable.
"Same model. Different architecture. Completely different accuracy."
This is actually the direction Anthropic themselves are building toward — tool use, retrieval-augmented generation, the entire agentic paradigm. I built a small, focused version of that for one specific problem: London property data.
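The query-first, explain-second separation described in this section can be sketched in a few lines. This is not OpenProp's actual code: the table schema, the function names, and the rows are all hypothetical, the prices are invented, and the model call is left out entirely. The only thing the sketch is meant to show is that every number in the prompt comes from a filtered database query, never from the model.

```python
import sqlite3
import statistics


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory table with a few made-up rows (not real sales)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (district TEXT, ptype TEXT, year INTEGER, price INTEGER)"
    )
    rows = [
        ("E3", "F", 2025, 300_000),
        ("E3", "F", 2025, 320_000),
        ("E3", "F", 2025, 350_000),
        ("E3", "T", 2025, 700_000),  # not a flat, must be filtered out
        ("E3", "F", 2024, 310_000),  # wrong year, must be filtered out
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn


def retrieve_facts(conn: sqlite3.Connection, district: str, ptype: str, year: int) -> dict:
    """Step 1: compute every number from the database with explicit filters."""
    cur = conn.execute(
        "SELECT price FROM sales WHERE district = ? AND ptype = ? AND year = ?",
        (district, ptype, year),
    )
    prices = [row[0] for row in cur]
    return {
        "district": district,
        "year": year,
        "transactions": len(prices),
        "median_price": statistics.median(prices),
    }


def build_prompt(facts: dict) -> str:
    """Step 2: give the model only retrieved figures and ask it to explain them."""
    lines = [f"{key}: {value}" for key, value in facts.items()]
    header = "Explain these figures in plain English. Use only the numbers below."
    return header + "\n" + "\n".join(lines)


conn = build_toy_db()
facts = retrieve_facts(conn, "E3", "F", 2025)
prompt = build_prompt(facts)
print(prompt)  # this text, not the raw question, is what the model would receive
```

Because the filters are explicit query parameters, the same question always produces the same figures, and any disagreement can be traced to a filter rather than to the model's recall.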
The Verification
Every claim in this article has been verified by running Python queries in Kaggle against the official HM Land Registry Price Paid Data file. The verification notebook is publicly available — 13 query cells, each printing a MATCH or MISMATCH verdict against the raw data:
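For readers who have not opened the notebook, a verdict cell might look something like the sketch below. The notebook's actual code is not reproduced here; the function name and tolerance are my own assumptions, and the counts are the ones quoted under Error 4 rather than values recomputed from the raw file.

```python
def verdict(label: str, claimed: float, computed: float, tol: float = 0.05) -> str:
    """Print and return MATCH if the claimed figure agrees with the computed one.

    The default tolerance of 0.05 allows for rounding to one decimal place.
    """
    status = "MATCH" if abs(claimed - computed) <= tol else "MISMATCH"
    print(f"{status}: {label} (claimed {claimed}, computed {computed:.2f})")
    return status


# Share of Tower Hamlets flats sold below £500k in 2025, from the quoted counts.
share_below_500k = 1_401 / 2_370 * 100

exact_result = verdict("Share below £500k, exact count (%)", 59.1, share_below_500k)
rounded_result = verdict("Share below £500k, rounded estimate (%)", 55.0, share_below_500k)
```

In a real cell the computed value would come straight from a query on the Price Paid Data file, so the verdict depends on nothing but the raw data and the stated filters.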
View the verification notebook on Kaggle
The Land Registry data is Crown copyright, published under the Open Government Licence v3.0.
Try It Yourself
OpenProp is live now at openprop.co.uk — £1.99 for 10 queries, valid for 7 days. Every answer is grounded in Land Registry data. The pipeline queries first, the model explains second.
There could still be niche areas that haven't been tested. If you find a wrong response, please let us know, but first make sure you are applying the same data filters. Read the methodology here: How OpenProp Works
If you find a number that does not match the official data while using the same filters, I want to know. That is the standard I am holding this to.
