Document AI Benchmark: GPT vs. Claude vs. Open-Source on Real-World Invoices
Model selection is one of the first questions teams ask when building document AI systems. Should you use GPT-4o? Claude? An open-source model? We ran a comprehensive benchmark to find out — and the results may surprise you.
Benchmark Setup
We tested the following models on a dataset of 500 real-world invoices from 50 different vendors:
- GPT-4o (OpenAI)
- GPT-4.1 (OpenAI)
- Claude Sonnet 4 (Anthropic)
- Gemini 2.5 Pro (Google)
- Llama 4 Scout (Meta, open-source)
- Qwen 2.5 VL 72B (Alibaba, open-source)
- Platform Orchestrated — our pipeline using GPT-4o as the extraction model
What We Measured
- Extraction accuracy — percentage of fields correctly extracted
- Classification accuracy — correct document type identification
- Hallucination rate — percentage of fields where the model invented data not in the document
- Average latency — time to process one document
- Cost per document — API cost for processing
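The five metrics above are straightforward aggregations over per-document results. As a minimal sketch (the `results` record layout here is illustrative, not our actual benchmark harness):

```python
def summarize(results):
    """Aggregate benchmark metrics.

    results: list of dicts, one per document, each with keys:
      fields_total, fields_correct, fields_hallucinated,
      type_correct (bool), latency_s, cost_usd.
    """
    n_docs = len(results)
    n_fields = sum(r["fields_total"] for r in results)
    return {
        # field-level metrics: denominator is total extracted fields
        "extraction_accuracy": sum(r["fields_correct"] for r in results) / n_fields,
        "hallucination_rate": sum(r["fields_hallucinated"] for r in results) / n_fields,
        # document-level metrics: denominator is document count
        "classification_accuracy": sum(r["type_correct"] for r in results) / n_docs,
        "avg_latency_s": sum(r["latency_s"] for r in results) / n_docs,
        "cost_per_doc": sum(r["cost_usd"] for r in results) / n_docs,
    }
```

Note that extraction accuracy and hallucination rate are field-level, while classification, latency, and cost are document-level.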
Methodology
Each model received the same prompt, the same OCR output, and the same field schema. For the “Platform Orchestrated” row, we used our full pipeline: premium OCR → classification → schema-driven extraction → business-rule validation → retry on failure.
Results
| Model | Extraction Accuracy | Classification | Hallucination Rate | Avg Latency | Cost/Doc |
|---|---|---|---|---|---|
| Platform (orchestrated) | 96.8% | 99.2% | 0.3% | 4.2s | $0.08 |
| GPT-4.1 | 94.1% | 98.5% | 1.4% | 3.1s | $0.05 |
| GPT-4o | 93.2% | 97.8% | 2.1% | 2.8s | $0.06 |
| Claude Sonnet 4 | 92.7% | 97.1% | 1.8% | 3.4s | $0.07 |
| Gemini 2.5 Pro | 91.9% | 96.5% | 2.3% | 2.5s | $0.04 |
| Qwen 2.5 VL 72B | 89.4% | 94.2% | 3.1% | 5.8s | $0.02 |
| Llama 4 Scout | 87.6% | 93.1% | 3.8% | 6.2s | $0.01 |
Key Findings
1. Orchestration beats raw model performance
The orchestrated pipeline achieved 96.8% accuracy using GPT-4o as its base model — compared to 93.2% for GPT-4o alone. That’s a 3.6 percentage point improvement from orchestration alone.
The accuracy lift comes from:
- Premium OCR providing better text input than VLM-only OCR
- Schema-driven extraction reducing ambiguity
- Business-rule validation catching obvious errors
- Retry logic correcting failures on second attempt
2. Hallucination control is the real differentiator
Raw model hallucination rates ranged from 1.4% to 3.8%. Since the rate is measured per field, a 2% hallucination rate means roughly 10 invented values in every 500 extracted fields — unacceptable in production, because each one is a plausible-looking value that appears nowhere in the document.
The orchestrated pipeline reduced hallucination to 0.3% through:
- Cross-referencing extracted values against OCR text (if the value isn’t in the document, reject it)
- Business rules checking data types, formats, and ranges
- Confidence thresholds flagging uncertain extractions for human review
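The first check, grounding extracted values against the OCR text, can be sketched as below. The normalization rules and the `filter_hallucinations` helper are illustrative simplifications of the idea, not the production implementation.

```python
import re

def normalize(s):
    """Strip whitespace, commas, and currency symbols for fuzzy matching."""
    return re.sub(r"[\s,$]", "", s.lower())

def grounded(value, ocr_text):
    """True only if the extracted value actually appears in the document."""
    return normalize(value) in normalize(ocr_text)

def filter_hallucinations(fields, ocr_text, confidences, threshold=0.8):
    """Accept grounded, high-confidence values; route the rest to review."""
    accepted, review = {}, {}
    for name, value in fields.items():
        if grounded(value, ocr_text) and confidences.get(name, 0) >= threshold:
            accepted[name] = value
        else:
            review[name] = value  # not in the document, or low confidence
    return accepted, review
```

A value the model invented simply never matches the source text, so it can never reach the accepted set silently.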
3. Cost per document is comparable across models
The difference between the cheapest ($0.01/doc for Llama) and most expensive ($0.08/doc for orchestrated) is small in absolute terms. At enterprise scale (10,000 docs/month), the total cost difference is $700/month — negligible compared to the labor savings.
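That back-of-the-envelope figure follows directly from the per-document prices in the table:

```python
# Monthly cost delta between the cheapest and most expensive options,
# using the per-document prices from the results table above.
docs_per_month = 10_000
cost_cheapest = 0.01   # Llama 4 Scout, $/doc
cost_highest = 0.08    # orchestrated pipeline, $/doc

monthly_delta = (cost_highest - cost_cheapest) * docs_per_month  # ~$700
```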
Optimize for accuracy and reliability, not API cost.
4. Open-source models are viable but not leading
Llama 4 Scout and Qwen 2.5 VL delivered respectable results (87–89% accuracy) and are viable options for organizations with data residency requirements. However, they lag behind commercial models by 5–7 percentage points and have higher hallucination rates.
What This Means for Your Project
- Don’t pick a model and call it done. Raw model performance is only part of the story. Orchestration delivers the accuracy lift that matters for production.
- Hallucination is your #1 risk. A model that’s 93% accurate but hallucinates 2% of the time is more dangerous than a model that’s 90% accurate with 0.5% hallucination.
- Test on your documents. These benchmarks are on invoices. Your mortgage documents, insurance claims, or medical records will produce different results. Use our Benchmark Arena to evaluate on your own document types.
Want to run these benchmarks on your own documents? Start a free trial and use the built-in evaluation tools.