Document AI Benchmark: GPT vs. Claude vs. Open-Source on Real-World Invoices
Model selection is one of the first questions teams ask when building document AI systems. Should you use GPT-4o? Claude? An open-source model? We ran a comprehensive benchmark to find out — and the results may surprise you.
Benchmark Setup
We tested the following models on a dataset of 500 real-world invoices from 50 different vendors:
- GPT-4o (OpenAI)
- GPT-4.1 (OpenAI)
- Claude Sonnet 4 (Anthropic)
- Gemini 2.5 Pro (Google)
- Llama 4 Scout (Meta, open-source)
- Qwen 2.5 VL 72B (Alibaba, open-source)
- Platform Orchestrated — our pipeline using GPT-4o as the extraction model
What We Measured
- Extraction accuracy — percentage of fields correctly extracted
- Classification accuracy — correct document type identification
- Hallucination rate — percentage of fields where the model invented data not in the document
- Average latency — time to process one document
- Cost per document — API cost for processing
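The five metrics above are straightforward aggregations over per-document results. As a minimal sketch (the `results` record layout here is illustrative, not our actual benchmark harness):

```python
def summarize(results):
    """Aggregate benchmark metrics.

    results: list of dicts, one per document, each with keys:
      fields_total, fields_correct, fields_hallucinated,
      type_correct (bool), latency_s, cost_usd.
    """
    n_docs = len(results)
    n_fields = sum(r["fields_total"] for r in results)
    return {
        # field-level metrics: denominator is total extracted fields
        "extraction_accuracy": sum(r["fields_correct"] for r in results) / n_fields,
        "hallucination_rate": sum(r["fields_hallucinated"] for r in results) / n_fields,
        # document-level metrics: denominator is document count
        "classification_accuracy": sum(r["type_correct"] for r in results) / n_docs,
        "avg_latency_s": sum(r["latency_s"] for r in results) / n_docs,
        "cost_per_doc": sum(r["cost_usd"] for r in results) / n_docs,
    }
```

Note that extraction accuracy and hallucination rate are field-level, while classification, latency, and cost are document-level.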
Methodology
Each model received the same prompt, the same OCR output, and the same field schema. For the “Platform Orchestrated” row, we used our full pipeline: premium OCR → classification → schema-driven extraction → business-rule validation → retry on failure.
Results
| Model | Extraction Accuracy | Classification | Hallucination Rate | Avg Latency | Cost/Doc |
|---|---|---|---|---|---|
| Platform (orchestrated) | 96.8% | 99.2% | 0.3% | 4.2s | $0.08 |
| GPT-4.1 | 94.1% | 98.5% | 1.4% | 3.1s | $0.05 |
| GPT-4o | 93.2% | 97.8% | 2.1% | 2.8s | $0.06 |
| Claude Sonnet 4 | 92.7% | 97.1% | 1.8% | 3.4s | $0.07 |
| Gemini 2.5 Pro | 91.9% | 96.5% | 2.3% | 2.5s | $0.04 |
| Qwen 2.5 VL 72B | 89.4% | 94.2% | 3.1% | 5.8s | $0.02 |
| Llama 4 Scout | 87.6% | 93.1% | 3.8% | 6.2s | $0.01 |
Key Findings
1. Orchestration beats raw model performance
The orchestrated pipeline achieved 96.8% accuracy using GPT-4o as its base model — compared to 93.2% for GPT-4o alone. That’s a 3.6 percentage point improvement from orchestration alone.
The accuracy lift comes from:
- Premium OCR providing better text input than VLM-only OCR
- Schema-driven extraction reducing ambiguity
- Business-rule validation catching obvious errors
- Retry logic correcting failures on second attempt
2. Hallucination control is the real differentiator
Raw model hallucination rates ranged from 1.4% to 3.8%. Since the rate is measured per field, a 2% hallucination rate means roughly 10 invented values in every 500 extracted fields — unacceptable in production, because each one is a plausible-looking value that appears nowhere in the document.
The orchestrated pipeline reduced hallucination to 0.3% through:
- Cross-referencing extracted values against OCR text (if the value isn’t in the document, reject it)
- Business rules checking data types, formats, and ranges
- Confidence thresholds flagging uncertain extractions for human review
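The first check, grounding extracted values against the OCR text, can be sketched as below. The normalization rules and the `filter_hallucinations` helper are illustrative simplifications of the idea, not the production implementation.

```python
import re

def normalize(s):
    """Strip whitespace, commas, and currency symbols for fuzzy matching."""
    return re.sub(r"[\s,$]", "", s.lower())

def grounded(value, ocr_text):
    """True only if the extracted value actually appears in the document."""
    return normalize(value) in normalize(ocr_text)

def filter_hallucinations(fields, ocr_text, confidences, threshold=0.8):
    """Accept grounded, high-confidence values; route the rest to review."""
    accepted, review = {}, {}
    for name, value in fields.items():
        if grounded(value, ocr_text) and confidences.get(name, 0) >= threshold:
            accepted[name] = value
        else:
            review[name] = value  # not in the document, or low confidence
    return accepted, review
```

A value the model invented simply never matches the source text, so it can never reach the accepted set silently.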
3. Cost per document is comparable across models
The difference between the cheapest ($0.01/doc for Llama) and most expensive ($0.08/doc for orchestrated) is small in absolute terms. At enterprise scale (10,000 docs/month), the total cost difference is $700/month — negligible compared to the labor savings.
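That back-of-the-envelope figure follows directly from the per-document prices in the table:

```python
# Monthly cost delta between the cheapest and most expensive options,
# using the per-document prices from the results table above.
docs_per_month = 10_000
cost_cheapest = 0.01   # Llama 4 Scout, $/doc
cost_highest = 0.08    # orchestrated pipeline, $/doc

monthly_delta = (cost_highest - cost_cheapest) * docs_per_month  # ~$700
```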
Optimize for accuracy and reliability, not API cost.
4. Open-source models are viable but not leading
Llama 4 Scout and Qwen 2.5 VL delivered respectable results (87–89% accuracy) and are viable options for organizations with data residency requirements. However, they lag behind commercial models by 5–7 percentage points and have higher hallucination rates.
What This Means for Your Project
- Don’t pick a model and call it done. Raw model performance is only part of the story. Orchestration delivers the accuracy lift that matters for production.
- Hallucination is your #1 risk. A model that’s 93% accurate but hallucinates 2% of the time is more dangerous than a model that’s 90% accurate with 0.5% hallucination.
- Test on your documents. These benchmarks are on invoices. Your mortgage documents, insurance claims, or medical records will produce different results. Use our Benchmark Arena to evaluate on your own document types.
Want to run these benchmarks on your own documents? Start a free trial and use the built-in evaluation tools.