Document AI Benchmark Arena
A model alone is not enough; orchestration wins. Independent benchmarks of leading AI models on real-world document processing tasks.
Leaderboard
Showing results for All Documents (500 documents, 5 types). Last updated: 2026-02-14.
| # | Model | Type | Extraction Accuracy | Classification | Hallucination Rate | Avg Latency | Cost/Doc |
|---|---|---|---|---|---|---|---|
| 1 | Platform (orchestrated) | Orchestrated Pipeline | 95.4% | 99% | 0.4% | 5.2s | $0.10 |
| 2 | GPT-4.1 | Commercial VLM | 92.6% | 97.8% | 1.6% | 3.7s | $0.07 |
| 3 | Claude Sonnet 4 | Commercial VLM | 91.4% | 96.9% | 2.1% | 4.0s | $0.09 |
| 4 | GPT-4o | Commercial VLM | 91.3% | 96.9% | 2.4% | 3.4s | $0.08 |
| 5 | Gemini 2.5 Pro | Commercial VLM | 90.2% | 95.8% | 2.6% | 3.1s | $0.05 |
| 6 | Qwen 2.5 VL 72B | Open-Source VLM | 87.6% | 93.2% | 3.5% | 6.7s | $0.02 |
| 7 | Llama 4 Scout | Open-Source VLM | 85.4% | 91.8% | 4.3% | 7.2s | $0.01 |
Key Findings
Orchestration beats raw models by 3–10 percentage points
The platform's orchestrated pipeline consistently outperforms the best raw model across every document type and metric.
Hallucination drops from 2–4% to 0.4%
Business-rule validation and OCR cross-referencing eliminate almost all hallucinated values from extraction results.
Commercial VLMs lead, open-source closing the gap
GPT-4.1 and Claude Sonnet 4 lead on accuracy, but open-source models like Qwen 2.5 VL are within 5 points — viable for data-residency use cases.
Mortgage docs are the hardest benchmark
Accuracy drops 3–5 percentage points on mortgage documents vs. invoices across all models, due to multi-page complexity, handwriting, and format variation.
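The OCR cross-referencing behind the hallucination finding can be sketched in a few lines: flag any extracted value that never appears in the source OCR text. This is a minimal illustration, not the platform's actual validator; `flag_hallucinations` and its normalization rules are hypothetical.

```python
import re

def normalize(s: str) -> str:
    """Lowercase and strip whitespace, currency symbols, and separators
    so '$1,250.00' and '$ 1,250.00' compare equal."""
    return re.sub(r"[\s,.$-]", "", s.lower())

def flag_hallucinations(extracted: dict, ocr_text: str) -> list[str]:
    """Return the names of fields whose values never occur in the OCR text.
    A value the model 'extracted' but the page never contained is a
    likely hallucination."""
    haystack = normalize(ocr_text)
    return [field for field, value in extracted.items()
            if value and normalize(str(value)) not in haystack]

fields = {"invoice_number": "INV-1042", "total": "$1,250.00", "vendor": "Acme Corp"}
text = "ACME CORP  Invoice INV-1042 ... Amount Due: $1,250.00"
print(flag_hallucinations(fields, text))  # → []
```

Real pipelines need fuzzier matching (OCR errors, reformatted dates), but even this substring check catches values invented from whole cloth.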
The Orchestration Advantage
Orchestrating a model through a structured pipeline consistently outperforms using the same model directly.
[Comparison graphic: Raw VLM Call vs. Platform Orchestrated]
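The orchestration loop described above can be sketched as extract, validate against business rules, then retry with the violations fed back into the prompt. This is an illustrative skeleton under assumed interfaces; `run_pipeline`, `extract`, and `validate` are hypothetical names, not the platform's API.

```python
def run_pipeline(extract, validate, document, max_retries=2):
    """Orchestrate a raw model call: extract fields, check business rules,
    and retry with rule violations as feedback until clean or out of retries."""
    feedback = None
    for attempt in range(max_retries + 1):
        fields = extract(document, feedback)
        errors = validate(fields)
        if not errors:
            return fields, attempt
        feedback = errors  # violations go back into the next prompt
    return fields, attempt

# Toy example: one business rule — line items must sum to the total.
def validate(fields):
    return [] if sum(fields["line_items"]) == fields["total"] else ["total mismatch"]

def extract(document, feedback):
    # Simulated model: the first pass returns a wrong total; the retry,
    # prompted with the violation, returns a consistent one.
    return {"line_items": [100, 150], "total": 999 if feedback is None else 250}

fields, attempts = run_pipeline(extract, validate, b"...")
print(fields["total"], attempts)  # → 250 1
```

The value of the loop is that a deterministic rule, not the model, decides whether the answer is acceptable.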
Methodology
Dataset: 500 real-world documents across 5 types: Invoices, W-2 Forms, Bank Statements, Insurance Claims, Mortgage Applications.
Scoring: Field-level exact match with fuzzy matching for names and addresses (Levenshtein distance ≤ 2). Dates must match in any standard format. Numbers must match after normalization.
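The scoring rules above can be made concrete with a short sketch: exact match by default, Levenshtein distance ≤ 2 for name and address fields, and numeric comparison after stripping currency symbols and separators. This is a plain-stdlib illustration of the stated rules, not the benchmark's actual evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def field_match(predicted: str, gold: str, fuzzy: bool = False) -> bool:
    """Exact match after case/whitespace folding; fuzzy fields
    (names, addresses) tolerate an edit distance of up to 2."""
    p, g = predicted.strip().lower(), gold.strip().lower()
    return levenshtein(p, g) <= 2 if fuzzy else p == g

def number_match(predicted: str, gold: str) -> bool:
    """Numbers match after normalization: '$1,250.00' == '1250'."""
    norm = lambda s: float(s.replace("$", "").replace(",", ""))
    return norm(predicted) == norm(gold)

print(field_match("Jonh Smith", "John Smith", fuzzy=True))  # → True (distance 2)
```

Date matching (any standard format) would additionally require parsing both sides to a canonical date before comparing.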
Evaluation Pipeline: Each model receives identical OCR output (Azure AI Document Intelligence premium) and identical extraction schema. The orchestrated pipeline adds business-rule validation and retry logic.
Reproducibility: All prompts, schemas, and evaluation scripts are available upon request. Contact us for access.
Run your own benchmarks
Upload your documents, compare models, and see the results for your use case.