Document AI Benchmark Arena

A model alone is not enough; orchestration wins. Independent benchmarks on real-world document processing tasks across leading AI models.


Leaderboard

Showing results for All Documents (500 documents, 5 types). Last updated: 2026-02-14.

| # | Model | Type | Extraction Accuracy | Classification | Hallucination Rate | Avg Latency | Cost/Doc |
|---|-------|------|---------------------|----------------|--------------------|-------------|----------|
| 1 | Platform (orchestrated) | Orchestrated Pipeline | 95.4% | 99.0% | 0.4% | 5.2s | $0.10 |
| 2 | GPT-4.1 | Commercial VLM | 92.6% | 97.8% | 1.6% | 3.7s | $0.07 |
| 3 | Claude Sonnet 4 | Commercial VLM | 91.4% | 96.9% | 2.1% | 4.0s | $0.09 |
| 4 | GPT-4o | Commercial VLM | 91.3% | 96.9% | 2.4% | 3.4s | $0.08 |
| 5 | Gemini 2.5 Pro | Commercial VLM | 90.2% | 95.8% | 2.6% | 3.1s | $0.05 |
| 6 | Qwen 2.5 VL 72B | Open-Source VLM | 87.6% | 93.2% | 3.5% | 6.7s | $0.02 |
| 7 | Llama 4 Scout | Open-Source VLM | 85.4% | 91.8% | 4.3% | 7.2s | $0.01 |

Key Findings

Orchestration beats the raw model by 4–10 points

The platform's orchestrated pipeline consistently outperforms the same underlying model called directly, across every document type and metric.

Hallucination drops from 2–4% to 0.4%

Business-rule validation and OCR cross-referencing eliminate almost all hallucinated values from extraction results.
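The cross-referencing idea can be sketched simply: every extracted value must be traceable to the OCR text, otherwise it is flagged as a likely hallucination and dropped or re-extracted. A minimal illustration (the function names are ours, not the platform's API):

```python
def normalize(s: str) -> str:
    """Lowercase and collapse whitespace for tolerant substring matching."""
    return " ".join(str(s).lower().split())

def flag_hallucinations(extracted: dict, ocr_text: str) -> list:
    """Return field names whose values never appear in the OCR text.

    A value absent from the source document was likely invented by the
    model; the pipeline can discard it or trigger a retry.
    """
    haystack = normalize(ocr_text)
    return [field for field, value in extracted.items()
            if value and normalize(value) not in haystack]

fields = {"invoice_number": "INV-1042", "total": "512.00", "tax": "99.99"}
ocr = "ACME CORP  Invoice INV-1042  Total due: 512.00"
print(flag_hallucinations(fields, ocr))  # → ['tax']
```

Real pipelines need fuzzier matching (OCR noise, reformatted numbers), but even this naive substring check catches values with no support in the document at all.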

Commercial VLMs lead, open-source closing the gap

GPT-4.1 and Claude Sonnet 4 lead on accuracy, but open-source models like Qwen 2.5 VL are within 5 points — viable for data-residency use cases.

Mortgage docs are the hardest benchmark

Accuracy drops 3–5% on mortgage documents vs. invoices across all models, due to multi-page complexity, handwriting, and format variation.

The Orchestration Advantage

Orchestrating a model through a structured pipeline consistently outperforms using the same model directly.

Raw VLM Call

1. Send the document image to the model
2. Ask for JSON extraction
3. Hope for the best

91.3% extraction accuracy (GPT-4o, mixed documents)
2.4% hallucination rate

Platform Orchestrated

1. Premium OCR → text + layout
2. Schema-driven extraction
3. Business-rule validation
4. Retry on failure

95.4% extraction accuracy (orchestrated, mixed documents)
0.4% hallucination rate
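The steps above reduce to a small retry loop. In this sketch, `extract` and `validate` are stand-ins for the model call and the business-rule checks; neither is the platform's actual API:

```python
from typing import Callable

def orchestrate(ocr_text: str,
                extract: Callable[[str], dict],
                validate: Callable[[dict], list],
                max_attempts: int = 3) -> dict:
    """Extract fields from OCR text, validate them, retry on failure.

    `extract` maps OCR text to a field dict; `validate` returns a list of
    rule violations (empty means the extraction passed).
    """
    for attempt in range(1, max_attempts + 1):
        fields = extract(ocr_text)
        errors = validate(fields)
        if not errors:
            return {"fields": fields, "attempts": attempt, "errors": []}
    # Surface the failing rules instead of silently returning bad data.
    return {"fields": fields, "attempts": max_attempts, "errors": errors}
```

The accuracy gain comes less from any single step than from the loop itself: a validation failure becomes another model call rather than a silent error in the output.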

Methodology

Dataset: 500 real-world documents across 5 types: Invoices, W-2 Forms, Bank Statements, Insurance Claims, Mortgage Applications.

Scoring: Field-level exact match with fuzzy matching for names and addresses (Levenshtein distance ≤ 2). Dates must match in any standard format. Numbers must match after normalization.
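A minimal sketch of this scoring rule, assuming a plain dynamic-programming edit distance and a small set of date formats (the benchmark's exact format list is not published):

```python
from datetime import datetime

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def dates_match(a: str, b: str) -> bool:
    """True if both strings parse to the same date in any listed format."""
    formats = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")
    def parse(s):
        for f in formats:
            try:
                return datetime.strptime(s, f).date()
            except ValueError:
                pass
        return None
    pa, pb = parse(a), parse(b)
    return pa is not None and pa == pb

def numbers_match(a: str, b: str) -> bool:
    """Compare after stripping currency symbols and thousands separators."""
    strip = lambda s: s.replace(",", "").replace("$", "").strip()
    try:
        return float(strip(a)) == float(strip(b))
    except ValueError:
        return False

def field_match(kind: str, predicted: str, gold: str) -> bool:
    """Field-level match: fuzzy for names/addresses, normalized otherwise."""
    if kind in ("name", "address"):
        return levenshtein(predicted.lower(), gold.lower()) <= 2
    if kind == "date":
        return dates_match(predicted, gold)
    if kind == "number":
        return numbers_match(predicted, gold)
    return predicted == gold  # exact match for all other fields
```

For example, `field_match("name", "Jon Smith", "John Smith")` passes (edit distance 1), while a number off by one cent fails.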

Evaluation Pipeline: Each model receives identical OCR output (Azure AI Document Intelligence premium) and identical extraction schema. The orchestrated pipeline adds business-rule validation and retry logic.

Reproducibility: All prompts, schemas, and evaluation scripts are available upon request. Contact us for access.

Run your own benchmarks

Upload your documents, compare models, and see the results for your use case.

Read the full methodology on our blog