Document AI Benchmark Arena
A model alone is not enough; orchestration wins. Independent benchmarks of leading AI models on real-world document processing tasks.
Leaderboard
Showing results for All Documents (500 documents, 5 types). Last updated: 2026-02-14.
| # | Model | Type | Extraction Accuracy | Classification | Hallucination Rate | Avg Latency | Cost/Doc |
|---|---|---|---|---|---|---|---|
| 1 | Platform (orchestrated) | Orchestrated Pipeline | 95.4% | 99% | 0.4% | 5.2s | $0.10 |
| 2 | GPT-4.1 | Commercial VLM | 92.6% | 97.8% | 1.6% | 3.7s | $0.07 |
| 3 | Claude Sonnet 4 | Commercial VLM | 91.4% | 96.9% | 2.1% | 4.0s | $0.09 |
| 4 | GPT-4o | Commercial VLM | 91.3% | 96.9% | 2.4% | 3.4s | $0.08 |
| 5 | Gemini 2.5 Pro | Commercial VLM | 90.2% | 95.8% | 2.6% | 3.1s | $0.05 |
| 6 | Qwen 2.5 VL 72B | Open-Source VLM | 87.6% | 93.2% | 3.5% | 6.7s | $0.02 |
| 7 | Llama 4 Scout | Open-Source VLM | 85.4% | 91.8% | 4.3% | 7.2s | $0.01 |
Key Findings
Orchestration beats raw models by 3–10 percentage points
The platform's orchestrated pipeline consistently outperforms the best raw model across every document type and metric.
Hallucination drops from 2–4% to 0.4%
Business-rule validation and OCR cross-referencing eliminate almost all hallucinated values from extraction results.
Commercial VLMs lead, open-source closing the gap
GPT-4.1 and Claude Sonnet 4 lead on accuracy, but open-source models like Qwen 2.5 VL are within 5 points — viable for data-residency use cases.
Mortgage docs are the hardest benchmark
Accuracy drops 3–5 percentage points on mortgage documents vs. invoices across all models, due to multi-page complexity, handwriting, and format variation.
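The OCR cross-referencing behind the hallucination finding can be sketched in a few lines: flag any extracted value that never appears in the source OCR text. This is a minimal illustration, not the platform's actual validator; `flag_hallucinations` and its normalization rules are hypothetical.

```python
import re

def normalize(s: str) -> str:
    """Lowercase and strip whitespace, currency symbols, and separators
    so '$1,250.00' and '$ 1,250.00' compare equal."""
    return re.sub(r"[\s,.$-]", "", s.lower())

def flag_hallucinations(extracted: dict, ocr_text: str) -> list[str]:
    """Return the names of fields whose values never occur in the OCR text.
    A value the model 'extracted' but the page never contained is a
    likely hallucination."""
    haystack = normalize(ocr_text)
    return [field for field, value in extracted.items()
            if value and normalize(str(value)) not in haystack]

fields = {"invoice_number": "INV-1042", "total": "$1,250.00", "vendor": "Acme Corp"}
text = "ACME CORP  Invoice INV-1042 ... Amount Due: $1,250.00"
print(flag_hallucinations(fields, text))  # → []
```

Real pipelines need fuzzier matching (OCR errors, reformatted dates), but even this substring check catches values invented from whole cloth.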
The Orchestration Advantage
Orchestrating a model through a structured pipeline consistently outperforms using the same model directly.
[Comparison graphic: Raw VLM Call vs. Platform Orchestrated]
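The orchestration loop described above can be sketched as extract, validate against business rules, then retry with the violations fed back into the prompt. This is an illustrative skeleton under assumed interfaces; `run_pipeline`, `extract`, and `validate` are hypothetical names, not the platform's API.

```python
def run_pipeline(extract, validate, document, max_retries=2):
    """Orchestrate a raw model call: extract fields, check business rules,
    and retry with rule violations as feedback until clean or out of retries."""
    feedback = None
    for attempt in range(max_retries + 1):
        fields = extract(document, feedback)
        errors = validate(fields)
        if not errors:
            return fields, attempt
        feedback = errors  # violations go back into the next prompt
    return fields, attempt

# Toy example: one business rule — line items must sum to the total.
def validate(fields):
    return [] if sum(fields["line_items"]) == fields["total"] else ["total mismatch"]

def extract(document, feedback):
    # Simulated model: the first pass returns a wrong total; the retry,
    # prompted with the violation, returns a consistent one.
    return {"line_items": [100, 150], "total": 999 if feedback is None else 250}

fields, attempts = run_pipeline(extract, validate, b"...")
print(fields["total"], attempts)  # → 250 1
```

The value of the loop is that a deterministic rule, not the model, decides whether the answer is acceptable.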
Methodology
Dataset: 500 real-world documents across 5 types: Invoices, W-2 Forms, Bank Statements, Insurance Claims, Mortgage Applications.
Scoring: Field-level exact match with fuzzy matching for names and addresses (Levenshtein distance ≤ 2). Dates must match in any standard format. Numbers must match after normalization.
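The scoring rules above can be made concrete with a short sketch: exact match by default, Levenshtein distance ≤ 2 for name and address fields, and numeric comparison after stripping currency symbols and separators. This is a plain-stdlib illustration of the stated rules, not the benchmark's actual evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def field_match(predicted: str, gold: str, fuzzy: bool = False) -> bool:
    """Exact match after case/whitespace folding; fuzzy fields
    (names, addresses) tolerate an edit distance of up to 2."""
    p, g = predicted.strip().lower(), gold.strip().lower()
    return levenshtein(p, g) <= 2 if fuzzy else p == g

def number_match(predicted: str, gold: str) -> bool:
    """Numbers match after normalization: '$1,250.00' == '1250'."""
    norm = lambda s: float(s.replace("$", "").replace(",", ""))
    return norm(predicted) == norm(gold)

print(field_match("Jonh Smith", "John Smith", fuzzy=True))  # → True (distance 2)
```

Date matching (any standard format) would additionally require parsing both sides to a canonical date before comparing.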
Evaluation Pipeline: Each model receives identical OCR output (Azure AI Document Intelligence premium) and identical extraction schema. The orchestrated pipeline adds business-rule validation and retry logic.
Reproducibility: All prompts, schemas, and evaluation scripts are available upon request. Contact us for access.
Run your own benchmarks
Upload your documents, compare models, and see the results for your use case.