IDP Leaderboard: How We Evaluate Document AI
23 models, 3 benchmarks, 4 core IDP tasks. All evaluation code is open source.
Sponsored by Nanonets
Why
Existing VLM leaderboards (OpenVLM, Chatbot Arena, LiveBench) test general capabilities. None of them focus on the tasks that matter for document processing pipelines: pulling fields out of invoices, reading handwritten text, parsing tables, answering questions about charts. This leaderboard fills that gap.
Tasks
The IDP Core benchmark evaluates four tasks. Each task uses multiple datasets; the task score is the mean across its datasets.
Key Information Extraction (KIE)
Extract structured key/value pairs (invoice numbers, dates, totals) from documents. Datasets: Nanonets KIE, DocILE, Handwritten Forms. Metric: edit distance accuracy.
Visual Question Answering (VQA)
Answer natural language questions about document content, including charts and plots. Datasets: ChartQA, DocVQA. Metric: edit distance accuracy.
Optical Character Recognition (OCR)
Transcribe text from images across handwritten, rotated, and diacritics scenarios. Datasets: OCR Handwriting, OCR Handwriting Rotated, OCR Digital Diacritics. Metric: edit distance accuracy.
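Edit distance accuracy is typically defined as one minus the Levenshtein distance normalized by the longer string's length. A minimal sketch of that metric (the exact normalization in the docext repo, e.g. lowercasing or whitespace stripping, may differ):

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[-1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_distance_accuracy(pred: str, target: str) -> float:
    # 1 - normalized edit distance; two empty strings count as a perfect match.
    if not pred and not target:
        return 1.0
    return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))
```

For example, `edit_distance_accuracy("kitten", "sitting")` is 1 − 3/7, since the two strings are three edits apart.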
Table Extraction
Parse tabular data into structured output. Covers sparse, dense, structured, and unstructured table layouts at both short and long document lengths (6 sub-datasets). Metric: GriTS.
Scoring
For each task, compute the mean score across its datasets. The overall leaderboard score is the unweighted mean of all task scores.
TaskScore_i = mean(S_{i,1}, S_{i,2}, ..., S_{i,n_i})
OverallScore = mean(TaskScore_1, TaskScore_2, ..., TaskScore_T)
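The two-level averaging above can be sketched in a few lines; the dataset scores here are illustrative numbers, not real leaderboard results:

```python
from statistics import mean

# Illustrative per-dataset scores for one model (not real results).
dataset_scores = {
    "KIE":   [0.90, 0.80, 0.85],   # Nanonets KIE, DocILE, Handwritten Forms
    "VQA":   [0.82, 0.88],         # ChartQA, DocVQA
    "OCR":   [0.70, 0.80, 0.75],   # Handwriting, Rotated, Digital Diacritics
    "Table": [0.95],               # GriTS, averaged over its sub-datasets
}

# Mean per task, then an unweighted mean across tasks.
task_scores = {task: mean(scores) for task, scores in dataset_scores.items()}
overall = mean(task_scores.values())
```

Because the overall score is an unweighted mean of task means, a task with one dataset counts as much as a task with three.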
Edit distance accuracy is used for KIE, OCR, and VQA. Classification uses exact match. Table extraction uses GriTS.
How It Works
Each model gets the same prompt and document image. OCR and VQA prompts expect a plain text response. KIE and table extraction prompts expect JSON output matching a specified schema. All datasets include ground truth annotations. We run inference once per (model, document) pair, then score against ground truth using the task-appropriate metric.
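For the JSON-output tasks, scoring a single (model, document) pair might look like the sketch below: parse the model's raw response, then average per-field edit-distance accuracy against the ground-truth annotations. This is a simplified illustration; the docext pipeline's parsing and normalization rules may differ, and treating unparseable JSON as a zero score is an assumption:

```python
import json

def _lev(a: str, b: str) -> int:
    # Standard Levenshtein distance (row-by-row DP).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def score_kie_response(raw: str, ground_truth: dict) -> float:
    """Mean per-field edit-distance accuracy of a model's JSON output."""
    try:
        pred = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0  # assumption: output that violates the schema scores zero
    scores = []
    for key, gt in ground_truth.items():
        p, g = str(pred.get(key, "")), str(gt)
        scores.append(1.0 if p == g == "" else 1.0 - _lev(p, g) / max(len(p), len(g)))
    return sum(scores) / len(scores)

gt = {"invoice_number": "INV-001", "total": "100.00"}
perfect = score_kie_response('{"invoice_number": "INV-001", "total": "100.00"}', gt)
```

A missing field simply contributes a low per-field score rather than invalidating the whole document, so partially correct extractions still earn partial credit.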
The full evaluation pipeline is in the docext repo. You can reproduce every number on the leaderboard.
Current Results
Snapshot from the live leaderboard (23 models evaluated). See the full leaderboard for complete data.
| Category | Best Model | Score |
|---|---|---|
| Overall | Gemini 3.1 Pro (Google) | 83.2% |
| KIE | Gemini-3-Flash (Google) | 91.1% |
| OCR | Gemini 3.1 Pro (Google) | 82.8% |
| Table | Gemini 3.1 Pro (Google) | 96.4% |
| VQA | Gemini 3.1 Pro (Google) | 85.0% |
What's Next
We are adding new models on a rolling basis and will rotate datasets to prevent overfitting. Want a specific model evaluated? Open a discussion on GitHub.
Cite
@misc{IDPLeaderboard,
title={IDPLeaderboard: A Unified Leaderboard for
Intelligent Document Processing Tasks},
author={Souvik Mandal and Nayancy Gupta and
Ashish Talewar and Paras Ahuja and
Prathamesh Juvatkar and Gourinath Banda},
howpublished={https://idp-leaderboard.org/},
year={2025},
}

Souvik Mandal¹, Nayancy Gupta², Ashish Talewar¹, Paras Ahuja¹, Prathamesh Juvatkar¹, Gourinath Banda²
¹Nanonets ²IIT Indore