Intelligent Document Processing Leaderboard

A unified leaderboard for OCR, KIE, classification, QA, table extraction, and confidence score evaluation

This work is sponsored by Nanonets.

About the Leaderboard

The Intelligent Document Processing (IDP) Leaderboard provides a comprehensive evaluation framework for assessing the capabilities of various AI models in document understanding and processing tasks. This benchmark covers seven critical aspects of document intelligence:

  • Key Information Extraction (KIE): Evaluating the ability to extract structured information from documents
  • Visual Question Answering (VQA): Testing comprehension of document content through questions
  • Optical Character Recognition (OCR): Measuring text recognition accuracy across different document types
  • Document Classification: Assessing categorization capabilities
  • Long Document Processing: Evaluating performance on lengthy documents
  • Table Extraction: Testing tabular data understanding and extraction
  • Confidence Score: Measuring the reliability and calibration of model predictions

The leaderboard aims to provide researchers and practitioners with a standardized way to compare model performance across these diverse document processing tasks. Each task is evaluated using carefully curated datasets that represent real-world document processing challenges.

Leaderboard

Rank Model Cost AVG KIE VQA OCR Classification LongDocBench Table
1 gemini-2.5-flash-preview-04-17 0.133 81.00 77.99 85.16 78.90 99.05 69.08 75.82
2 o4-mini-2025-04-16 2.595 78.56 75.43 87.07 72.82 99.14 66.13 70.76
3 gpt-4.1-2025-04-14 1.583 78.05 72.68 80.37 75.64 99.27 66.00 74.34
4 gemini-2.0-flash 0.022 77.62 77.22 82.03 80.05 99.10 56.01 71.32
5 gpt-4o-2024-08-06 1.979 75.40 71.83 79.08 74.56 95.74 66.90 64.30
6 llama-4-maverick(400B-A17B) 0.058 70.80 73.30 80.10 70.66 98.84 27.74 74.15
7 gpt-4o-mini-2024-07-18 2.990 69.95 70.03 72.86 72.43 98.41 55.48 50.47
8 qwen2.5-vl-72b-instruct 0.242 68.48 76.11 80.10 69.61 99.01 37.47 48.58
9 mistral-small-3.1-24b-instruct 0.020 61.50 63.73 71.50 51.01 91.86 29.23 61.64
10 gpt-4o-2024-11-20 1.868 60.08 70.91 75.60 74.91 14.38 63.95 60.74

1. Cost is the average cost in cents per request for each model. For example, at 0.133 cents per request, 1,000 requests to gemini-2.5-flash-preview-04-17 cost roughly $1.33.

2. The score for each task in the leaderboard is the average across all the datasets for the corresponding task.

3. We compute edit-distance accuracy for all tasks and datasets, except classification, where we use exact match, and table extraction, where we use GriTS. Please check our paper for more details; a rough illustration of the edit-distance metric is sketched below.
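
The exact scoring code is described in the paper. As a non-authoritative illustration, here is a minimal sketch of edit-distance accuracy, assuming the common normalization 1 - distance / max(len(pred), len(target)); the paper's precise formulation (e.g., case or whitespace handling) may differ.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def edit_distance_accuracy(pred: str, target: str) -> float:
    """1 minus normalized edit distance; 1.0 means an exact match."""
    if not pred and not target:
        return 1.0
    dist = edit_distance(pred, target)
    return 1.0 - dist / max(len(pred), len(target))

# Hypothetical example: a near-miss prediction with one wrong character.
print(edit_distance_accuracy("Invoice #1024", "Invoice #1023"))  # 12/13 ≈ 0.923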

Key Information Extraction (KIE) Leaderboard

Key Information Extraction (KIE) evaluates a model's ability to identify and extract specific information from documents, such as names, dates, amounts, and other structured data. This task measures how accurately models can locate and understand key entities within documents.
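
As a toy illustration (the field names and values below are hypothetical and not taken from the benchmark datasets), a KIE task typically asks the model to return a fixed set of fields as structured output:

# Hypothetical ground truth for an invoice-style document; the benchmark's actual
# field schemas are dataset-specific (Nanonets-KIE, Docile, Handwritten-Forms).
expected_fields = {
    "vendor_name": "Acme Supplies Ltd.",
    "invoice_number": "INV-2041",
    "invoice_date": "2024-03-17",
    "total_amount": "1,250.00",
    "currency": "USD",
}
# Each predicted value is then compared against the ground truth, here with
# edit-distance accuracy as described in the scoring notes above.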

Rank Model Avg Nanonets-KIE Docile Handwritten-Forms
1 gemini-2.5-flash-preview-04-17 77.99 91.29 63.35 79.34
2 gemini-2.0-flash 77.22 88.31 65.06 78.28
3 qwen2.5-vl-72b-instruct 76.11 90.52 58.37 79.45
4 o4-mini-2025-04-16 75.43 86.91 59.52 79.85
5 llama-4-maverick(400B-A17B) 73.30 85.78 61.70 72.43
6 gpt-4.1-2025-04-14 72.68 87.85 61.20 68.98
7 gpt-4o-2024-08-06 71.83 88.63 56.37 70.48
8 gpt-4o-2024-11-20 70.91 88.03 56.56 68.15
9 gpt-4o-mini-2024-07-18 70.03 86.37 60.45 63.26
10 mistral-small-3.1-24b-instruct 63.73 75.47 47.07 68.64
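
As a consistency check with note 2 above, each task score in the main leaderboard is the mean of the per-dataset scores listed here: for gemini-2.5-flash-preview-04-17, (91.29 + 63.35 + 79.34) / 3 ≈ 77.99, which matches its KIE column in the main table.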

Visual Question Answering (VQA) Leaderboard

Visual Question Answering (VQA) tests a model's ability to understand and answer questions about document content. This includes both text-based questions and questions that require understanding of the document's visual layout and structure.

Rank Model Avg ChartQA DocVQA
1 o4-mini-2025-04-16 87.07 87.78 86.35
2 gemini-2.5-flash-preview-04-17 85.16 84.82 85.51
3 gemini-2.0-flash 82.03 79.28 84.79
4 gpt-4.1-2025-04-14 80.37 77.76 82.97
5 qwen2.5-vl-72b-instruct 80.10 76.20 84.00
5 llama-4-maverick(400B-A17B) 80.10 72.81 87.39
6 gpt-4o-2024-08-06 79.08 75.27 82.89
7 gpt-4o-2024-11-20 75.60 72.34 78.87
8 gpt-4o-mini-2024-07-18 72.86 66.30 79.42
9 mistral-small-3.1-24b-instruct 71.50 66.16 76.83

Optical Character Recognition (OCR) Leaderboard

Optical Character Recognition (OCR) measures a model's ability to accurately convert images of text into machine-readable text. This includes handling various fonts, layouts, and document conditions while maintaining high accuracy in text recognition.

Rank Model Avg OCR-Handwriting OCR-Handwriting-Rotated OCR-Digital-Diacritics
1 gemini-2.0-flash 80.05 71.24 70.30 98.63
2 gemini-2.5-flash-preview-04-17 78.90 69.40 68.46 98.85
3 gpt-4.1-2025-04-14 75.64 65.40 62.64 98.87
4 gpt-4o-2024-11-20 74.91 64.39 61.58 98.75
5 gpt-4o-2024-08-06 74.56 64.48 60.79 98.39
6 o4-mini-2025-04-16 72.82 61.64 58.60 98.23
7 gpt-4o-mini-2024-07-18 72.43 61.04 59.13 97.13
8 llama-4-maverick(400B-A17B) 70.66 58.88 55.28 97.82
9 qwen2.5-vl-72b-instruct 69.61 56.43 57.45 94.95
10 mistral-small-3.1-24b-instruct 51.01 41.82 39.14 72.05

Document Classification Leaderboard

Document Classification evaluates how well models can categorize documents into predefined classes or types. This includes understanding document content, structure, and purpose to assign the correct category.

Rank Model Avg Nanonets-Cls
1 gpt-4.1-2025-04-14 99.27 99.27
2 o4-mini-2025-04-16 99.14 99.14
3 gemini-2.0-flash 99.10 99.10
4 gemini-2.5-flash-preview-04-17 99.05 99.05
5 qwen2.5-vl-72b-instruct 99.01 99.01
6 llama-4-maverick(400B-A17B) 98.84 98.84
7 gpt-4o-mini-2024-07-18 98.41 98.41
8 gpt-4o-2024-08-06 95.74 95.74
9 mistral-small-3.1-24b-instruct 91.86 91.86
10 gpt-4o-2024-11-20 14.38 14.38

Long Document Processing Leaderboard

Long Document Processing assesses a model's ability to process and understand lengthy documents. This includes maintaining context across multiple pages, understanding document structure, and accurately retrieving information from large documents.

Rank Model Avg Nanonets-LongDocBench
1 gemini-2.5-flash-preview-04-17 69.08 69.08
2 gpt-4o-2024-08-06 66.90 66.90
3 o4-mini-2025-04-16 66.13 66.13
4 gpt-4.1-2025-04-14 66.00 66.00
5 gpt-4o-2024-11-20 63.95 63.95
6 gemini-2.0-flash 56.01 56.01
7 gpt-4o-mini-2024-07-18 55.48 55.48
8 qwen2.5-vl-72b-instruct 37.47 37.47
9 mistral-small-3.1-24b-instruct 29.23 29.23
10 llama-4-maverick(400B-A17B) 27.74 27.74

Table Extraction Leaderboard

Table Extraction evaluates how well models can identify, understand, and extract tabular data from documents. This includes preserving table structure, relationships between cells, and accurately extracting both numerical and textual content.
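
This task is scored with GriTS (see note 3 above), which compares the predicted and ground-truth tables as 2D grids of cells. The sketch below is not GriTS itself (GriTS uses a 2D longest-common-subsequence style formulation); it only illustrates the grid representation with a naive exact-match cell fraction, and the table contents are hypothetical.

# Represent each table as a 2D grid of cell strings (row-major order).
ground_truth = [
    ["Item", "Qty", "Price"],
    ["Widget A", "2", "10.00"],
    ["Widget B", "1", "4.50"],
]
prediction = [
    ["Item", "Qty", "Price"],
    ["Widget A", "2", "10.00"],
    ["Widget B", "1", "450"],  # one mis-recognized cell
]

def naive_cell_accuracy(pred, gt):
    """Fraction of ground-truth cells matched exactly at the same position.
    A simplified stand-in for illustration only, NOT the GriTS metric."""
    total = sum(len(row) for row in gt)
    correct = sum(p == g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return correct / total

print(naive_cell_accuracy(prediction, ground_truth))  # 8/9 ≈ 0.889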

Rank Model Avg nanonets_small_sparse_structured nanonets_small_dense_structured nanonets_small_sparse_unstructured nanonets_long_dense_structured nanonets_long_sparse_structured nanonets_long_sparse_unstructured
1 gemini-2.5-flash-preview-04-17 75.82 89.00 94.30 60.36 89.99 84.33 36.92
2 gpt-4.1-2025-04-14 74.34 90.23 97.02 66.22 75.79 69.99 46.76
3 llama-4-maverick(400B-A17B) 74.15 89.94 97.90 52.57 92.50 86.24 25.74
4 gemini-2.0-flash 71.32 86.35 93.09 52.12 85.07 72.70 38.62
5 o4-mini-2025-04-16 70.76 95.48 98.70 66.64 66.56 68.51 28.65
6 gpt-4o-2024-08-06 64.30 76.14 94.11 61.00 65.04 54.11 35.38
7 mistral-small-3.1-24b-instruct 61.64 72.19 89.96 58.10 64.95 57.52 27.13
8 gpt-4o-2024-11-20 60.74 76.46 92.10 61.62 53.02 49.83 31.39
9 gpt-4o-mini-2024-07-18 50.47 57.31 82.50 51.34 53.37 32.60 25.67
10 qwen2.5-vl-72b-instruct 48.58 63.03 80.77 59.60 28.45 33.41 26.21

Confidence Score Leaderboard (WIP)

Confidence Score evaluation measures how well models can assess and provide reliable confidence scores for their predictions. This is crucial for building robust automated document processing systems, as it helps determine when human intervention is needed and ensures reliable automation.
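
This track is still a work in progress and no scoring details are published yet. As a hedged illustration of why calibrated confidences matter in an automation pipeline, the sketch below shows a simple confidence-threshold routing rule and a bucketed calibration check; the threshold, field names, and values are assumptions for illustration, not part of the benchmark.

# Hypothetical predictions: each extracted field carries a model-reported
# confidence and, for evaluation, whether the prediction turned out correct.
predictions = [
    {"field": "invoice_number", "confidence": 0.97, "correct": True},
    {"field": "total_amount",   "confidence": 0.62, "correct": False},
    {"field": "invoice_date",   "confidence": 0.88, "correct": True},
]

REVIEW_THRESHOLD = 0.80  # assumed operating point, not from the benchmark

# Routing rule: auto-accept high-confidence fields, send the rest to a human.
for p in predictions:
    route = "auto-accept" if p["confidence"] >= REVIEW_THRESHOLD else "human review"
    print(f'{p["field"]}: {route}')

# A well-calibrated model's confidence tracks its empirical accuracy: among
# fields predicted with ~0.9 confidence, roughly 90% should be correct.
def bucket_accuracy(preds, lo, hi):
    in_bucket = [p for p in preds if lo <= p["confidence"] < hi]
    return sum(p["correct"] for p in in_bucket) / len(in_bucket) if in_bucket else None

print(bucket_accuracy(predictions, 0.8, 1.0))  # 1.0 for this toy sample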

Rank Model Avg Datasets
- - - -

BibTeX

@misc{IDPLeaderboard,
  title={IDPLeaderboard: A Unified Leaderboard for Intelligent Document Processing Tasks},
  author={Souvik Mandal and Nayancy Gupta and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar and Gourinath Banda},
  howpublished={https://idp-leaderboard.org},
  year={2025},
}
Souvik Mandal*¹, Nayancy Gupta*², Ashish Talewar*¹, Paras Ahuja*¹, Prathamesh Juvatkar*¹, Gourinath Banda*²
¹Nanonets, ²IIT Indore