A unified leaderboard for OCR, KIE, VQA, classification, long document processing, table extraction, and confidence score evaluation
The Intelligent Document Processing (IDP) Leaderboard provides a comprehensive evaluation framework for assessing the capabilities of various AI models in document understanding and processing tasks. This benchmark covers seven critical aspects of document intelligence: Key Information Extraction (KIE), Visual Question Answering (VQA), Optical Character Recognition (OCR), document classification, long document processing, table extraction, and confidence score evaluation.
The leaderboard aims to provide researchers and practitioners with a standardized way to compare model performance across these diverse document processing tasks. Each task is evaluated using carefully curated datasets that represent real-world document processing challenges.
Rank | Model | Cost | AVG | KIE | VQA | OCR | Classification | LongDocBench | Table |
---|---|---|---|---|---|---|---|---|---|
1 | gemini-2.5-flash-preview-04-17 | 0.133 | 81.00 | 77.99 | 85.16 | 78.90 | 99.05 | 69.08 | 75.82 |
2 | o4-mini-2025-04-16 | 2.595 | 78.56 | 75.43 | 87.07 | 72.82 | 99.14 | 66.13 | 70.76 |
3 | gpt-4.1-2025-04-14 | 1.583 | 78.05 | 72.68 | 80.37 | 75.64 | 99.27 | 66.00 | 74.34 |
4 | gemini-2.0-flash | 0.022 | 77.62 | 77.22 | 82.03 | 80.05 | 99.10 | 56.01 | 71.32 |
5 | gpt-4o-2024-08-06 | 1.979 | 75.40 | 71.83 | 79.08 | 74.56 | 95.74 | 66.90 | 64.30 |
6 | llama-4-maverick(400B-A17B) | 0.058 | 70.80 | 73.30 | 80.10 | 70.66 | 98.84 | 27.74 | 74.15 |
7 | gpt-4o-mini-2024-07-18 | 2.990 | 69.95 | 70.03 | 72.86 | 72.43 | 98.41 | 55.48 | 50.47 |
8 | qwen2.5-vl-72b-instruct | 0.242 | 68.48 | 76.11 | 80.10 | 69.61 | 99.01 | 37.47 | 48.58 |
9 | mistral-small-3.1-24b-instruct | 0.020 | 61.50 | 63.73 | 71.50 | 51.01 | 91.86 | 29.23 | 61.64 |
10 | gpt-4o-2024-11-20 | 1.868 | 60.08 | 70.91 | 75.6 | 74.91 | 14.38 | 63.95 | 60.74 |
1. Cost represents the average cost in cents per request for each model.
2. The score for each task in the leaderboard is the average across all the datasets for the corresponding task.
3. We compute edit-distance accuracy for all tasks and datasets except classification, where we use exact match, and table extraction, where we use GriTS. Please see our paper for more details.
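For reference, below is a minimal sketch of how an edit-distance accuracy of this kind can be computed; the exact normalization used by the benchmark may differ, so treat it as illustrative rather than the official scoring code.

```python
# Minimal sketch of edit-distance accuracy: 1 - (Levenshtein distance / max length).
# This is a common formulation; the benchmark's exact normalization may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def edit_distance_accuracy(prediction: str, ground_truth: str) -> float:
    """Accuracy in [0, 1]; 1.0 means an exact match."""
    if not prediction and not ground_truth:
        return 1.0
    dist = levenshtein(prediction, ground_truth)
    return 1.0 - dist / max(len(prediction), len(ground_truth))

print(edit_distance_accuracy("Invoice #1234", "Invoice #1284"))  # ~0.92
```

Dataset-level scores of this kind are then averaged per task, as described in note 2 above.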
Key Information Extraction (KIE) evaluates a model's ability to identify and extract specific information from documents, such as names, dates, amounts, and other structured data. This task measures how accurately models can locate and understand key entities within documents.
Rank | Model | Avg | Nanonets-KIE | Docile | Handwritten-Forms |
---|---|---|---|---|---|
1 | gemini-2.5-flash-preview-04-17 | 77.99 | 91.29 | 63.35 | 79.34 |
2 | gemini-2.0-flash | 77.22 | 88.31 | 65.06 | 78.28 |
3 | qwen2.5-vl-72b-instruct | 76.11 | 90.52 | 58.37 | 79.45 |
4 | o4-mini-2025-04-16 | 75.43 | 86.91 | 59.52 | 79.85 |
5 | llama-4-maverick(400B-A17B) | 73.30 | 85.78 | 61.70 | 72.43 |
6 | gpt-4.1-2025-04-14 | 72.68 | 87.85 | 61.20 | 68.98 |
7 | gpt-4o-2024-08-06 | 71.83 | 88.63 | 56.37 | 70.48 |
8 | gpt-4o-2024-11-20 | 70.91 | 88.03 | 56.56 | 68.15 |
9 | gpt-4o-mini-2024-07-18 | 70.03 | 86.37 | 60.45 | 63.26 |
10 | mistral-small-3.1-24b-instruct | 63.73 | 75.47 | 47.07 | 68.64 |
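To make the KIE scoring concrete, here is a hypothetical example that reuses the `edit_distance_accuracy` sketch from the notes above; the field names, values, and per-field averaging are illustrative assumptions, not taken from the benchmark datasets or the official harness.

```python
# Hypothetical KIE example: field names and values are illustrative only.
# Reuses edit_distance_accuracy() from the sketch in the notes above.

ground_truth = {
    "invoice_number": "INV-2024-0042",
    "invoice_date": "2024-03-15",
    "total_amount": "1,250.00",
}

model_prediction = {
    "invoice_number": "INV-2024-0042",
    "invoice_date": "2024-03-16",   # one character off
    "total_amount": "1250.00",      # missing comma
}

per_field = {
    field: edit_distance_accuracy(model_prediction.get(field, ""), value)
    for field, value in ground_truth.items()
}
document_score = sum(per_field.values()) / len(per_field)  # simple average (assumption)

print(per_field)        # e.g. {'invoice_number': 1.0, 'invoice_date': 0.9, 'total_amount': 0.875}
print(document_score)
```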
Visual Question Answering (VQA) tests a model's ability to understand and answer questions about document content. This includes both text-based questions and questions that require understanding of the document's visual layout and structure.
Rank | Model | Avg | ChartQA | DocVQA |
---|---|---|---|---|
1 | o4-mini-2025-04-16 | 87.07 | 87.78 | 86.35 |
2 | gemini-2.5-flash-preview-04-17 | 85.16 | 84.82 | 85.51 |
3 | gemini-2.0-flash | 82.03 | 79.28 | 84.79 |
4 | gpt-4.1-2025-04-14 | 80.37 | 77.76 | 82.97 |
5 | qwen2.5-vl-72b-instruct | 80.10 | 76.20 | 84.00 |
5 | llama-4-maverick(400B-A17B) | 80.10 | 72.81 | 87.39 |
7 | gpt-4o-2024-08-06 | 79.08 | 75.27 | 82.89 |
8 | gpt-4o-2024-11-20 | 75.60 | 72.34 | 78.87 |
9 | gpt-4o-mini-2024-07-18 | 72.86 | 66.30 | 79.42 |
10 | mistral-small-3.1-24b-instruct | 71.50 | 66.16 | 76.83 |
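As a rough illustration of how a single VQA sample might be run against one of the hosted models above, the sketch below uses the OpenAI Python SDK with an inline document image; the prompt wording and file name are assumptions, not the exact prompt used by the leaderboard harness.

```python
# Rough sketch of sending one VQA sample (document image + question) to a hosted model.
# The prompt wording and file name are assumptions, not the leaderboard's exact setup.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("document_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

question = "What is the total amount due on this invoice?"

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"{question} Answer with the value only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

answer = response.choices[0].message.content.strip()
# The answer would then be scored against the reference answer with
# edit_distance_accuracy() as sketched earlier.
print(answer)
```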
Optical Character Recognition (OCR) measures a model's ability to accurately convert images of text into machine-readable text. This includes handling various fonts, layouts, and document conditions while maintaining high accuracy in text recognition.
Rank | Model | Avg | OCR-Handwriting | OCR-Handwriting-Rotated | OCR-Digital-Diacritics |
---|---|---|---|---|---|
1 | gemini-2.0-flash | 80.05 | 71.24 | 70.30 | 98.63 |
2 | gemini-2.5-flash-preview-04-17 | 78.90 | 69.40 | 68.46 | 98.85 |
3 | gpt-4.1-2025-04-14 | 75.64 | 65.40 | 62.64 | 98.87 |
4 | gpt-4o-2024-11-20 | 74.91 | 64.39 | 61.58 | 98.75 |
5 | gpt-4o-2024-08-06 | 74.56 | 64.48 | 60.79 | 98.39 |
6 | o4-mini-2025-04-16 | 72.82 | 61.64 | 58.60 | 98.23 |
7 | gpt-4o-mini-2024-07-18 | 72.43 | 61.04 | 59.13 | 97.13 |
8 | llama-4-maverick(400B-A17B) | 70.66 | 58.88 | 55.28 | 97.82 |
9 | qwen2.5-vl-72b-instruct | 69.61 | 56.43 | 57.45 | 94.95 |
10 | mistral-small-3.1-24b-instruct | 51.01 | 41.82 | 39.14 | 72.05 |
Document Classification evaluates how well models can categorize documents into predefined classes or types. This includes understanding document content, structure, and purpose to assign the correct category.
Rank | Model | Avg | Nanonets-Cls |
---|---|---|---|
1 | gpt-4.1-2025-04-14 | 99.27 | 99.27 |
2 | o4-mini-2025-04-16 | 99.14 | 99.14 |
3 | gemini-2.0-flash | 99.10 | 99.10 |
4 | gemini-2.5-flash-preview-04-17 | 99.05 | 99.05 |
5 | qwen2.5-vl-72b-instruct | 99.01 | 99.01 |
6 | llama-4-maverick(400B-A17B) | 98.84 | 98.84 |
7 | gpt-4o-mini-2024-07-18 | 98.41 | 98.41 |
8 | gpt-4o-2024-08-06 | 95.74 | 95.74 |
9 | mistral-small-3.1-24b-instruct | 91.86 | 91.86 |
10 | gpt-4o-2024-11-20 | 14.38 | 14.38 |
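Classification is scored with exact match. A minimal sketch is below; the case and whitespace normalization shown is an assumption, and the benchmark may compare labels more strictly.

```python
# Minimal sketch of exact-match accuracy for classification.
# Lower-casing and whitespace stripping are assumptions.

def exact_match_accuracy(predictions: list[str], labels: list[str]) -> float:
    assert len(predictions) == len(labels)
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, labels)
    )
    return hits / len(labels)

print(exact_match_accuracy(["invoice", "receipt "], ["invoice", "receipt"]))  # 1.0
```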
Long Document Processing assesses a model's ability to process and understand lengthy documents. This includes maintaining context across multiple pages, understanding document structure, and accurately retrieving information from large documents.
Rank | Model | Avg | Nanonets-LongDocBench |
---|---|---|---|
1 | gemini-2.5-flash-preview-04-17 | 69.08 | 69.08 |
2 | gpt-4o-2024-08-06 | 66.90 | 66.90 |
3 | o4-mini-2025-04-16 | 66.13 | 66.13 |
4 | gpt-4.1-2025-04-14 | 66.00 | 66.00 |
5 | gpt-4o-2024-11-20 | 63.95 | 63.95 |
6 | gemini-2.0-flash | 56.01 | 56.01 |
7 | gpt-4o-mini-2024-07-18 | 55.48 | 55.48 |
8 | qwen2.5-vl-72b-instruct | 37.47 | 37.47 |
9 | mistral-small-3.1-24b-instruct | 29.23 | 29.23 |
10 | llama-4-maverick(400B-A17B) | 27.74 | 27.74 |
Table Extraction evaluates how well models can identify, understand, and extract tabular data from documents. This includes preserving table structure, relationships between cells, and accurately extracting both numerical and textual content.
Rank | Model | Avg | nanonets_small_sparse_structured | nanonets_small_dense_structured | nanonets_small_sparse_unstructured | nanonets_long_dense_structured | nanonets_long_sparse_structured | nanonets_long_sparse_unstructured |
---|---|---|---|---|---|---|---|---|
1 | gemini-2.5-flash-preview-04-17 | 75.82 | 89.00 | 94.30 | 60.36 | 89.99 | 84.33 | 36.92 |
2 | gpt-4.1-2025-04-14 | 74.34 | 90.23 | 97.02 | 66.22 | 75.79 | 69.99 | 46.76 |
3 | llama-4-maverick(400B-A17B) | 74.15 | 89.94 | 97.90 | 52.57 | 92.50 | 86.24 | 25.74 |
4 | gemini-2.0-flash | 71.32 | 86.35 | 93.09 | 52.12 | 85.07 | 72.70 | 38.62 |
5 | o4-mini-2025-04-16 | 70.76 | 95.48 | 98.70 | 66.64 | 66.56 | 68.51 | 28.65 |
6 | gpt-4o-2024-08-06 | 64.30 | 76.14 | 94.11 | 61.00 | 65.04 | 54.11 | 35.38 |
7 | mistral-small-3.1-24b-instruct | 61.64 | 72.19 | 89.96 | 58.10 | 64.95 | 57.52 | 27.13 |
8 | gpt-4o-2024-11-20 | 60.74 | 76.46 | 92.10 | 61.62 | 53.02 | 49.83 | 31.39 |
9 | gpt-4o-mini-2024-07-18 | 50.47 | 57.31 | 82.50 | 51.34 | 53.37 | 32.60 | 25.67 |
10 | qwen2.5-vl-72b-instruct | 48.58 | 63.03 | 80.77 | 59.60 | 28.45 | 33.41 | 26.21 |
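Table extraction is scored with GriTS (Grid Table Similarity), which aligns predicted and reference grids before comparing them. Reproducing GriTS is beyond a short snippet, so the sketch below is only a much simpler position-wise cell F1; it can serve as a rough sanity check but is not a substitute for the official metric.

```python
# Simplified, position-wise cell F1 for two tables given as lists of rows.
# This is NOT GriTS: GriTS aligns rows/columns before comparing, whereas this
# proxy only compares cells that sit at the same (row, column) index.

def cell_f1(pred: list[list[str]], gold: list[list[str]]) -> float:
    matches = 0
    for i, gold_row in enumerate(gold):
        if i >= len(pred):
            break
        for j, gold_cell in enumerate(gold_row):
            if j < len(pred[i]) and pred[i][j].strip() == gold_cell.strip():
                matches += 1
    n_pred = sum(len(row) for row in pred)
    n_gold = sum(len(row) for row in gold)
    if not n_pred or not n_gold or not matches:
        return 0.0
    precision, recall = matches / n_pred, matches / n_gold
    return 2 * precision * recall / (precision + recall)

gold = [["Item", "Qty", "Price"], ["Widget", "2", "9.99"]]
pred = [["Item", "Qty", "Price"], ["Widget", "2", "9.90"]]
print(cell_f1(pred, gold))  # 5 of 6 cells match -> ~0.83
```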
Confidence Score evaluation measures how well models can assess and provide reliable confidence scores for their predictions. This is crucial for building robust automated document processing systems, as it helps determine when human intervention is needed and ensures reliable automation.
Rank | Model | Avg | Datasets |
---|---|---|---|
- | - | - | - |
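Results for this task are not yet populated. The motivation can still be illustrated with a small routing sketch: extractions whose confidence falls below a threshold are sent for human review instead of being auto-accepted. The threshold and field names below are illustrative assumptions.

```python
# Illustrative confidence-based routing: auto-accept high-confidence extractions,
# queue low-confidence ones for human review. Threshold and fields are made up.

CONFIDENCE_THRESHOLD = 0.90

predictions = [
    {"field": "invoice_number", "value": "INV-2024-0042", "confidence": 0.98},
    {"field": "total_amount",   "value": "1,250.00",      "confidence": 0.62},
]

auto_accepted = [p for p in predictions if p["confidence"] >= CONFIDENCE_THRESHOLD]
needs_review  = [p for p in predictions if p["confidence"] < CONFIDENCE_THRESHOLD]

print(f"auto-accepted: {[p['field'] for p in auto_accepted]}")
print(f"needs review:  {[p['field'] for p in needs_review]}")
```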