A unified leaderboard for OCR, KIE, VQA, classification, long document processing, table extraction, and confidence score evaluation
The Intelligent Document Processing (IDP) Leaderboard provides a comprehensive evaluation framework for assessing the capabilities of various AI models in document understanding and processing tasks. This benchmark covers seven critical aspects of document intelligence: Key Information Extraction (KIE), Visual Question Answering (VQA), Optical Character Recognition (OCR), document classification, long document processing, table extraction, and confidence score evaluation.
The leaderboard aims to provide researchers and practitioners with a standardized way to compare model performance across these diverse document processing tasks. Each task is evaluated using carefully curated datasets that represent real-world document processing challenges.
Rank | Model | Cost | AVG | KIE | VQA | OCR | Classification | LongDocBench | Table |
---|---|---|---|---|---|---|---|---|---|
1 | gemini-2.5-flash-preview-04-17 | 0.133 | 81.00 | 77.99 | 85.16 | 78.90 | 99.05 | 69.08 | 75.82 |
2 | o4-mini-2025-04-16 | 2.595 | 78.56 | 75.43 | 87.07 | 72.82 | 99.14 | 66.13 | 70.76 |
3 | gpt-4.1-2025-04-14 | 1.583 | 78.05 | 72.68 | 80.37 | 75.64 | 99.27 | 66.00 | 74.34 |
4 | gemini-2.0-flash | 0.022 | 77.62 | 77.22 | 82.03 | 80.05 | 99.10 | 56.01 | 71.32 |
5 | gpt-4o-2024-08-06 | 1.979 | 75.40 | 71.83 | 79.08 | 74.56 | 95.74 | 66.90 | 64.30 |
6 | llama-4-maverick(400B-A17B) | 0.058 | 70.80 | 73.30 | 80.10 | 70.66 | 98.84 | 27.74 | 74.15 |
7 | gpt-4o-mini-2024-07-18 | 2.990 | 69.95 | 70.03 | 72.86 | 72.43 | 98.41 | 55.48 | 50.47 |
8 | qwen2.5-vl-72b-instruct | 0.242 | 68.48 | 76.11 | 80.10 | 69.61 | 99.01 | 37.47 | 48.58 |
9 | mistral-small-3.1-24b-instruct | 0.020 | 61.50 | 63.73 | 71.50 | 51.01 | 91.86 | 29.23 | 61.64 |
10 | gpt-4o-2024-11-20 | 1.868 | 60.08 | 70.91 | 75.6 | 74.91 | 14.38 | 63.95 | 60.74 |
1. Cost represents the average cost in cents per request for each model.
2. The score for each task in the leaderboard is the average across all the datasets for the corresponding task.
3. We compute edit-distance accuracy for all tasks and datasets except classification, where we use exact match, and table extraction, where we use GriTS. Please see our paper for more details.
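For reference, below is a minimal sketch of how an edit-distance accuracy of this kind can be computed; the exact normalization used by the benchmark may differ, so treat it as illustrative rather than the official scoring code.

```python
# Minimal sketch of edit-distance accuracy: 1 - (Levenshtein distance / max length).
# This is a common formulation; the benchmark's exact normalization may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def edit_distance_accuracy(prediction: str, ground_truth: str) -> float:
    """Accuracy in [0, 1]; 1.0 means an exact match."""
    if not prediction and not ground_truth:
        return 1.0
    dist = levenshtein(prediction, ground_truth)
    return 1.0 - dist / max(len(prediction), len(ground_truth))

print(edit_distance_accuracy("Invoice #1234", "Invoice #1284"))  # ~0.92
```

Dataset-level scores of this kind are then averaged per task, as described in note 2 above.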
Key Information Extraction (KIE) evaluates a model's ability to identify and extract specific information from documents, such as names, dates, amounts, and other structured data. This task measures how accurately models can locate and understand key entities within documents.
Rank | Model | Avg | Nanonets-KIE | Docile | Handwritten-Forms |
---|---|---|---|---|---|
1 | gemini-2.5-flash-preview-04-17 | 77.99 | 91.29 | 63.35 | 79.34 |
2 | gemini-2.0-flash | 77.22 | 88.31 | 65.06 | 78.28 |
3 | qwen2.5-vl-72b-instruct | 76.11 | 90.52 | 58.37 | 79.45 |
4 | o4-mini-2025-04-16 | 75.43 | 86.91 | 59.52 | 79.85 |
5 | llama-4-maverick(400B-A17B) | 73.30 | 85.78 | 61.70 | 72.43 |
6 | gpt-4.1-2025-04-14 | 72.68 | 87.85 | 61.20 | 68.98 |
7 | gpt-4o-2024-08-06 | 71.83 | 88.63 | 56.37 | 70.48 |
8 | gpt-4o-2024-11-20 | 70.91 | 88.03 | 56.56 | 68.15 |
9 | gpt-4o-mini-2024-07-18 | 70.03 | 86.37 | 60.45 | 63.26 |
10 | mistral-small-3.1-24b-instruct | 63.73 | 75.47 | 47.07 | 68.64 |
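To make the KIE scoring concrete, here is a hypothetical example that reuses the `edit_distance_accuracy` sketch from the notes above; the field names, values, and per-field averaging are illustrative assumptions, not taken from the benchmark datasets or the official harness.

```python
# Hypothetical KIE example: field names and values are illustrative only.
# Reuses edit_distance_accuracy() from the sketch in the notes above.

ground_truth = {
    "invoice_number": "INV-2024-0042",
    "invoice_date": "2024-03-15",
    "total_amount": "1,250.00",
}

model_prediction = {
    "invoice_number": "INV-2024-0042",
    "invoice_date": "2024-03-16",   # one character off
    "total_amount": "1250.00",      # missing comma
}

per_field = {
    field: edit_distance_accuracy(model_prediction.get(field, ""), value)
    for field, value in ground_truth.items()
}
document_score = sum(per_field.values()) / len(per_field)  # simple average (assumption)

print(per_field)        # e.g. {'invoice_number': 1.0, 'invoice_date': 0.9, 'total_amount': 0.875}
print(document_score)
```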
Visual Question Answering (VQA) tests a model's ability to understand and answer questions about document content. This includes both text-based questions and questions that require understanding of the document's visual layout and structure.
Rank | Model | Avg | ChartQA | DocVQA |
---|---|---|---|---|
1 | o4-mini-2025-04-16 | 87.07 | 87.78 | 86.35 |
2 | gemini-2.5-flash-preview-04-17 | 85.16 | 84.82 | 85.51 |
3 | gemini-2.0-flash | 82.03 | 79.28 | 84.79 |
4 | gpt-4.1-2025-04-14 | 80.37 | 77.76 | 82.97 |
5 | qwen2.5-vl-72b-instruct | 80.10 | 76.20 | 84.00 |
5 | llama-4-maverick(400B-A17B) | 80.10 | 72.81 | 87.39 |
7 | gpt-4o-2024-08-06 | 79.08 | 75.27 | 82.89 |
8 | gpt-4o-2024-11-20 | 75.60 | 72.34 | 78.87 |
9 | gpt-4o-mini-2024-07-18 | 72.86 | 66.30 | 79.42 |
10 | mistral-small-3.1-24b-instruct | 71.50 | 66.16 | 76.83 |
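As a rough illustration of how a single VQA sample might be run against one of the hosted models above, the sketch below uses the OpenAI Python SDK with an inline document image; the prompt wording and file name are assumptions, not the exact prompt used by the leaderboard harness.

```python
# Rough sketch of sending one VQA sample (document image + question) to a hosted model.
# The prompt wording and file name are assumptions, not the leaderboard's exact setup.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("document_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

question = "What is the total amount due on this invoice?"

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"{question} Answer with the value only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

answer = response.choices[0].message.content.strip()
# The answer would then be scored against the reference answer with
# edit_distance_accuracy() as sketched earlier.
print(answer)
```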
Optical Character Recognition (OCR) measures a model's ability to accurately convert images of text into machine-readable text. This includes handling various fonts, layouts, and document conditions while maintaining high accuracy in text recognition.
Rank | Model | Avg | OCR-Handwriting | OCR-Handwriting-Rotated | OCR-Digital-Diacritics |
---|---|---|---|---|---|
1 | gemini-2.0-flash | 80.05 | 71.24 | 70.30 | 98.63 |
2 | gemini-2.5-flash-preview-04-17 | 78.90 | 69.40 | 68.46 | 98.85 |
3 | gpt-4.1-2025-04-14 | 75.64 | 65.40 | 62.64 | 98.87 |
4 | gpt-4o-2024-11-20 | 74.91 | 64.39 | 61.58 | 98.75 |
5 | gpt-4o-2024-08-06 | 74.56 | 64.48 | 60.79 | 98.39 |
6 | o4-mini-2025-04-16 | 72.82 | 61.64 | 58.60 | 98.23 |
7 | gpt-4o-mini-2024-07-18 | 72.43 | 61.04 | 59.13 | 97.13 |
8 | llama-4-maverick(400B-A17B) | 70.66 | 58.88 | 55.28 | 97.82 |
9 | qwen2.5-vl-72b-instruct | 69.61 | 56.43 | 57.45 | 94.95 |
10 | mistral-small-3.1-24b-instruct | 51.01 | 41.82 | 39.14 | 72.05 |
Document Classification evaluates how well models can categorize documents into predefined classes or types. This includes understanding document content, structure, and purpose to assign the correct category.
Rank | Model | Avg | Nanonets-Cls |
---|---|---|---|
1 | gpt-4.1-2025-04-14 | 99.27 | 99.27 |
2 | o4-mini-2025-04-16 | 99.14 | 99.14 |
3 | gemini-2.0-flash | 99.10 | 99.10 |
4 | gemini-2.5-flash-preview-04-17 | 99.05 | 99.05 |
5 | qwen2.5-vl-72b-instruct | 99.01 | 99.01 |
6 | llama-4-maverick(400B-A17B) | 98.84 | 98.84 |
7 | gpt-4o-mini-2024-07-18 | 98.41 | 98.41 |
8 | gpt-4o-2024-08-06 | 95.74 | 95.74 |
9 | mistral-small-3.1-24b-instruct | 91.86 | 91.86 |
10 | gpt-4o-2024-11-20 | 14.38 | 14.38 |
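Classification is scored with exact match. A minimal sketch is below; the case and whitespace normalization shown is an assumption, and the benchmark may compare labels more strictly.

```python
# Minimal sketch of exact-match accuracy for classification.
# Lower-casing and whitespace stripping are assumptions.

def exact_match_accuracy(predictions: list[str], labels: list[str]) -> float:
    assert len(predictions) == len(labels)
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, labels)
    )
    return hits / len(labels)

print(exact_match_accuracy(["invoice", "receipt "], ["invoice", "receipt"]))  # 1.0
```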
Long Document Processing assesses a model's ability to process and understand lengthy documents. This includes maintaining context across multiple pages, understanding document structure, and accurately retrieving information from large documents.
Rank | Model | Avg | Nanonets-LongDocBench |
---|---|---|---|
1 | gemini-2.5-flash-preview-04-17 | 69.08 | 69.08 |
2 | gpt-4o-2024-08-06 | 66.90 | 66.90 |
3 | o4-mini-2025-04-16 | 66.13 | 66.13 |
4 | gpt-4.1-2025-04-14 | 66.00 | 66.00 |
5 | gpt-4o-2024-11-20 | 63.95 | 63.95 |
6 | gemini-2.0-flash | 56.01 | 56.01 |
7 | gpt-4o-mini-2024-07-18 | 55.48 | 55.48 |
8 | qwen2.5-vl-72b-instruct | 37.47 | 37.47 |
9 | mistral-small-3.1-24b-instruct | 29.23 | 29.23 |
10 | llama-4-maverick(400B-A17B) | 27.74 | 27.74 |
Table Extraction evaluates how well models can identify, understand, and extract tabular data from documents. This includes preserving table structure, relationships between cells, and accurately extracting both numerical and textual content.
Rank | Model | Avg | nanonets_small_sparse_structured | nanonets_small_dense_structured | nanonets_small_sparse_unstructured | nanonets_long_dense_structured | nanonets_long_sparse_structured | nanonets_long_sparse_unstructured |
---|---|---|---|---|---|---|---|---|
1 | gemini-2.5-flash-preview-04-17 | 75.82 | 89.00 | 94.30 | 60.36 | 89.99 | 84.33 | 36.92 |
2 | gpt-4.1-2025-04-14 | 74.34 | 90.23 | 97.02 | 66.22 | 75.79 | 69.99 | 46.76 |
3 | llama-4-maverick(400B-A17B) | 74.15 | 89.94 | 97.90 | 52.57 | 92.50 | 86.24 | 25.74 |
4 | gemini-2.0-flash | 71.32 | 86.35 | 93.09 | 52.12 | 85.07 | 72.70 | 38.62 |
5 | o4-mini-2025-04-16 | 70.76 | 95.48 | 98.70 | 66.64 | 66.56 | 68.51 | 28.65 |
6 | gpt-4o-2024-08-06 | 64.30 | 76.14 | 94.11 | 61.00 | 65.04 | 54.11 | 35.38 |
7 | mistral-small-3.1-24b-instruct | 61.64 | 72.19 | 89.96 | 58.10 | 64.95 | 57.52 | 27.13 |
8 | gpt-4o-2024-11-20 | 60.74 | 76.46 | 92.10 | 61.62 | 53.02 | 49.83 | 31.39 |
9 | gpt-4o-mini-2024-07-18 | 50.47 | 57.31 | 82.50 | 51.34 | 53.37 | 32.60 | 25.67 |
10 | qwen2.5-vl-72b-instruct | 48.58 | 63.03 | 80.77 | 59.60 | 28.45 | 33.41 | 26.21 |
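Table extraction is scored with GriTS (Grid Table Similarity), which aligns predicted and reference grids before comparing them. Reproducing GriTS is beyond a short snippet, so the sketch below is only a much simpler position-wise cell F1; it can serve as a rough sanity check but is not a substitute for the official metric.

```python
# Simplified, position-wise cell F1 for two tables given as lists of rows.
# This is NOT GriTS: GriTS aligns rows/columns before comparing, whereas this
# proxy only compares cells that sit at the same (row, column) index.

def cell_f1(pred: list[list[str]], gold: list[list[str]]) -> float:
    matches = 0
    for i, gold_row in enumerate(gold):
        if i >= len(pred):
            break
        for j, gold_cell in enumerate(gold_row):
            if j < len(pred[i]) and pred[i][j].strip() == gold_cell.strip():
                matches += 1
    n_pred = sum(len(row) for row in pred)
    n_gold = sum(len(row) for row in gold)
    if not n_pred or not n_gold or not matches:
        return 0.0
    precision, recall = matches / n_pred, matches / n_gold
    return 2 * precision * recall / (precision + recall)

gold = [["Item", "Qty", "Price"], ["Widget", "2", "9.99"]]
pred = [["Item", "Qty", "Price"], ["Widget", "2", "9.90"]]
print(cell_f1(pred, gold))  # 5 of 6 cells match -> ~0.83
```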
Confidence Score evaluation measures how well models can assess and provide reliable confidence scores for their predictions. This is crucial for building robust automated document processing systems, as it helps determine when human intervention is needed and ensures reliable automation.
Rank | Model | Avg | Datasets |
---|---|---|---|
- | - | - | - |
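Results for this task are not yet populated. The motivation can still be illustrated with a small routing sketch: extractions whose confidence falls below a threshold are sent for human review instead of being auto-accepted. The threshold and field names below are illustrative assumptions.

```python
# Illustrative confidence-based routing: auto-accept high-confidence extractions,
# queue low-confidence ones for human review. Threshold and fields are made up.

CONFIDENCE_THRESHOLD = 0.90

predictions = [
    {"field": "invoice_number", "value": "INV-2024-0042", "confidence": 0.98},
    {"field": "total_amount",   "value": "1,250.00",      "confidence": 0.62},
]

auto_accepted = [p for p in predictions if p["confidence"] >= CONFIDENCE_THRESHOLD]
needs_review  = [p for p in predictions if p["confidence"] < CONFIDENCE_THRESHOLD]

print(f"auto-accepted: {[p['field'] for p in auto_accepted]}")
print(f"needs review:  {[p['field'] for p in needs_review]}")
```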