IDP Core Bench

v1.0

Built by Nanonets. ~2,000 invoices, receipts, forms, and handwritten docs. Four tasks: field extraction from structured documents (KIE), OCR on printed and handwritten text, table cell extraction and structure parsing, and answering questions about document content (VQA). Overall = mean of all four.

Models Evaluated: 17
Dataset Size: ~2,000 documents
Metrics: 4
Source: View on GitHub

Overall Score = Average of KIE, OCR, Table, and VQA scores
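The page defines the overall score as the unweighted mean of the four task scores. A minimal sketch of that aggregation (function name and example values are illustrative, not from the benchmark's code):

```python
def overall_score(kie: float, ocr: float, table: float, vqa: float) -> float:
    # Unweighted mean of the four task scores, per the page's definition.
    return (kie + ocr + table + vqa) / 4

# Example with made-up scores:
print(overall_score(90.0, 80.0, 70.0, 60.0))  # 75.0
```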

Rankings

| # | Model | Provider | Overall | KIE | OCR | Table | VQA |
|---|-------|----------|---------|-----|-----|-------|-----|
| 1 | Gemini 3.1 Pro | Google | 89.6 | 86.8 | 82.8 | 96.4 | 85.0 |
| 2 | GPT-5.4 | OpenAI | 84.4 | 85.7 | 69.1 | 94.8 | 78.2 |
| 3 | Gemini-3-Pro | Google | 81.8 | 85.7 | 81.8 | 95.8 | 64.1 |
| 4 | Claude Sonnet 4.6 | Anthropic | 81.2 | 89.5 | 73.7 | 96.3 | 65.2 |
| 5 | Claude Opus 4.6 | Anthropic | 81.1 | 89.8 | 74.0 | 96.0 | 64.4 |
| 6 | Gemini-3-Flash | Google | 80.5 | 91.1 | 81.7 | 85.6 | 63.5 |
| 7 | GPT-5.2 | OpenAI | 77.4 | 87.5 | 72.8 | 86.0 | 63.5 |
| 8 | GPT-4.1 | OpenAI | 74.7 | 87.1 | 75.6 | 73.1 | 63.0 |
| 9 | Nanonets OCR2+ | Nanonets | 73.8 | 86.4 | 64.0 | 79.7 | 65.1 |
| 10 | GPT-5-Mini | OpenAI | 73.3 | 85.7 | 73.0 | 69.5 | 65.0 |
| 11 | Claude Haiku 4.5 | Anthropic | 72.9 | 85.6 | 65.0 | 81.7 | 59.2 |
| 12 | Ministral-8B | Mistral AI | 71.7 | 85.7 | 67.8 | 75.9 | 57.4 |
| 13 | GPT-5-Nano | OpenAI | 65.8 | 84.7 | 69.6 | 45.3 | 63.5 |
| 14 | Pixtral-12B | Mistral AI | 59.0 | 76.2 | 54.8 | 47.5 | 57.5 |
| 15 | Llama-3.2-Vision-11B | Meta | 58.6 | 76.1 | 65.8 | 41.1 | 51.5 |
| 16 | GLM-OCR | Zhipu AI | 54.9 | 83.5 | 66.7 | 24.5 | 44.9 |
| 17 | Gemma-3-12B-IT | Google | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Metrics

KIE (higher is better)

Key Information Extraction accuracy on invoices, receipts, and forms using exact-match and fuzzy-match metrics.
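The page does not publish its scoring code; as an illustration of exact-match plus fuzzy-match field scoring, here is a minimal sketch using Python's `difflib` (the threshold and normalization are assumptions, not the benchmark's actual parameters):

```python
from difflib import SequenceMatcher

def field_match(pred: str, gold: str, fuzzy_threshold: float = 0.9) -> bool:
    # Exact match after light normalization, falling back to a
    # similarity-ratio check above a hypothetical threshold.
    p, g = pred.strip().lower(), gold.strip().lower()
    if p == g:
        return True
    return SequenceMatcher(None, p, g).ratio() >= fuzzy_threshold

print(field_match("Acme Corp.", "Acme Corp"))  # True (near-exact)
print(field_match("foo", "bar"))               # False
```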

OCR (higher is better)

OCR accuracy on mixed handwritten and printed text documents.
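OCR accuracy is commonly derived from character error rate (CER), i.e. edit distance over the reference length. The benchmark's exact formula isn't stated on this page; a standard single-row Levenshtein CER looks like this:

```python
def cer(pred: str, gold: str) -> float:
    # Character error rate: Levenshtein edit distance divided by the
    # reference length. Accuracy could then be reported as 1 - CER.
    m, n = len(pred), len(gold)
    dp = list(range(n + 1))  # edit distances for the empty prediction prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # delete from pred
                        dp[j - 1] + 1,                   # insert into pred
                        prev + (pred[i - 1] != gold[j - 1]))  # substitute
            prev = cur
    return dp[n] / max(n, 1)

print(cer("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```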

Table (higher is better)

Table understanding including cell-level extraction and structural parsing.
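One common way to score cell-level extraction is F1 over (row, column, text) triples. This is a hypothetical scheme for illustration, not necessarily the benchmark's metric:

```python
def cell_f1(pred_cells: set, gold_cells: set) -> float:
    # F1 over (row, col, text) triples: a cell counts as correct only if
    # its position and content both match the ground truth.
    tp = len(pred_cells & gold_cells)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_cells)
    recall = tp / len(gold_cells)
    return 2 * precision * recall / (precision + recall)

pred = {(0, 0, "Total"), (0, 1, "42.00")}
gold = {(0, 0, "Total"), (0, 1, "42.50")}
print(cell_f1(pred, gold))  # 0.5: one of two cells matches exactly
```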

VQA (higher is better)

Visual Question Answering requiring reasoning over document layout and content.
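Document VQA benchmarks often score answers with ANLS (average normalized Levenshtein similarity), which tolerates small string differences but zeroes out scores below a threshold. Whether this benchmark uses ANLS is not stated; the sketch below approximates the idea with `difflib` similarity:

```python
from difflib import SequenceMatcher

def anls_style(pred: str, gold: str, tau: float = 0.5) -> float:
    # ANLS-style score: string similarity in [0, 1], clipped to 0 below
    # threshold tau. (Approximated with difflib's ratio; a true ANLS
    # implementation uses normalized Levenshtein distance.)
    s = SequenceMatcher(None, pred.strip().lower(), gold.strip().lower()).ratio()
    return s if s >= tau else 0.0

print(anls_style("Paris", "paris"))  # 1.0 after normalization
print(anls_style("x", "paris"))      # 0.0, below threshold
```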