IDP Core Bench

v1.0

Built by Nanonets. ~2,000 invoices, receipts, forms, and handwritten docs. Four tasks: field extraction from structured documents (KIE), OCR on printed and handwritten text, table cell extraction and structure parsing, and answering questions about document content (VQA). Overall = mean of all four.

Models Evaluated: 17
Dataset Size: ~2,000 documents
Metrics: 4
Source: View on GitHub

Overall Score = Average of KIE, OCR, Table, and VQA scores
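The page defines the overall score as the unweighted mean of the four task scores. A minimal sketch of that aggregation (function name and example values are illustrative, not from the benchmark's code):

```python
def overall_score(kie: float, ocr: float, table: float, vqa: float) -> float:
    # Unweighted mean of the four task scores, per the page's definition.
    return (kie + ocr + table + vqa) / 4

# Example with made-up scores:
print(overall_score(90.0, 80.0, 70.0, 60.0))  # 75.0
```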

Rankings

| # | Model | Provider | Overall | KIE | OCR | Table | VQA |
|---|-------|----------|---------|-----|-----|-------|-----|
| 1 | Gemini 3.1 Pro | Google | 89.6 | 86.8 | 82.8 | 96.4 | 85.0 |
| 2 | GPT-5.4 | OpenAI | 84.4 | 85.7 | 69.1 | 94.8 | 78.2 |
| 3 | Gemini-3-Pro | Google | 81.8 | 85.7 | 81.8 | 95.8 | 64.1 |
| 4 | Claude Sonnet 4.6 | Anthropic | 81.2 | 89.5 | 73.7 | 96.3 | 65.2 |
| 5 | Claude Opus 4.6 | Anthropic | 81.1 | 89.8 | 74.0 | 96.0 | 64.4 |
| 6 | Gemini-3-Flash | Google | 80.5 | 91.1 | 81.7 | 85.6 | 63.5 |
| 7 | GPT-5.2 | OpenAI | 77.4 | 87.5 | 72.8 | 86.0 | 63.5 |
| 8 | GPT-4.1 | OpenAI | 74.7 | 87.1 | 75.6 | 73.1 | 63.0 |
| 9 | Nanonets OCR2+ | Nanonets | 73.8 | 86.4 | 64.0 | 79.7 | 65.1 |
| 10 | GPT-5-Mini | OpenAI | 73.3 | 85.7 | 73.0 | 69.5 | 65.0 |
| 11 | Claude Haiku 4.5 | Anthropic | 72.9 | 85.6 | 65.0 | 81.7 | 59.2 |
| 12 | Ministral-8B | Mistral AI | 71.7 | 85.7 | 67.8 | 75.9 | 57.4 |
| 13 | GPT-5-Nano | OpenAI | 65.8 | 84.7 | 69.6 | 45.3 | 63.5 |
| 14 | Pixtral-12B | Mistral AI | 59.0 | 76.2 | 54.8 | 47.5 | 57.5 |
| 15 | Llama-3.2-Vision-11B | Meta | 58.6 | 76.1 | 65.8 | 41.1 | 51.5 |
| 16 | GLM-OCR | Zhipu AI | 54.9 | 83.5 | 66.7 | 24.5 | 44.9 |
| 17 | Gemma-3-12B-IT | Google | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Metrics

KIE (higher is better)

Key Information Extraction accuracy on invoices, receipts, and forms using exact-match and fuzzy-match metrics.
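The page does not publish its scoring code; as an illustration of exact-match plus fuzzy-match field scoring, here is a minimal sketch using Python's `difflib` (the threshold and normalization are assumptions, not the benchmark's actual parameters):

```python
from difflib import SequenceMatcher

def field_match(pred: str, gold: str, fuzzy_threshold: float = 0.9) -> bool:
    # Exact match after light normalization, falling back to a
    # similarity-ratio check above a hypothetical threshold.
    p, g = pred.strip().lower(), gold.strip().lower()
    if p == g:
        return True
    return SequenceMatcher(None, p, g).ratio() >= fuzzy_threshold

print(field_match("Acme Corp.", "Acme Corp"))  # True (near-exact)
print(field_match("foo", "bar"))               # False
```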

OCR (higher is better)

OCR accuracy on mixed handwritten and printed text documents.
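OCR accuracy is commonly derived from character error rate (CER), i.e. edit distance over the reference length. The benchmark's exact formula isn't stated on this page; a standard single-row Levenshtein CER looks like this:

```python
def cer(pred: str, gold: str) -> float:
    # Character error rate: Levenshtein edit distance divided by the
    # reference length. Accuracy could then be reported as 1 - CER.
    m, n = len(pred), len(gold)
    dp = list(range(n + 1))  # edit distances for the empty prediction prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # delete from pred
                        dp[j - 1] + 1,                   # insert into pred
                        prev + (pred[i - 1] != gold[j - 1]))  # substitute
            prev = cur
    return dp[n] / max(n, 1)

print(cer("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```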

Table (higher is better)

Table understanding including cell-level extraction and structural parsing.
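One common way to score cell-level extraction is F1 over (row, column, text) triples. This is a hypothetical scheme for illustration, not necessarily the benchmark's metric:

```python
def cell_f1(pred_cells: set, gold_cells: set) -> float:
    # F1 over (row, col, text) triples: a cell counts as correct only if
    # its position and content both match the ground truth.
    tp = len(pred_cells & gold_cells)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_cells)
    recall = tp / len(gold_cells)
    return 2 * precision * recall / (precision + recall)

pred = {(0, 0, "Total"), (0, 1, "42.00")}
gold = {(0, 0, "Total"), (0, 1, "42.50")}
print(cell_f1(pred, gold))  # 0.5: one of two cells matches exactly
```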

VQA (higher is better)

Visual Question Answering requiring reasoning over document layout and content.
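Document VQA benchmarks often score answers with ANLS (average normalized Levenshtein similarity), which tolerates small string differences but zeroes out scores below a threshold. Whether this benchmark uses ANLS is not stated; the sketch below approximates the idea with `difflib` similarity:

```python
from difflib import SequenceMatcher

def anls_style(pred: str, gold: str, tau: float = 0.5) -> float:
    # ANLS-style score: string similarity in [0, 1], clipped to 0 below
    # threshold tau. (Approximated with difflib's ratio; a true ANLS
    # implementation uses normalized Levenshtein distance.)
    s = SequenceMatcher(None, pred.strip().lower(), gold.strip().lower()).ratio()
    return s if s >= tau else 0.0

print(anls_style("Paris", "paris"))  # 1.0 after normalization
print(anls_style("x", "paris"))      # 0.0, below threshold
```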