IDP Core Bench v1.0
Built by Nanonets. ~2,000 invoices, receipts, forms, and handwritten documents. Four tasks: key information extraction from structured documents (KIE), OCR on printed and handwritten text, table cell extraction and structure parsing (Table), and answering questions about document content (VQA).
Overall Score = average of the KIE, OCR, Table, and VQA scores.
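As stated above, the overall score is the unweighted mean of the four task scores. A minimal sketch in Python (the function name and the one-decimal rounding are illustrative assumptions, not part of the published benchmark code):

```python
def overall_score(kie: float, ocr: float, table: float, vqa: float) -> float:
    """Unweighted mean of the four task scores, rounded to one decimal.

    Illustrative helper only; the leaderboard's own aggregation
    pipeline is not published here.
    """
    return round((kie + ocr + table + vqa) / 4, 1)
```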
Rankings
| # | Model | Vendor | Overall | KIE | OCR | Table | VQA |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 89.6 | 86.8 | 82.8 | 96.4 | 85.0 |
| 2 | GPT-5.4 | OpenAI | 84.4 | 85.7 | 69.1 | 94.8 | 78.2 |
| 3 | Gemini-3-Pro | Google | 81.8 | 85.7 | 81.8 | 95.8 | 64.1 |
| 4 | Claude Sonnet 4.6 | Anthropic | 81.2 | 89.5 | 73.7 | 96.3 | 65.2 |
| 5 | Claude Opus 4.6 | Anthropic | 81.1 | 89.8 | 74.0 | 96.0 | 64.4 |
| 6 | Gemini-3-Flash | Google | 80.5 | 91.1 | 81.7 | 85.6 | 63.5 |
| 7 | GPT-5.2 | OpenAI | 77.4 | 87.5 | 72.8 | 86.0 | 63.5 |
| 8 | GPT-4.1 | OpenAI | 74.7 | 87.1 | 75.6 | 73.1 | 63.0 |
| 9 | Nanonets OCR2+ | Nanonets | 73.8 | 86.4 | 64.0 | 79.7 | 65.1 |
| 10 | GPT-5-Mini | OpenAI | 73.3 | 85.7 | 73.0 | 69.5 | 65.0 |
| 11 | Claude Haiku 4.5 | Anthropic | 72.9 | 85.6 | 65.0 | 81.7 | 59.2 |
| 12 | Ministral-8B | Mistral AI | 71.7 | 85.7 | 67.8 | 75.9 | 57.4 |
| 13 | GPT-5-Nano | OpenAI | 65.8 | 84.7 | 69.6 | 45.3 | 63.5 |
| 14 | Pixtral-12B | Mistral AI | 59.0 | 76.2 | 54.8 | 47.5 | 57.5 |
| 15 | Llama-3.2-Vision-11B | Meta | 58.6 | 76.1 | 65.8 | 41.1 | 51.5 |
| 16 | GLM-OCR | Zhipu AI | 54.9 | 83.5 | 66.7 | 24.5 | 44.9 |
| 17 | Gemma-3-12B-IT | Google | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Metrics
KIE (higher is better): Key Information Extraction accuracy on invoices, receipts, and forms, using exact-match and fuzzy-match metrics.
OCR (higher is better): OCR accuracy on mixed handwritten and printed text documents.
Table (higher is better): Table understanding, including cell-level extraction and structural parsing.
VQA (higher is better): Visual Question Answering requiring reasoning over document layout and content.