Introducing the Intelligent Document Processing (IDP) Leaderboard

A unified leaderboard for OCR, KIE, classification, QA, table extraction, and confidence score evaluation

This work is sponsored by Nanonets.

Today, we're publishing benchmark results for 10 models evaluated across 16 datasets, comprising a total of 9,229 documents and spanning 6 distinct tasks. The evaluation uses a mix of publicly available, synthetic, and newly annotated datasets.

A diverse sample of documents used in the benchmark (invoices, forms, receipts, charts, handwritten pages, and tables), highlighting the variety and complexity of real-world IDP tasks.

The IDP Leaderboard represents the most comprehensive and fine-grained benchmark to date for evaluating the document understanding capabilities of Vision-Language Models (VLMs). It encompasses distinct tasks designed to rigorously assess various aspects of VLM performance in the IDP domain.

Benchmark Tasks

Each task is evaluated using multiple datasets, capturing different aspects of the problem space. For example, the OCR task includes separate datasets for handwritten and digital text. The score for each task is calculated as the average of the scores across all datasets in that task. The overall model score is then the average of the task scores across all tasks.

Here, \( |D_i| \) denotes the number of datasets in task \( i \), \( S_{ij} \) is the model's average score on the \( j^\text{th} \) dataset within task \( i \), and \( T \) is the number of tasks.

$$ \text{TaskScore}_i = \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} S_{ij} $$
$$ \text{LeaderboardScore} = \frac{1}{T} \sum_{i=1}^{T} \text{TaskScore}_i $$
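To make the aggregation concrete, below is a minimal Python sketch of the two formulas above; the task names, dataset names, and scores are illustrative placeholders rather than actual leaderboard numbers.

    from statistics import mean

    # Illustrative per-dataset scores for one model (not real benchmark results).
    # Keys are tasks; values map dataset name -> score on that dataset.
    scores = {
        "OCR": {"handwritten": 88.0, "digital": 95.0},
        "Table Extraction": {"structured": 70.0, "unstructured": 55.0},
    }

    # TaskScore_i: average of the model's scores over the datasets of task i.
    task_scores = {task: mean(ds.values()) for task, ds in scores.items()}

    # LeaderboardScore: unweighted average of the task scores.
    leaderboard_score = mean(task_scores.values())

    print(task_scores)        # {'OCR': 91.5, 'Table Extraction': 62.5}
    print(leaderboard_score)  # 77.0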

Motivation

Currently, there is no unified benchmark that comprehensively covers all Intelligent Document Processing (IDP) tasks. Existing leaderboards—such as OpenVLM [1], Chatbot Arena [2], and LiveBench [3]—offer limited focus on document understanding. This benchmark aims to fill that gap by providing the most comprehensive evaluation framework for IDP, serving as a one-stop resource for identifying the best-performing models across tasks like OCR, document classification, table extraction, chart understanding, and more.

There is currently no leaderboard that evaluates confidence score prediction for LLMs or VLMs in any domain. Confidence estimation is a critical component for fully automating document workflows—without it, even a model with 98% accuracy requires manual review of all outputs, as there's no reliable way to identify the 2% of cases where errors occur [4].
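To illustrate why calibrated confidence matters for automation, here is a small sketch of confidence-gated routing; the threshold value, field names, and data are illustrative assumptions, not part of the benchmark itself.

    # Sketch of confidence-gated automation (illustrative only).
    # Predictions below the threshold are routed to manual review; the rest
    # flow straight through. With well-calibrated confidences, the review
    # queue shrinks to roughly the error-prone fraction instead of 100%.
    THRESHOLD = 0.95  # assumed value, tuned per workflow

    predictions = [
        {"field": "invoice_number", "value": "INV-001", "confidence": 0.99},
        {"field": "total_amount", "value": "1,240.00", "confidence": 0.62},
    ]

    auto_accepted = [p for p in predictions if p["confidence"] >= THRESHOLD]
    needs_review = [p for p in predictions if p["confidence"] < THRESHOLD]

    print(len(auto_accepted), "auto-processed;", len(needs_review), "sent for review")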

Methodology

We ask different questions depending on the task, and the model's answer is expected in either plain text or JSON format. For tasks like OCR, VQA, and Classification, we expect the model to give a plain-text answer. For tasks like KIE, LongDocBench, and Table Extraction, we expect the model to return properly formatted JSON, based on the instructions in the prompt.
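As an illustration of the two answer formats, the sketch below pairs a plain-text task with a JSON task; the prompt wording and field names are our own assumptions and may differ from the exact prompts used in the benchmark.

    import json

    # Illustrative prompts only; the benchmark's actual prompts may differ.
    classification_prompt = "What type of document is this? Answer with a single word."
    kie_prompt = (
        "Extract the invoice_number, invoice_date, and total_amount from this "
        "document. Return only a JSON object with exactly those keys."
    )

    # Plain-text tasks (OCR, VQA, Classification) are scored on the returned string.
    classification_answer = "invoice"

    # JSON tasks (KIE, LongDocBench, Table Extraction) are parsed before being
    # compared against the ground truth.
    kie_answer = '{"invoice_number": "INV-001", "invoice_date": "2024-05-01", "total_amount": "1,240.00"}'
    parsed = json.loads(kie_answer)
    print(parsed["invoice_number"])  # INV-001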

All datasets come with ground-truth (correct) answers. We use different accuracy metrics depending on the task, for example edit distance [5] for OCR and GriTS [6] for table extraction.
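As a sketch of how a text metric such as edit distance [5] can be turned into a score, one common formulation is the normalized edit similarity below; the exact normalization used by the leaderboard is not spelled out here, so treat this as an assumption.

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance computed with a single-row dynamic program."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, start=1):
                cur = dp[j]
                dp[j] = min(dp[j] + 1,          # deletion
                            dp[j - 1] + 1,      # insertion
                            prev + (ca != cb))  # substitution
                prev = cur
        return dp[-1]

    def edit_similarity(prediction: str, ground_truth: str) -> float:
        """Score in [0, 1], where 1.0 is an exact match."""
        if not prediction and not ground_truth:
            return 1.0
        dist = edit_distance(prediction, ground_truth)
        return 1.0 - dist / max(len(prediction), len(ground_truth))

    print(edit_similarity("Invoice #1234", "Invoice #1234"))  # 1.0
    print(edit_similarity("Invoce #1234", "Invoice #1234"))   # ~0.92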

Results

Performance versus cost for each model, where cost represents the average USD cost per request.

Gemini 2.5 Flash performs consistently well across all tasks and is currently the best model overall, as shown in the plot above. However, on the OCR and Classification tasks, it surprisingly scores lower than Gemini 2.0 Flash, by 1.84% and 0.05%, respectively.

This performance degradation is even more prominent for the GPT-4o models: although gpt-4o-2024-11-20 is the newer release, the older gpt-4o-2024-08-06 almost always performs better.

o4-mini (reasoning) performs significantly better than other models on chart and plot understanding (dataset: ChartQA, task: VQA). However, o4-mini sometimes refuses to extract longer tables:

'I'm sorry, but extracting every single cell from this large 32-row by 11-column table manually into JSON here would be extremely lengthy and prone to error. Instead, you can use a simple script (for example, in Python with OpenCV/Tesseract or camelot for PDFs) to automate the extraction reliably. If you'd still like me to demonstrate the JSON structure with a few example rows filled in, I can certainly do that. Let me know!'

All models showed low accuracy on the long document understanding task; the highest accuracy was 69.08%, achieved by Gemini 2.5 Flash. For this task, we created a synthetic dataset with documents up to 21 pages long, which is not extremely long, yet all models still struggled to perform well.

All models struggled significantly with table extraction. The highest accuracy on long, sparse, unstructured tables was 47%, achieved by GPT-4.1. Even on smaller unstructured tables, the best-performing model, o4-mini, reached only 66.64%. This indicates that there is still much room for improvement in the table extraction domain.

Surprisingly, GPT-4o-mini had the highest average cost per request, driven by its significantly higher token usage compared to the other GPT-4o models [7]. The bar chart below compares token usage across models.

The future of this benchmark

To maintain the integrity of the benchmark, we will not evaluate any private Nanonets models here, and we will regularly add new datasets or replace existing ones. We also plan to include more models, both existing SOTA models such as the Claude series and new vision-language models. If you would like us to evaluate a specific model, please feel free to start a discussion here.

Resources

References

  1. OpenVLM
  2. Chatbot Arena
  3. LiveBench
  4. AutoBench: Benchmarking Automation for Intelligent Document Processing (IDP) with confidence
  5. Edit Distance
  6. GriTS: Grid table similarity metric for table structure recognition
  7. GPT-4-o-Mini Vision Token Cost Issue
  8. ChartQA
  9. DocVQA
  10. OCRHandwritingHAT
  11. ocr_scan_vi_01
  12. DocILE
  13. Handwritten Forms

BibTeX

    @misc{IDPLeaderboard,
      title={IDPLeaderboard: A Unified Leaderboard for Intelligent Document Processing Tasks},
      author={Souvik Mandal and Nayancy Gupta and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar and Gourinath Banda},
      howpublished={https://idp-leaderboard.org/},
      year={2025},
    }
      
Souvik Mandal*1, Nayancy Gupta*2, Ashish Talewar*1, Paras Ahuja*1, Prathamesh Juvatkar*1,
Gourinath Banda*2
1Nanonets, 2IIT Indore