OmniDocBench

v1.5

Built by OpenDataLab, OmniDocBench contains 1,355 pages drawn from papers, books, slides, exams, newspapers, and magazines. It scores text extraction by edit distance, formula recognition by CDM, table structure by TEDS, and reading order by edit distance. Overall = ((1 − Text Edit) × 100 + Table TEDS + Formula CDM) / 3, so reading order and TEDS-S are reported but do not enter the overall score.
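The overall formula can be checked against any row of the rankings. The sketch below is only an illustration of the arithmetic: the function name `overall_score` and its parameter names are assumptions made here, not part of the benchmark's tooling.

```python
def overall_score(text_edit: float, table_teds: float, formula_cdm: float) -> float:
    """Combine the three component metrics into the Overall score.

    text_edit is an edit distance in [0, 1] (lower is better);
    table_teds and formula_cdm are on a 0-100 scale (higher is better).
    """
    text_score = (1.0 - text_edit) * 100.0  # flip direction and rescale to 0-100
    return (text_score + table_teds + formula_cdm) / 3.0


# Gemini-3-Flash row from the rankings: Text Edit 0.077, TEDS 87.7, CDM 90.2
print(round(overall_score(0.077, 87.7, 90.2), 1))  # prints 90.1
```

Because the three terms are averaged with equal weight, a weak table score pulls the overall figure down as much as weak text or formula recognition does.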



Rankings

#	Model	Organization	Overall	Text Edit↓	CDM↑	TEDS↑	TEDS-S↑	Read Order↓
1	Gemini-3-Flash	Google	90.1	0.077	90.2	87.7	92.6	0.081
2	Nanonets OCR2+	Nanonets	89.5	0.056	90.3	79.1	83.6	0.090
3	Gemini-3-Pro	Google	88.8	0.078	87.3	87.0	91.7	0.084
4	GPT-5.2	OpenAI	88.0	0.111	90.1	84.9	89.5	0.098
5	Claude Sonnet 4.6	Anthropic	86.9	0.165	90.2	87.1	91.2	0.149
6	Claude Opus 4.6	Anthropic	85.9	0.151	88.5	84.4	89.1	0.136
7	Datalab Marker	Datalab	85.5	0.109	88.3	79.1	83.7	0.106
8	Gemini 3.1 Pro	Google	85.3	0.082	83.3	80.8	85.4	0.073
9	GPT-5.4	OpenAI	85.3	0.089	83.4	81.3	86.7	0.077
10	GPT-5-Mini	OpenAI	82.5	0.138	86.7	74.6	80.1	0.121
11	GPT-4.1	OpenAI	79.9	0.167	82.2	74.0	83.8	0.115
12	Claude Haiku 4.5	Anthropic	79.6	0.224	84.2	77.1	83.8	0.178
13	Ministral-8B	Mistral AI	78.3	0.157	83.3	67.1	73.8	0.125
14	GLM-OCR	Zhipu AI	69.2	0.144	84.7	37.4	39.3	0.141
15	GPT-5-Nano	OpenAI	63.4	0.319	61.0	61.2	69.5	0.243
16	Gemma-3-12B-IT	Google	44.6	0.476	50.0	31.6	46.9	0.364
17	Llama-3.2-Vision-11B	Meta	44.6	0.541	55.4	32.6	42.9	0.340
18	Pixtral-12B	Mistral AI	42.3	0.641	58.8	32.1	50.8	0.422

Metrics

Text Edit (lower is better)

Normalized character-level edit distance between predicted and ground-truth text blocks, reported on a 0 to 1 scale. Lower values indicate more accurate text extraction.
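A minimal sketch of a character-level edit distance scaled to the 0 to 1 range is shown here. It assumes plain Levenshtein distance divided by the length of the longer string; the benchmark's own block matching and normalization may differ, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, truth: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(predicted, truth) / longest


# One substituted character ("0" for "o") in a 12-character string
print(normalized_edit_distance("OmniD0cBench", "OmniDocBench"))
```

Under this normalization a perfect extraction scores 0.0 and a completely wrong one scores 1.0, which is why the overall formula uses 1 − Text Edit.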

CDM (higher is better)

Character Detection Matching score for display formulas. Measures structural and symbolic accuracy of recognized mathematical expressions.

TEDS (higher is better)

Tree Edit Distance-based Similarity for tables. Evaluates both content and structure of extracted tables.
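For reference, TEDS is usually written as a tree edit distance normalized by the size of the larger tree. The notation below is an assumption introduced for illustration: T_a and T_b are the predicted and ground-truth tables parsed as HTML trees, |T| is the number of nodes in a tree, and EditDist is the tree edit distance between them.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

The formula yields a value between 0 and 1; the rankings appear to report it multiplied by 100, so identical trees would score 100.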

TEDS-S (higher is better)

Structure-only TEDS that ignores cell content. Focuses purely on table layout and cell spanning.

Read Order (lower is better)

Normalized edit distance between the predicted and ground-truth reading order, reported on a 0 to 1 scale. It measures how well the model preserves the correct order across multi-column and complex layouts.