OlmOCR Bench

v1.0

OlmOCR Bench is a set of 7,010 pass/fail tests built by Allen AI. A model takes a page image as input and returns structured text. The leaderboard reports five categories: whether the output renders math correctly, preserves table layout, picks up headers and captions, suppresses watermarks and page numbers, and maintains reading order. Source documents include arXiv papers, old scans, financial filings, and multi-column layouts.

Dataset Size

1,403 pages · 7,010 tests

Overall Score = average of the Math, Table, Present, Absent, Order, and Base scores. The Base score is included in the average but is not shown as a column in the rankings.
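The formula above can be sketched in a few lines. This is a minimal illustration, assuming "average" means an unweighted mean of the six scores; the function names `overall_score` and `implied_base` are hypothetical and not part of any benchmark tooling. Because Base is not published as a column, the second helper only backs out the value implied by a published row, and that value is approximate since every published number is rounded to one decimal.

```python
from statistics import mean


def overall_score(math, table, present, absent, order, base):
    """Unweighted mean of the six category scores (assumed weighting)."""
    return mean([math, table, present, absent, order, base])


def implied_base(overall, math, table, present, absent, order):
    """Base score that makes the six-way mean equal the published overall."""
    return 6 * overall - (math + table + present + absent + order)


# First row of the rankings (overall 82.2); the five published category scores.
row = dict(math=82.5, table=86.3, present=71.6, absent=93.1, order=76.1)

base = implied_base(82.2, **row)
recomputed = overall_score(base=base, **row)

print(round(base, 1))        # implied Base score, subject to rounding error
print(round(recomputed, 1))  # recovers the published overall score
```

Averaging only the five visible columns for that row gives about 81.9 rather than 82.2, which is why the Base term matters when checking the table by hand.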

Rankings

#	Model	Provider	Overall	Math	Table	Present	Absent	Order
1	Nanonets OCR2+	Nanonets	82.2	82.5	86.3	71.6	93.1	76.1
2	Datalab Marker	Datalab	82.1	84.2	83.3	75.5	88.6	73.5
3	Gemini 3.1 Pro	Google	74.6	82.3	89.8	70.5	43.3	65.0
4	Claude Sonnet 4.6	Anthropic	74.4	87.1	86.0	47.0	41.8	66.6
5	Claude Opus 4.6	Anthropic	73.9	86.1	84.5	49.1	39.9	67.7
6	Gemini-3-Pro	Google	73.5	80.1	85.2	73.4	30.5	73.8
7	GPT-5.4	OpenAI	73.4	83.1	91.1	66.9	25.2	74.7
8	GPT-5.2	OpenAI	72.2	79.3	86.8	68.9	27.1	72.7
9	Gemini-3-Flash	Google	69.2	79.2	64.5	75.7	27.8	69.0
10	GLM-OCR	Zhipu AI	66.7	68.4	58.9	36.1	89.1	71.9
11	Ministral-8B	Mistral AI	57.8	50.8	79.2	42.0	55.7	71.4
12	GPT-5-Mini	OpenAI	56.7	51.5	70.6	70.2	26.1	74.4
13	Claude Haiku 4.5	Anthropic	56.2	58.7	83.3	25.5	42.0	53.4
14	GPT-4.1	OpenAI	55.5	60.0	59.1	47.3	34.9	59.4
15	Llama-3.2-Vision-11B	Meta	47.2	39.4	60.8	30.5	69.9	51.9
16	Pixtral-12B	Mistral AI	36.8	32.8	46.7	15.4	74.5	25.1
17	GPT-5-Nano	OpenAI	22.8	2.4	55.1	13.0	43.1	47.0
18	Gemma-3-12B-IT	Google	20.6	10.0	34.7	14.6	67.9	7.7

Metrics

Math (higher is better)

Accuracy of mathematical equation rendering and LaTeX output.

Table (higher is better)

Fidelity of table structure and content preservation.

Present (higher is better)

Correct inclusion of visible document elements like headers and captions.

Absent (higher is better)

Correct suppression of artifacts like watermarks and page numbers.

Order (higher is better)

Accuracy of multi-column and complex layout reading order.
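
To make the pass/fail idea concrete, here is a rough sketch of how Present, Absent, and Order checks could be expressed, and how a category score could be derived as a pass rate. Everything here is an assumption for illustration: the function names are hypothetical, the matching is exact substring matching, and the score is taken to be the percentage of tests passed. The benchmark's actual matching rules and scoring may differ.

```python
def check_present(output: str, text: str) -> bool:
    """Pass if the expected text appears in the model output."""
    return text in output


def check_absent(output: str, text: str) -> bool:
    """Pass if an unwanted artifact does not appear in the model output."""
    return text not in output


def check_order(output: str, before: str, after: str) -> bool:
    """Pass if both snippets appear and `before` starts earlier than `after`."""
    i = output.find(before)
    j = output.find(after)
    return i != -1 and j != -1 and i < j


def pass_rate(results: list[bool]) -> float:
    """Percentage of passed tests; 0.0 for an empty list."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Made-up model output for a single page (illustrative only).
page = "Quarterly Report\nRevenue rose 4%.\nCosts fell 2%."

results = [
    check_present(page, "Quarterly Report"),         # header was kept
    check_absent(page, "CONFIDENTIAL"),              # watermark was dropped
    check_order(page, "Revenue rose", "Costs fell"), # reading order preserved
    check_absent(page, "Quarterly Report"),          # deliberately failing check
]

print(pass_rate(results))  # 75.0
```

Under this reading, a low Absent score next to a high Present score would mean a model transcribes nearly everything on the page, including the watermarks and page numbers it was supposed to drop.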