Case study · Reading tables from images

Teaching a small model to read tables better than the frontier.

A table packs data into rows, columns and merged headers, and to use any of it a model has to rebuild that structure exactly, every cell in the right place. Get one wrong and the data is wrong. We took 1,421 tables spanning subjects like accounting, statistics and operations research, each labelled by hand with its correct structure, and fine-tuned small open models that can read images, training them to turn a table picture into clean, machine-readable HTML. The best, an 8B model, scores 84.9% on TEDS, a standard measure of how closely the output matches the real table. That’s ahead of every big cloud model we tested, including GPT-5.5 and Gemini 3 Pro. The released 7B model is right behind, and the model and a data sample are public.

Method

Supervised fine-tuning (QLoRA)

Domain

Table structure recognition

Base models

Qwen2.5-VL-7B / Qwen3-VL-8B

Evaluation

135 held-out tables · TEDS / S-TEDS

84.9% TEDS

The best fine-tuned model on the 135-table held-out set — the top score of every system we tested.

+7.3pts

TEDS margin over the strongest frontier cloud model, GPT-5.5, on the same tables.

1,421

Tables in the whole dataset. Specialisation, not scale, drives the result.

Reproducibility

Everything here is public.

The fine-tuned model and a dataset sample are public, and the held-out tables are the same ones every model in the comparison was scored on.

Open the model on Hugging Face → Download a data sample →

The problem

What it takes to read a table correctly.

A table is one of the densest ways a document stores information, and to use any of it you have to recover the structure exactly: which value sits in which row and column, which header spans which cells, how the nesting works. Get one cell wrong and the data it carries is wrong.

Frontier vision-language models handle generic web tables well. They struggle on the tables that matter most in practice: the dense, multi-level layouts in financial statements, statistical tables and engineering references, where headers span several rows, cells merge, and columns nest. Those are exactly the tables in Indian academic textbooks, which is why we built the dataset from them: nine textbooks across six subjects, from financial accounting to operations research.

What anyone digitising documents wants is plain: hand the model a picture of a table and get back its exact structure as HTML, every cell in the right row and column, every merged header preserved, ready to query. This is where general models break. They read the words but lose the structure, collapsing a multi-row header into one line, merging cells that should stay separate, or dropping a column, returning HTML that looks like a table but no longer matches the one in the image. Here is the kind of table we mean.

Financial · multi-row header

A ledger with a two- or three-row header — Particulars, then Debit and Credit, then sub-columns — above rows that each align to the deepest heading.

What it takes: hold the whole header hierarchy and map every figure to the right leaf column. Flatten the header and every number lands under the wrong heading.

Statistical · merged cells

A frequency table where one class label spans several rows, or a single heading spans several columns.

What it takes: detect each span and reproduce it with the right row and column attributes. Miss a span and every row below shifts out of alignment.

Lookup · dense grid

A reference grid — interest factors, conversion values — that is mostly numbers under thin headers.

What it takes: keep every cell in its exact row and column across a large grid. Drop or shift one and the lookup returns the wrong value.

What makes these tables trainable is that the correct structure is unambiguous: there is one right HTML for each, written and checked by an expert. That gold structure is what we fine-tune on, and it is where the approach begins.

The approach

What we did, in four steps.

01 · Curate

Real tables, hard layouts

1,421 tables from nine Indian academic textbooks across six subjects — financial accounting, business statistics, quantitative techniques, operations research, engineering references and ethics — chosen for the multi-row headers, merged cells and nesting that generic data lacks.

02 · Annotate

Gold-standard structure

Each table annotated as ground-truth HTML with semantic tags and merged-cell attributes, plus a metadata record of domain, table type and structure. Every annotation human-verified.

03 · Fine-tune

Efficient adaptation (QLoRA)

Supervised fine-tuning of two open vision-language models, Qwen2.5-VL-7B and Qwen3-VL-8B, with LoRA adapters on the vision and language layers under 4-bit quantisation. Rank 32, three epochs, on the 1,141-table training split. Modest compute.

04 · Evaluate

Held-out, like-for-like

TEDS and S-TEDS on a 135-table held-out set. Every frontier cloud model was run on the same tables, prompted zero-shot with the identical instruction, so the comparison is like-for-like in-domain.

The result

A fine-tuned 8B model tops every frontier system we tested.

All scores are TEDS and S-TEDS on the same 135-table held-out set. The fine-tuned models are specialised on the domain; the cloud models are general-purpose, prompted zero-shot with the identical instruction.

DrishtiTable test set (135 tables) · higher is better
Model	TEDS (%)	S-TEDS (%)
Qwen3-VL-8B · fine-tuned	84.89	90.43
Qwen2.5-VL-7B · fine-tuned, released	83.20	89.70
GPT-5.5	77.58	89.86
GPT-5.4	73.64	85.66
Gemini 3 Pro Preview	67.95	77.03
Gemini 3.1 Pro Preview	66.96	76.62
Claude Sonnet 4.6	53.53	61.83

How to read this. The fine-tuned models are specialised on the target domain and evaluated on a held-out set; the cloud models are general-purpose, prompted zero-shot with the same instruction. This measures in-domain table structure recognition, not general capability. The two metrics differ on purpose: full TEDS scores content and structure together, where the 8B fine-tune leads by 7.3 points; structure-only S-TEDS is closer at the top, so the fine-tune’s clearest advantage is getting the whole table right, cells and all. The publicly released checkpoint is the Qwen2.5-VL-7B at 83.2 TEDS, itself ahead of every cloud model tested; the 84.9 headline is the Qwen3-VL-8B variant.

The outcome

The model now recovers the structure the frontier loses.

Before fine-tuning, general models failed on these tables in specific, recognisable ways. Each is a place where the structure breaks:

Collapsed headers. A multi-row header flattened into a single line, so every figure below it lands under the wrong heading.
Mis-merged cells. Cells that should span several rows or columns split apart, or separate cells fused, shifting the whole grid out of alignment.
A dropped or invented column. A column missed entirely, or a phantom one added, so every row after it sits one place off.
Right text, wrong place. The words read correctly but land in the wrong cell. The content is there; the structure is not.

Fine-tuning on gold-annotated tables fixes these directly, because the model is trained to reproduce the exact structure, cell for cell. On the held-out tables the best fine-tuned model reaches 84.9% TEDS, ahead of every frontier cloud model tested and 7.3 points clear of the strongest, GPT-5.5. The publicly released 7B model is right behind at 83.2, itself ahead of every cloud system. The tables that general models mangle now come back as clean, faithful HTML.

The model and a dataset sample are public. And because the gain came from a small set of gold-annotated tables and an efficient fine-tune, the same recipe carries to any document domain where the structure has to be right — financial statements, regulatory filings, forms, scientific tables.

Reproduce it

Run it on your own tables.

The fine-tuned model and a dataset sample are public, so the result can be checked on tables of your own.

Model · released 7B checkpointOpen on Hugging Face → Dataset · 20-table public sampleOpen the dataset →

from transformers import AutoProcessor, AutoModelForImageTextToText

repo = "Nalandadata/DrishtiTable-Qwen2.5-VL-7B"
model = AutoModelForImageTextToText.from_pretrained(repo, device_map="auto")
processor = AutoProcessor.from_pretrained(repo)

inputs = processor(images=table_image, text="Convert this table to HTML.",
                   return_tensors="pt").to(model.device)
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))

The released checkpoint is the Qwen2.5-VL-7B fine-tune — 83.2% TEDS, itself ahead of every cloud model tested. The 84.9% headline is the Qwen3-VL-8B variant.

Questions

The detail behind the result.

How were the cloud models evaluated?+

Each cloud model was given the same instruction used to fine-tune our models and scored with TEDS and S-TEDS on the identical 135-table held-out set. The fine-tuned models are domain-specialised; the cloud models are general-purpose. The result reflects in-domain table structure recognition, not overall capability.

Why only about 1,400 tables?+

Because quality is the constraint, not quantity. The model specialises from a small set of gold-annotated examples. The point of the work is that the right thousand-odd tables beat a general model trained on far more.

Which model is released?+

The Qwen2.5-VL-7B fine-tune, at 83.2% TEDS — already ahead of every cloud model tested. The 84.9% headline is the larger Qwen3-VL-8B variant, which needs more training data to realise its edge.

What about structure-only scores?+

On S-TEDS, which scores structure alone and ignores cell content, the strongest cloud model is close. The fine-tune’s clearest advantage is on full TEDS, getting the whole table right, content and structure together. Both columns are on this page.

Can we get the data, or data for our own domain?+

Yes. A 20-table sample is public, and the curate-annotate-fine-tune pipeline extends to other structured-document domains. Download a sample, or talk to a researcher to scope it.

Try it

Want this accuracy on your own document tables?

Download a sample Talk to a researcher

Related research