Case study · Reading tables from images
Teaching a small model to read tables better than the frontier.
A table packs data into rows, columns and merged headers, and to use any of it a model has to rebuild that structure exactly, every cell in the right place. Get one wrong and the data is wrong. We took 1,421 tables spanning subjects like accounting, statistics and operations research, each labelled by hand with its correct structure, and fine-tuned small open models that can read images, training them to turn a table picture into clean, machine-readable HTML. The best, an 8B model, scores 84.9% on TEDS, a standard measure of how closely the output matches the real table. That’s ahead of every big cloud model we tested, including GPT-5.5 and Gemini 3 Pro. The released 7B model is right behind, and the model and a data sample are public.
Reproducibility
Everything here is public.
The fine-tuned model and a dataset sample are public, and the held-out tables are the same ones every model in the comparison was scored on.
The problem
What it takes to read a table correctly.
A table is one of the densest ways a document stores information, and to use any of it you have to recover the structure exactly: which value sits in which row and column, which header spans which cells, how the nesting works. Get one cell wrong and the data it carries is wrong.
Frontier vision-language models handle generic web tables well. They struggle on the tables that matter most in practice: the dense, multi-level layouts in financial statements, statistical tables and engineering references, where headers span several rows, cells merge, and columns nest. Those are exactly the tables in Indian academic textbooks, which is why we built the dataset from them: nine textbooks across six subjects, from financial accounting to operations research.
What anyone digitising documents wants is plain: hand the model a picture of a table and get back its exact structure as HTML, every cell in the right row and column, every merged header preserved, ready to query. This is where general models break. They read the words but lose the structure, collapsing a multi-row header into one line, merging cells that should stay separate, or dropping a column, returning HTML that looks like a table but no longer matches the one in the image. Here is the kind of table we mean.
A ledger with a two- or three-row header — Particulars, then Debit and Credit, then sub-columns — above rows that each align to the deepest heading.
What it takes: hold the whole header hierarchy and map every figure to the right leaf column. Flatten the header and every number lands under the wrong heading.
A frequency table where one class label spans several rows, or a single heading spans several columns.
What it takes: detect each span and reproduce it with the right row and column attributes. Miss a span and every row below shifts out of alignment.
A reference grid — interest factors, conversion values — that is mostly numbers under thin headers.
What it takes: keep every cell in its exact row and column across a large grid. Drop or shift one and the lookup returns the wrong value.
What makes these tables trainable is that the correct structure is unambiguous: there is one right HTML for each, written and checked by an expert. That gold structure is what we fine-tune on, and it is where the approach begins.
The approach
What we did, in four steps.
Real tables, hard layouts
1,421 tables from nine Indian academic textbooks across six subjects — financial accounting, business statistics, quantitative techniques, operations research, engineering references and ethics — chosen for the multi-row headers, merged cells and nesting that generic data lacks.
Gold-standard structure
Each table annotated as ground-truth HTML with semantic tags and merged-cell attributes, plus a metadata record of domain, table type and structure. Every annotation human-verified.
Efficient adaptation (QLoRA)
Supervised fine-tuning of two open vision-language models, Qwen2.5-VL-7B and Qwen3-VL-8B, with LoRA adapters on the vision and language layers under 4-bit quantisation. Rank 32, three epochs, on the 1,141-table training split. Modest compute.
Held-out, like-for-like
TEDS and S-TEDS on a 135-table held-out set. Every frontier cloud model was run on the same tables, prompted zero-shot with the identical instruction, so the comparison is like-for-like in-domain.
The result
A fine-tuned 8B model tops every frontier system we tested.
All scores are TEDS and S-TEDS on the same 135-table held-out set. The fine-tuned models are specialised on the domain; the cloud models are general-purpose, prompted zero-shot with the identical instruction.
| Model | TEDS (%) | S-TEDS (%) |
|---|---|---|
| Qwen3-VL-8B · fine-tuned | 84.89 | 90.43 |
| Qwen2.5-VL-7B · fine-tuned, released | 83.20 | 89.70 |
| GPT-5.5 | 77.58 | 89.86 |
| GPT-5.4 | 73.64 | 85.66 |
| Gemini 3 Pro Preview | 67.95 | 77.03 |
| Gemini 3.1 Pro Preview | 66.96 | 76.62 |
| Claude Sonnet 4.6 | 53.53 | 61.83 |
How to read this. The fine-tuned models are specialised on the target domain and evaluated on a held-out set; the cloud models are general-purpose, prompted zero-shot with the same instruction. This measures in-domain table structure recognition, not general capability. The two metrics differ on purpose: full TEDS scores content and structure together, where the 8B fine-tune leads by 7.3 points; structure-only S-TEDS is closer at the top, so the fine-tune’s clearest advantage is getting the whole table right, cells and all. The publicly released checkpoint is the Qwen2.5-VL-7B at 83.2 TEDS, itself ahead of every cloud model tested; the 84.9 headline is the Qwen3-VL-8B variant.
The outcome
The model now recovers the structure the frontier loses.
Before fine-tuning, general models failed on these tables in specific, recognisable ways. Each is a place where the structure breaks:
- Collapsed headers. A multi-row header flattened into a single line, so every figure below it lands under the wrong heading.
- Mis-merged cells. Cells that should span several rows or columns split apart, or separate cells fused, shifting the whole grid out of alignment.
- A dropped or invented column. A column missed entirely, or a phantom one added, so every row after it sits one place off.
- Right text, wrong place. The words read correctly but land in the wrong cell. The content is there; the structure is not.
Fine-tuning on gold-annotated tables fixes these directly, because the model is trained to reproduce the exact structure, cell for cell. On the held-out tables the best fine-tuned model reaches 84.9% TEDS, ahead of every frontier cloud model tested and 7.3 points clear of the strongest, GPT-5.5. The publicly released 7B model is right behind at 83.2, itself ahead of every cloud system. The tables that general models mangle now come back as clean, faithful HTML.
The model and a dataset sample are public. And because the gain came from a small set of gold-annotated tables and an efficient fine-tune, the same recipe carries to any document domain where the structure has to be right — financial statements, regulatory filings, forms, scientific tables.
Reproduce it
Run it on your own tables.
The fine-tuned model and a dataset sample are public, so the result can be checked on tables of your own.
from transformers import AutoProcessor, AutoModelForImageTextToText
repo = "Nalandadata/DrishtiTable-Qwen2.5-VL-7B"
model = AutoModelForImageTextToText.from_pretrained(repo, device_map="auto")
processor = AutoProcessor.from_pretrained(repo)
inputs = processor(images=table_image, text="Convert this table to HTML.",
return_tensors="pt").to(model.device)
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))
The released checkpoint is the Qwen2.5-VL-7B fine-tune — 83.2% TEDS, itself ahead of every cloud model tested. The 84.9% headline is the Qwen3-VL-8B variant.
Questions
The detail behind the result.
How were the cloud models evaluated?+
Why only about 1,400 tables?+
Which model is released?+
What about structure-only scores?+
Can we get the data, or data for our own domain?+
Related research