Case study · Science diagram understanding

Teaching a vision model to read science diagrams.

We wanted a model that could look at a science diagram — a circuit, a molecule, a geometry figure — and reason its way to the answer. That’s hard, because science diagrams use their own visual shorthand, nothing like the everyday photos these models learn from. So we built 22,679 science questions, each paired with its diagram and a step-by-step worked answer, and fine-tuned a LLaMA-3.2-Vision-11B model on them. The key move: we trained the part of the model that sees, not just the part that reads, since the seeing is where these models fall short on diagrams. On a held-out set its accuracy rose from 38.3% to 47.5%, a 9.3-point gain, with mathematics nearly tripling. The model and a dataset sample are public.

Method

Visual instruction tuning (QLoRA)

Domain

Science diagram understanding

Base model

LLaMA-3.2-Vision-11B

Evaluation

162 held-out questions · 4 STEM subjects

+9.3 pts

Overall accuracy on the held-out set, 38.3% → 47.5%, after fine-tuning on the dataset.

+23.5 pts

Mathematics, the largest subject gain — accuracy nearly tripled, from 14.7% to 38.2%.

22,679

Multimodal science QA pairs across four subjects, every answer with its reasoning.

Reproducibility

Everything here is public.

The fine-tuned model and a dataset sample are public, and the held-out questions are the same ones the baseline was scored on.

Open the model on Hugging Face → Download a data sample →

The problem

What it takes to read a science diagram.

A large share of science is diagram-dependent. A physics question hinges on a circuit, a maths question on a geometric figure, a chemistry question on a molecular structure. To answer it, a model has to read the diagram correctly first; everything after that depends on getting the picture right.

Vision-language models read natural photographs well but break on scientific diagrams, because diagrams use a visual vocabulary of their own. A zig-zag line is a resistor, a hexagon with a ring inside is benzene, an arc with a label is an angle. The symbols carry exact meaning, the spatial relationships matter, and a single misread (a subscript, an axis, which line is which) sends the whole answer wrong. None of it looks like the natural images these models were trained on.

What anyone building science tools wants is plain: show the model the diagram and the question, and have it reason from what it sees to the right answer — identify the circuit, read the graph, follow the construction. This is where general models fail. They describe a photograph fluently, but on a circuit or a geometry figure they misread the symbols, lose the spatial detail, or can't connect what they see to the principle that answers the question. Here is the kind of diagram we mean.

Physics · circuit schematic

A circuit drawn with standard symbols — resistors, a battery, a junction — where the answer depends on which components sit in series and which in parallel.

What it takes: recognise each symbol, read the topology, then apply the right law. Misread one symbol and the analysis is wrong from the start.

Mathematics · geometric construction

A figure with labelled points, angles and lines, where the relationships in the drawing — which segments are equal, which angle is marked — are the problem.

What it takes: read the construction precisely and reason from it. The base model often can't see these relationships at all.

Chemistry · molecular structure

A structure where the bonds, rings and groups define the molecule, and the answer options may themselves be diagrams.

What it takes: parse the structure exactly and tell near-identical molecules apart. A small misread changes the compound.

What makes these diagrams learnable is the reasoning that goes with them. Every question in the dataset carries a chain-of-thought answer that names the principle and walks from the diagram to the result, so the model learns not just the label but the path to it. That is what we fine-tune on, and where the approach begins.

The approach

What we did, in four steps.

01 · Curate

Diagrams across four sciences

22,679 multimodal questions, filtered from a pool of 180,505 (about 12.6% kept), spanning physics, mathematics, chemistry and biology. The diagrams include circuits, ray diagrams, geometric constructions, function graphs, molecular structures and cell diagrams, and the images play three roles: question diagrams, solution illustrations and option images.

02 · Reason-annotate

Chain-of-thought on every answer

Each answer carries a full chain-of-thought explanation that identifies the answer, names the underlying principle, and walks from the visual to the result. The model learns the path, not just the label.
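A minimal sketch of what one curated, reason-annotated record could look like, and how it might be turned into a chat-style training pair. The three image roles and the `[Subject: ...]` tag come from this page; the class names, field names, the `Answer:` suffix and the choice to attach only question-role images to the prompt are assumptions made for illustration, not the released schema.

```python
from dataclasses import dataclass, field

# The three image roles named in the curation step.
IMAGE_ROLES = ("question", "solution", "option")


@dataclass
class DiagramImage:
    path: str  # hypothetical: location of the image file
    role: str  # one of IMAGE_ROLES

    def __post_init__(self):
        if self.role not in IMAGE_ROLES:
            raise ValueError(f"unknown image role: {self.role!r}")


@dataclass
class ScienceQA:
    subject: str    # "Physics", "Mathematics", "Chemistry" or "Biology"
    question: str
    reasoning: str  # chain-of-thought: names the principle, walks diagram to result
    answer: str
    images: list[DiagramImage] = field(default_factory=list)


def to_chat(example: ScienceQA) -> list[dict]:
    """Build a user/assistant message pair with the subject tag prepended."""
    question_images = [im for im in example.images if im.role == "question"]
    user_content = [{"type": "image"} for _ in question_images]
    user_content.append(
        {"type": "text", "text": f"[Subject: {example.subject}] {example.question}"}
    )
    # Assumed target layout: reasoning first, final answer on its own line.
    target = f"{example.reasoning}\nAnswer: {example.answer}"
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": [{"type": "text", "text": target}]},
    ]


# Made-up record, for illustration only.
example = ScienceQA(
    subject="Physics",
    question="Find the equivalent resistance.",
    reasoning="The two resistors are in series, so their resistances add.",
    answer="30 ohms",
    images=[DiagramImage("circuit.png", "question")],
)
chat = to_chat(example)
```

Keeping the reasoning in the assistant turn is what lets the model learn the path and not only the final label.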

03 · Fine-tune

Train the vision layers, not just the language

QLoRA fine-tuning of LLaMA-3.2-Vision-11B on a single A100: base weights quantised to 4-bit, LoRA adapters of rank 32. The key choice: we train the vision layers as well as the language layers. Many fine-tunes freeze the vision encoder; science diagrams sit too far from natural images for that, so the model has to adapt what it sees.
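The difference between freezing the vision encoder and training it comes down to which modules receive LoRA adapters. The sketch below shows that selection in plain Python. The module names are illustrative (real names depend on the library version and should be listed from the loaded model), and restricting adapters to attention projections is an assumption; the page states only the rank, the 4-bit base and that vision layers are trained.

```python
import re

# Illustrative module names in the style of a vision-language checkpoint.
EXAMPLE_MODULES = [
    "vision_model.transformer.layers.0.self_attn.q_proj",
    "vision_model.transformer.layers.0.self_attn.v_proj",
    "vision_model.transformer.layers.0.mlp.fc1",
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.model.layers.0.self_attn.v_proj",
    "language_model.model.layers.0.mlp.gate_proj",
    "language_model.lm_head",
]

# Assumption: adapters go on attention projections only.
ATTENTION_PROJ = re.compile(r"\.(q_proj|k_proj|v_proj|o_proj)$")

LORA_RANK = 32  # from the case study; other hyperparameters are not stated there


def lora_targets(module_names, train_vision=True):
    """Return the module names that should receive LoRA adapters.

    With train_vision=False the vision tower gets no adapters and so stays
    frozen, which is the common setup the case study departs from.
    """
    targets = []
    for name in module_names:
        if not ATTENTION_PROJ.search(name):
            continue
        if name.startswith("vision_model.") and not train_vision:
            continue
        targets.append(name)
    return targets


with_vision = lora_targets(EXAMPLE_MODULES, train_vision=True)
language_only = lora_targets(EXAMPLE_MODULES, train_vision=False)
```

A list like `with_vision` could then be passed as `target_modules` to a PEFT `LoraConfig` with `r=LORA_RANK`; the remaining settings would need to be chosen separately.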

04 · Evaluate

Held-out, per subject

162 held-out questions the model never saw, each scored with its diagram, against the base LLaMA-3.2-Vision-11B run zero-shot as the baseline. Broken down by subject, so it is clear where the gains land and where they don't.
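A per-subject breakdown of this kind can be sketched as below. Case-insensitive exact matching of the final answer is an assumption; the page does not say how answers were extracted or compared. The function name and the tuple layout are likewise illustrative.

```python
from collections import defaultdict


def accuracy_by_subject(results):
    """results: iterable of (subject, predicted, gold) tuples.

    Returns {subject: percent correct}, plus an "Overall" entry pooled over
    all questions (so larger subjects weigh more in the overall figure).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in results:
        hit = predicted.strip().lower() == gold.strip().lower()
        for key in (subject, "Overall"):
            total[key] += 1
            correct[key] += hit
    return {key: round(100 * correct[key] / total[key], 1) for key in total}


# Toy data, for illustration only.
toy = [
    ("Physics", "A", "A"),
    ("Physics", "B", "C"),
    ("Chemistry", "d", "D"),
]
scores = accuracy_by_subject(toy)
```

Running the same function over the baseline's and the fine-tuned model's predictions on the same questions gives the two columns to compare.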

The result

The gains land where the diagrams are hardest.

All figures are accuracy on the 162-question held-out set: the fine-tuned model against the same LLaMA-3.2-Vision-11B run zero-shot, each question scored with its diagram.

Held-out evaluation set (162 questions) · accuracy %, higher is better
Subject	Baseline (zero-shot)	Fine-tuned (Nalanda Image VL)	Δ (points)
Mathematics	14.7	38.2	+23.5
Biology	37.8	51.4	+13.5
Physics	38.5	44.2	+5.8
Chemistry	59.0	56.4	−2.6
Overall	38.3	47.5	+9.3

How to read this. The held-out questions come from the same distribution as the training data, and the model never saw them; each is scored with its diagram present. The lift concentrates where the base model was weakest and the diagrams least like natural images: mathematics nearly tripled, and biology rose 13.5 points. Chemistry, where the base model was already strongest, dips 2.6 points; training one strong subject alongside three weaker ones can erode it slightly. The overall gain is +9.3 points on the 162-question held-out set.
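Some Δ values differ by 0.1 from the difference of the rounded columns (47.5 − 38.3 is 9.2, not 9.3), which suggests the deltas were computed before rounding. The page does not publish per-subject question counts, so the sketch below infers them: it searches for whole-number counts that reproduce every reported percentage and sum to 162. The result is a reconstruction under that assumption, not a published figure.

```python
from itertools import product

# (baseline %, fine-tuned %) per subject, copied from the table.
REPORTED = {
    "Mathematics": (14.7, 38.2),
    "Biology": (37.8, 51.4),
    "Physics": (38.5, 44.2),
    "Chemistry": (59.0, 56.4),
}
TOTAL = 162  # size of the held-out set


def fit(pct, n):
    """Return k such that k/n rounds to pct (one decimal), or None."""
    k = round(pct * n / 100)
    return k if round(100 * k / n, 1) == pct else None


def candidates(base_pct, tuned_pct):
    """All (n, baseline_correct, tuned_correct) consistent with both percentages."""
    found = []
    for n in range(1, TOTAL + 1):
        b, t = fit(base_pct, n), fit(tuned_pct, n)
        if b is not None and t is not None:
            found.append((n, b, t))
    return found


options = [candidates(*pcts) for pcts in REPORTED.values()]
solutions = [
    combo for combo in product(*options)
    if sum(n for n, _, _ in combo) == TOTAL
]

for combo in solutions:
    for subject, (n, b, t) in zip(REPORTED, combo):
        print(f"{subject}: {b}/{n} -> {t}/{n}, delta {100 * (t - b) / n:+.1f}")
    base = sum(b for _, b, _ in combo)
    tuned = sum(t for _, _, t in combo)
    print(f"Overall: {base}/{TOTAL} -> {tuned}/{TOTAL}, "
          f"delta {100 * (tuned - base) / TOTAL:+.1f}")
```

Under these assumptions the search leaves a single combination: 34 mathematics, 37 biology, 52 physics and 39 chemistry questions, with deltas that match the table once computed from the unrounded counts. On that reading, the chemistry dip would correspond to one question (23 correct to 22 of 39), and the mathematics gain to eight (5 to 13 of 34).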

The outcome

The model now reads the diagrams it used to misread.

Before fine-tuning, the model failed on science diagrams in specific, recognisable ways. Each is a place where reading the picture breaks down:

It misreads the visual vocabulary. A resistor symbol, a benzene ring or a force vector is taken for a generic shape, so the premise is wrong before any reasoning starts.
It loses the precise detail. A subscript, an axis label or which line is which gets read wrong, and a sound method lands on the wrong value.
It can't connect sight to principle. It describes the diagram but doesn't tie what it sees to the law or theorem that answers the question.
It can't tell look-alikes apart. Given near-identical diagrams as options — two molecules, two cell types — it picks the wrong one.

Fine-tuning on diagram-grounded reasoning reduces these failures, most of all where the diagrams are most distinctive. On the held-out set, overall accuracy rose 9.3 points; mathematics nearly tripled, from 14.7% to 38.2%, and biology climbed 13.5 points. The biggest gains came in the subjects whose diagrams look least like natural images, which supports training the vision layers and not just the language ones. The same model that used to misread a geometry figure now reasons from it more often.

The model and a dataset sample are public. And because the gain came from diagram-grounded questions with reasoning traces, the same recipe should carry over to other fields where the picture is the problem: engineering schematics, medical imaging, technical manuals, charts and graphs.

Reproduce it

Run it on your own diagrams.

The fine-tuned model and a dataset sample are public, so the result can be checked on diagrams of your own.

Model · fine-tuned LLaMA-3.2-Vision-11B · Open on Hugging Face →
Dataset · public sample · Open the dataset →

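import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

repo = "Nalandadata/nalanda-image-vl"
model = MllamaForConditionalGeneration.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(repo)

# Load your own diagram; "diagram.png" is a placeholder path.
diagram_image = Image.open("diagram.png").convert("RGB")

msg = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "[Subject: Physics] Solve the circuit. Think step by step."}]}]
prompt = processor.apply_chat_template(msg, add_generation_prompt=True)
# The chat template already adds the begin-of-text token, so skip adding it again.
inputs = processor(diagram_image, prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
answer_ids = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(answer_ids, skip_special_tokens=True))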

The released checkpoint is the LLaMA-3.2-Vision-11B fine-tune. Prepending the subject tag, as in training, matches how the model was evaluated; the held-out result is +9.3 points overall, the number on this page.

Questions

The detail behind the result.

How was the model evaluated?

On 162 held-out questions the model never saw during training, each scored with its diagram present, against the same LLaMA-3.2-Vision-11B run zero-shot as the baseline. We report accuracy per subject and overall.

Why train the vision layers?

Because science diagrams are visually unlike the natural images the model was pre-trained on: thin lines, symbols, precise spatial relationships. Freezing the vision encoder, as many fine-tunes do, leaves those representations unchanged. Training the vision layers lets the model adapt what it sees, and the largest gains came in the subjects whose diagrams are least like natural images.

Why did Mathematics gain the most?

Because the base model started weakest there, at 14.7%, and maths diagrams — geometric constructions, function graphs — are among the least like natural images, so there was the most to learn. Accuracy nearly tripled, to 38.2%.

What happened with Chemistry?

Chemistry dipped 2.6 points. The base model was already strongest there, at 59%, and training one strong subject alongside three weaker ones can erode it slightly. It points to subject-aware training as a next step; we report the full per-subject table rather than hide it.

Can we get the data, or data for our own domain?

Yes. A dataset sample is public, and the curate-annotate-fine-tune pipeline extends to other diagram-heavy domains. Download a sample, or talk to a researcher to scope it.
