Case study · Science diagram understanding
Teaching a vision model to read science diagrams.
We wanted a model that could look at a science diagram — a circuit, a molecule, a geometry figure — and reason its way to the answer. That’s hard, because science diagrams use their own visual shorthand, nothing like the everyday photos these models learn from. So we built 22,679 science questions, each paired with its diagram and a step-by-step worked answer, and fine-tuned a LLaMA-3.2-Vision-11B model on them. The key move: we trained the part of the model that sees, not just the part that reads, since the seeing is where these models fall short on diagrams. On a held-out set its accuracy rose from 38.3% to 47.5%, a 9.3-point gain, with mathematics nearly tripling. The model and a dataset sample are public.
Reproducibility
Everything here is public.
The fine-tuned model and a dataset sample are public, and the held-out questions are the same ones the baseline was scored on.
The problem
What it takes to read a science diagram.
A large share of science is diagram-dependent. A physics question hinges on a circuit, a maths question on a geometric figure, a chemistry question on a molecular structure. To answer it, a model has to read the diagram correctly first; everything after that depends on getting the picture right.
Vision-language models read natural photographs well, but they break on scientific diagrams, because diagrams use a visual vocabulary of their own. A zig-zag line is a resistor, a hexagonal ring is benzene, an arc with a label is an angle. The symbols carry exact meaning, the spatial relationships matter, and a single misread (a subscript, an axis, which line is which) sends the whole answer wrong. None of it looks like the natural images these models were trained on.
What anyone building science tools wants is plain: show the model the diagram and the question, and have it reason from what it sees to the right answer: identify the circuit, read the graph, follow the construction. This is where general models fail. They describe a photograph fluently, but on a circuit or a geometry figure they misread the symbols, lose the spatial detail, or can't connect what they see to the principle that answers the question. Here are the kinds of diagram we mean.
A circuit drawn with standard symbols — resistors, a battery, a junction — where the answer depends on which components sit in series and which in parallel.
What it takes: recognise each symbol, read the topology, then apply the right law. Misread one symbol and the analysis is wrong from the start.
A figure with labelled points, angles and lines, where the relationships in the drawing — which segments are equal, which angle is marked — are the problem.
What it takes: read the construction precisely and reason from it. The base model often can't see these relationships at all.
A structure where the bonds, rings and groups define the molecule, and the answer options may themselves be diagrams.
What it takes: parse the structure exactly and tell near-identical molecules apart. A small misread changes the compound.
What makes these diagrams learnable is the reasoning that goes with them. Every question in the dataset carries a chain-of-thought answer that names the principle and walks from the diagram to the result, so the model learns not just the label but the path to it. That is what we fine-tune on, and where the approach begins.
The approach
What we did, in four steps.
Diagrams across four sciences
22,679 multimodal questions, filtered from a pool of 180,505, spanning physics, mathematics, chemistry and biology — circuits, ray diagrams, geometric constructions, function graphs, molecular structures and cell diagrams — in three image roles: question diagrams, solution illustrations and option images.
Chain-of-thought on every answer
Each answer carries a full chain-of-thought explanation that identifies the answer, names the underlying principle, and walks from the visual to the result. The model learns the path, not just the label.
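As a rough illustration of what one training example might look like, the sketch below pairs a question with its image role and chain-of-thought answer and turns it into chat messages. The class name, field names, role labels and the "Answer:" suffix are assumptions made for illustration; only the three image roles and the subject tag format come from this page.

```python
from dataclasses import dataclass

# The three image roles named on this page; the identifier strings are illustrative.
IMAGE_ROLES = ("question_diagram", "solution_illustration", "option_image")


@dataclass
class DiagramQuestion:
    """Hypothetical record layout for one multimodal question."""
    subject: str      # e.g. "Physics"
    question: str     # the question text shown with the diagram
    image_role: str   # one of IMAGE_ROLES
    reasoning: str    # chain-of-thought: principle, then the walk from diagram to result
    answer: str       # the final answer


def to_chat_example(q: DiagramQuestion) -> list:
    """Build user/assistant messages with the subject tag prepended to the question."""
    if q.image_role not in IMAGE_ROLES:
        raise ValueError(f"unknown image role: {q.image_role}")
    user_text = f"[Subject: {q.subject}] {q.question}"
    target = f"{q.reasoning}\nAnswer: {q.answer}"
    return [
        {"role": "user", "content": [{"type": "image"},
                                     {"type": "text", "text": user_text}]},
        {"role": "assistant", "content": [{"type": "text", "text": target}]},
    ]


example = DiagramQuestion(
    subject="Physics",
    question="Two 4-ohm resistors are wired as shown. What is the equivalent resistance?",
    image_role="question_diagram",
    reasoning="Both resistors share the same two nodes, so they are in parallel. "
              "For two equal resistors in parallel, R_eq = R / 2 = 2 ohms.",
    answer="2 ohms",
)
messages = to_chat_example(example)
```

Keeping the reasoning and the answer in one assistant turn means the training target is the whole path from diagram to result, not only the final label.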
Train the vision layers, not just the language
QLoRA fine-tuning of LLaMA-3.2-Vision-11B in 4-bit, rank 32, on a single A100. The key choice: we train the vision layers as well as the language layers. Most fine-tunes freeze the vision encoder; science diagrams sit too far from natural images for that, so the model has to adapt what it sees.
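To make the key choice concrete, the sketch below shows how the set of adapted modules changes when the vision encoder is included. It is an illustration only: the module names are hypothetical stand-ins, and this page does not state which projections received adapters, so the attention-projection list is an assumption.

```python
# Assumption: adapters attach to attention projections; the page does not say which.
ATTENTION_PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")
LORA_RANK = 32  # the rank stated on this page


def select_adapter_targets(module_names, train_vision=True):
    """Return the module names that would receive low-rank adapters.

    With train_vision=False, anything under the vision encoder is skipped,
    which mirrors the common frozen-encoder setup.
    """
    selected = []
    for name in module_names:
        if not name.endswith(ATTENTION_PROJECTIONS):
            continue  # not an attention projection
        if not train_vision and name.startswith("vision_model."):
            continue  # frozen vision encoder
        selected.append(name)
    return selected


# Hypothetical module names, for illustration only.
modules = [
    "vision_model.transformer.layers.0.self_attn.q_proj",
    "vision_model.transformer.layers.0.self_attn.v_proj",
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.model.layers.0.mlp.gate_proj",
]

frozen_vision = select_adapter_targets(modules, train_vision=False)
with_vision = select_adapter_targets(modules, train_vision=True)
```

In this toy listing the frozen-encoder setup adapts one module and the vision-inclusive setup adapts three. The difference between the two lists is exactly the part of the model that learns to see diagrams differently.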
Held-out, per subject
162 held-out questions the model never saw, each scored with its diagram, against the same model run zero-shot as the baseline. Broken down by subject, so it is clear where the gains land and where they don't.
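A per-subject breakdown of this kind can be computed with a few lines of Python. This is a minimal sketch, assuming each scored question reduces to a (subject, is_correct) pair; the function name and the toy data are illustrative and are not the evaluation harness used here.

```python
from collections import defaultdict


def accuracy_by_subject(results):
    """Return accuracy in percent per subject, plus an "Overall" entry.

    results: iterable of (subject, is_correct) pairs, one per held-out question.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, is_correct in results:
        totals[subject] += 1
        correct[subject] += int(is_correct)
    if not totals:
        raise ValueError("no results to score")
    report = {s: 100.0 * correct[s] / totals[s] for s in totals}
    # Overall accuracy is weighted by question count, not an average of subject scores.
    report["Overall"] = 100.0 * sum(correct.values()) / sum(totals.values())
    return report


# Toy data, not real results.
toy_results = [
    ("Mathematics", True), ("Mathematics", False),
    ("Physics", True), ("Physics", True),
]
report = accuracy_by_subject(toy_results)
```

Running the same function over the baseline's answers and the fine-tuned model's answers, on the same questions, gives the two columns to compare.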
The result
The gains land where the diagrams are hardest.
All figures are accuracy on the 162-question held-out set: the fine-tuned model against the same LLaMA-3.2-Vision-11B run zero-shot, each question scored with its diagram.
| Subject | Baseline (%) | Nalanda Image VL (%) | Δ (points) |
|---|---|---|---|
| Mathematics | 14.7 | 38.2 | +23.5 |
| Biology | 37.8 | 51.4 | +13.5 |
| Physics | 38.5 | 44.2 | +5.8 |
| Chemistry | 59.0 | 56.4 | −2.6 |
| Overall | 38.3 | 47.5 | +9.3 |
How to read this. The held-out questions come from the same distribution as training, and the model never saw them; each is scored with its diagram present. The lift concentrates where the base model was weakest and the diagrams least like natural images: mathematics nearly tripled, from 14.7% to 38.2% (about 2.6 times the baseline), and biology rose 13.5 points. Chemistry, where the base model was already strongest, dips 2.6 points, a regression that can appear when one strong subject is fine-tuned alongside three weaker ones. The overall gain is +9.3 points, on a deliberately held-out set of 162 questions.
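A reader subtracting the rounded columns will find that a few deltas differ from the table by 0.1 point (for example 47.5 − 38.3 = 9.2 against a reported 9.3). The sketch below checks that every row agrees to within that rounding margin, which is what one would expect if the deltas were taken before rounding. The per-question counts at the end are hypothetical, chosen only to show how such a gap can arise; they are not reported on this page.

```python
# Rows copied from the table above: (baseline %, fine-tuned %, reported delta).
reported = {
    "Mathematics": (14.7, 38.2, 23.5),
    "Biology": (37.8, 51.4, 13.5),
    "Physics": (38.5, 44.2, 5.8),
    "Chemistry": (59.0, 56.4, -2.6),
    "Overall": (38.3, 47.5, 9.3),
}


def rounding_gap(baseline, tuned, delta):
    """Gap between the reported delta and the difference of the rounded columns."""
    return round(abs((tuned - baseline) - delta), 1)


gaps = {subject: rounding_gap(*row) for subject, row in reported.items()}
max_gap = max(gaps.values())

# Relative lift in mathematics, computed from the rounded table values.
math_ratio = reported["Mathematics"][1] / reported["Mathematics"][0]

# Hypothetical counts (an assumption, not from the source) on the 162-question set.
n, base_correct, tuned_correct = 162, 62, 77
base_pct = round(100 * base_correct / n, 1)
tuned_pct = round(100 * tuned_correct / n, 1)
delta_pct = round(100 * (tuned_correct - base_correct) / n, 1)
```

No row is off by more than 0.1 point, and the mathematics ratio comes out at roughly 2.6, so the table is internally consistent.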
The outcome
The model now reads the diagrams it used to misread.
Before fine-tuning, the model failed on science diagrams in specific, recognisable ways. Each is a place where reading the picture breaks down:
- It misreads the visual vocabulary. A resistor symbol, a benzene ring or a force vector is taken for a generic shape, so the premise is wrong before any reasoning starts.
- It loses the precise detail. A subscript, an axis label or which line is which gets read wrong, and a sound method lands on the wrong value.
- It can't connect sight to principle. It describes the diagram but doesn't tie what it sees to the law or theorem that answers the question.
- It can't tell look-alikes apart. Given near-identical diagrams as options — two molecules, two cell types — it picks the wrong one.
Fine-tuning on diagram-grounded reasoning fixes these where the diagrams are most distinctive. On the held-out set, overall accuracy rose 9.3 points; mathematics nearly tripled, from 14.7% to 38.2%, and biology climbed 13.5 points. The biggest gains came in exactly the subjects whose diagrams look least like natural images, which is the case for training the vision layers, not just the language ones. The same model that used to misread a geometry figure now reasons from it.
The model and a dataset sample are public. And because the gain came from diagram-grounded questions with reasoning traces, the same recipe carries to any field where the picture is the problem — engineering schematics, medical imaging, technical manuals, charts and graphs.
Reproduce it
Run it on your own diagrams.
The fine-tuned model and a dataset sample are public, so the result can be checked on diagrams of your own.
```python
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

repo = "Nalandadata/nalanda-image-vl"
model = MllamaForConditionalGeneration.from_pretrained(repo, device_map="auto")
processor = AutoProcessor.from_pretrained(repo)

# Load your own diagram; "diagram.png" is a placeholder path.
diagram_image = Image.open("diagram.png").convert("RGB")

msg = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "[Subject: Physics] Solve the circuit. Think step by step."}]}]
prompt = processor.apply_chat_template(msg, add_generation_prompt=True)
# The chat template already adds the begin-of-text token, so don't add it twice.
inputs = processor(diagram_image, prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
The released checkpoint is the LLaMA-3.2-Vision-11B fine-tune. Prepending the subject tag, as in training, matches how the model was evaluated; the held-out result is +9.3 points overall, the number on this page.
Questions
The detail behind the result.
How was the model evaluated?
Why train the vision layers?
Why did Mathematics gain the most?
What happened with Chemistry?
Can we get the data, or data for our own domain?