Case study · Science & maths reasoning
Teaching a small model to reason through advanced science and maths problems.
We wanted a small, cheap model, Qwen 2.5 7B, to solve advanced science and maths problems, the kind that take several steps of reasoning rather than a single recalled fact. Our material was 116,831 expert solutions from JEE and NEET, India’s toughest university-entrance exams, each with a verified correct answer. Standard supervised fine-tuning on them made the model worse, by 16.4 points in its aggressive configuration. So we used the verified answers as a reward instead: scoring the model on whether it reached the right answer, which pushed it to reason its way there rather than memorise. That lifted it 6.3 points, from 60.5% to 66.8%, on a held-out set of 800 questions. The model, a data sample and the test set are all public.
Reproducibility
Everything here is public.
The fine-tuned model and a data sample are public, as is the held-out set on which the +6.3 points are measured.
The problem
What it takes to solve one of these problems.
Maths and science are among the hardest tests of whether a model can reason. Every problem has one correct answer, reached through a chain of steps, and the model either gets it or it doesn’t. That makes a hard exam a clean way to measure what a model can really do, and what it takes to improve it.
We chose JEE and NEET because they are about as hard as competitive reasoning gets. JEE is the entrance exam for the IITs, and JEE Advanced is widely regarded as one of the toughest exams in the world. Around 1.5 million students sit JEE every year; only the top 250,000 even qualify to attempt the Advanced paper, about 54,000 clear it, and the IITs have roughly 18,000 seats. Of everyone who sits the exam, only about one in eighty ends up in an IIT. NEET, the gateway to medical school, draws over two million more each year. These exams are built so that memorisation and pattern-matching don’t pass: a single question folds two or three concepts into one chain. A model that does well on them is reasoning, not recalling.
What a lab training a reasoning model wants is straightforward: a model that can read a hard science or maths problem and work through to the answer. That means setting up the right equation, carrying a multi-step calculation without slipping, combining two or three concepts in one question, and landing on the single correct answer. This is where models fail. They know the individual facts, but on a multi-step problem they break somewhere in the chain and produce a clean, confident solution that reaches the wrong number. There is no partial credit; the answer is right or wrong. Here is the kind of problem we mean.
A block slides down a frictionless incline of height h and sticks to an identical block resting at the bottom. Find their speed just after the collision.
What it takes: energy conservation to find the speed at the foot of the incline, then momentum conservation for the inelastic collision. Two principles, in order, in one question.
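Written out, the chain for this problem is short. The derivation below is a sketch under assumptions the problem statement leaves implicit: each block has mass m, the first block starts from rest, g is the gravitational acceleration, v_0 is the speed at the foot of the incline and v is the speed just after the collision.

```latex
\begin{align*}
mgh &= \tfrac{1}{2} m v_0^2 &&\Rightarrow\quad v_0 = \sqrt{2gh} \\
m v_0 &= (m + m)\, v &&\Rightarrow\quad v = \frac{v_0}{2} = \sqrt{\frac{gh}{2}}
\end{align*}
```

Applying energy conservation across the collision as well would give v = \sqrt{gh}, which is wrong: kinetic energy is not conserved when the blocks stick together.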
What mass of oxygen is needed to completely combust 16 g of methane?
What it takes: balance CH4 + 2O2 → CO2 + 2H2O, convert grams to moles, apply the 1:2 ratio, convert back to grams. Mis-balance the first line and everything after it is wrong.
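The four steps map directly onto a few lines of arithmetic. This is a minimal sketch that assumes the rounded molar masses usually used in exam working (CH4 = 16 g/mol, O2 = 32 g/mol); the function name is illustrative.

```python
# Rounded molar masses in g/mol, as commonly used in exam arithmetic (an assumption here).
MOLAR_MASS = {"CH4": 16.0, "O2": 32.0}


def oxygen_mass_for_methane(methane_grams: float) -> float:
    """Mass of O2 in grams needed to fully combust methane via CH4 + 2 O2 -> CO2 + 2 H2O."""
    moles_ch4 = methane_grams / MOLAR_MASS["CH4"]  # grams -> moles
    moles_o2 = 2 * moles_ch4                       # 1:2 ratio from the balanced equation
    return moles_o2 * MOLAR_MASS["O2"]             # moles -> grams


print(oxygen_mass_for_methane(16.0))  # 64.0
```

With a mis-balanced equation (say a 1:1 ratio) the same code would return 32 g, which is the kind of confident wrong number described above.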
Find the local maximum value of f(x) = x³ − 3x.
What it takes: differentiate, solve f′(x) = 0 for the critical points, use the second derivative to tell a maximum from a minimum, then evaluate. Skip the test and you report the wrong point.
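The steps named above, worked through for this function:

```latex
\begin{align*}
f'(x) &= 3x^2 - 3 = 0 &&\Rightarrow\quad x = \pm 1 \\
f''(x) &= 6x, \qquad f''(-1) = -6 < 0 &&\Rightarrow\quad \text{local maximum at } x = -1 \\
f(-1) &= (-1)^3 - 3(-1) = 2
\end{align*}
```

The other critical point, x = 1, has f''(1) = 6 > 0 and is a local minimum with f(1) = -2. Reporting that value is the error the second-derivative test guards against.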
What makes these exams useful to us is that every question has one verified answer, with an expert worked solution behind it. That is what turns a hard test into training data, and it is where our method begins.
The approach
What we did, in four steps.
Clean the archive before training
128,832 raw questions. A keyword audit found 39.4% carried the wrong subject label — Biology was 73.7% mislabelled — and 12,001 were non-STEM. Cleaning left 116,831 correctly labelled questions. This step alone, with no change to the method, lifted the eventual gain from +2.5 to +6.3 points.
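The write-up does not say how the keyword audit was implemented, so the following is only a sketch of the general idea. The keyword lists, the row format (`question` and `subject` fields) and the function names are all hypothetical. Each question gets the subject whose keywords it mentions most, and questions that match no STEM keyword are dropped.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical keyword lists; the real audit's vocabulary is not given in the write-up.
SUBJECT_KEYWORDS: Dict[str, List[str]] = {
    "Physics": ["velocity", "momentum", "friction", "capacitor", "wavelength"],
    "Chemistry": ["mole", "oxidation", "titration", "isomer", "electrolysis"],
    "Mathematics": ["integral", "derivative", "matrix", "polynomial", "probability"],
    "Biology": ["enzyme", "mitosis", "chromosome", "photosynthesis", "hormone"],
}


def guess_subject(question: str) -> Optional[str]:
    """Return the subject with the most keyword hits (substring matches), or None if there are none."""
    text = question.lower()
    hits = Counter({subject: sum(text.count(word) for word in words)
                    for subject, words in SUBJECT_KEYWORDS.items()})
    subject, count = hits.most_common(1)[0]
    return subject if count > 0 else None


def audit(rows: List[dict]) -> dict:
    """Relabel rows by keyword guess; drop rows where no STEM keyword matched."""
    kept, dropped, relabelled = [], [], 0
    for row in rows:
        guess = guess_subject(row["question"])
        if guess is None:
            dropped.append(row)
            continue
        relabelled += guess != row["subject"]  # count rows whose original label disagreed
        kept.append({**row, "subject": guess})
    return {"kept": kept, "dropped": dropped,
            "mislabel_rate": relabelled / max(len(rows), 1)}


rows = [
    {"question": "Find the momentum of a block given its velocity.", "subject": "Biology"},
    {"question": "Which enzyme is active during mitosis?", "subject": "Biology"},
    {"question": "Who wrote the national anthem?", "subject": "Physics"},
]
result = audit(rows)
print(len(result["kept"]), len(result["dropped"]))
```

On the real archive, the equivalent bookkeeping is 128,832 raw questions minus 12,001 non-STEM ones, which gives the 116,831 used for training.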
A short pass, with general data mixed in
200 supervised fine-tuning steps on a 70/30 mix of domain and general data (SlimOrca), with NEFTune noise and a conservative LoRA setup (rank 8, attention-only). An aggressive setup (rank 32, all-linear targets) was what caused the 16.4-point collapse reported under Standard SFT in the results.
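How the 70/30 mix was assembled is not described, so the sketch below shows just one plausible way to build it: fix the number of domain examples at 70% of the total, fill the rest from the general set, and shuffle. The function name and the sampling-with-replacement choice are assumptions, and the LoRA and NEFTune settings are not shown.

```python
import random
from typing import List, Sequence


def mix_examples(domain: Sequence, general: Sequence, n_examples: int,
                 domain_share: float = 0.7, seed: int = 0) -> List:
    """Build a shuffled training list with a fixed share of domain examples."""
    rng = random.Random(seed)
    n_domain = round(n_examples * domain_share)
    # Sample with replacement so either pool can be smaller than its quota.
    picked = [rng.choice(domain) for _ in range(n_domain)]
    picked += [rng.choice(general) for _ in range(n_examples - n_domain)]
    rng.shuffle(picked)  # interleave so each batch sees both sources
    return picked


mixed = mix_examples(["jee_q1", "neet_q1"], ["slimorca_1"], n_examples=1000)
print(sum(x.startswith(("jee", "neet")) for x in mixed) / len(mixed))  # 0.7
```

Keeping general data in the mix is the part that matters here: it gives the short pass something to anchor general ability while the domain data is introduced.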
GRPO on correct answers
The model samples eight answers per question and is rewarded for reaching the verified-correct one, with smaller rewards for structure and reasoning. 600 steps on 10,000 multiple-choice questions.
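The source gives the shape of the reward (verified answer first, smaller terms for structure and reasoning) and the group size of eight, but not the weights or the output format. The sketch below fills those in with assumptions: a final line of the form `Answer: X`, illustrative weights of 1.0 and 0.1, and a crude word-count proxy for shown reasoning. The advantage step is the standard group-relative normalisation GRPO is named for, where each sample is scored against the mean and standard deviation of its own group.

```python
import re
from statistics import mean, pstdev
from typing import List


def reward(completion: str, correct_option: str) -> float:
    """Score one sampled answer. Weights are illustrative, not the values used in training."""
    score = 0.0
    match = re.search(r"Answer:\s*([A-D])\b", completion)
    if match:
        score += 0.1                  # small structure reward: a parseable final answer
        if match.group(1) == correct_option:
            score += 1.0              # main reward: matches the verified answer
    if len(completion.split()) >= 30:
        score += 0.1                  # small reasoning reward: some working is shown
    return score


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise each reward against the mean and std of its own group of samples."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight samples for one question, only the first reaching the verified answer "B".
samples = ["Answer: B"] + ["Answer: C"] * 7
advantages = group_advantages([reward(s, "B") for s in samples])
print([round(a, 2) for a in advantages])
```

When all eight samples score the same, every advantage is zero and the question contributes no gradient, so questions the model always or never gets right teach it nothing under this scheme.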
Held-out, plus public benchmarks
800 held-out MCQs, 200 per subject, released as NalandaBench. GSM8K, MMLU and ARC run alongside, to confirm general reasoning survived the specialisation.
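Scoring a held-out MCQ set like this comes down to extracting one option letter per output and counting matches per subject. The sketch below assumes the model ends with a line such as `Answer: B`. That format, the function names and the input tuples are illustrative, not the released evaluation harness.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def extract_option(output: str) -> Optional[str]:
    """Return the last 'Answer: X' option letter (A-D) in the output, or None."""
    letters = re.findall(r"Answer:\s*\(?([A-D])\)?", output)
    return letters[-1] if letters else None


def score(results: Iterable[Tuple[str, str, str]]) -> Tuple[Dict[str, float], float]:
    """results holds (subject, model_output, gold_option); returns accuracies in percent."""
    correct, total = defaultdict(int), defaultdict(int)
    for subject, output, gold in results:
        total[subject] += 1
        # An unparseable output counts as wrong: there is no partial credit.
        correct[subject] += extract_option(output) == gold
    per_subject = {s: 100 * correct[s] / total[s] for s in total}
    overall = 100 * sum(correct.values()) / sum(total.values())
    return per_subject, overall


per_subject, overall = score([
    ("Physics", "Energy, then momentum. Answer: B", "B"),
    ("Physics", "Answer: C", "B"),
    ("Biology", "I am not sure.", "A"),
    ("Biology", "Answer: (A)", "A"),
])
print(per_subject, overall)
```

With 200 questions in every subject, the overall accuracy is the same as the plain average of the four subject accuracies.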
The result
Standard fine-tuning made the model worse. Verified-reward RL improved it.
All figures are on the 800-question held-out set, against the Qwen 2.5 7B baseline. The verified-reward column is the single clean-data GRPO model — the one you can download.
| Subject | Baseline | Standard SFT | Verified-reward RL |
|---|---|---|---|
| Physics | 51.0 | 34.5 | 62.0 |
| Chemistry | 61.5 | 50.0 | 71.5 |
| Mathematics | 56.0 | 40.5 | 56.0 |
| Biology | 73.5 | 51.5 | 77.5 |
| Overall | 60.5 | 44.1 | 66.8 |
How to read this. One representative model per column. Standard SFT (the aggressive variant) degraded every subject — overall −16.4. The verified-reward RL column is the clean-data GRPO model, +6.3 overall, and it is the checkpoint you can download and reproduce. A separate subject-balanced variant lifts Physics to 65.0 (+14.0) by trading some Mathematics; we report that as a per-subject ceiling, not the headline, because the headline is the single reproducible model.
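The Overall row can be recomputed from the subject rows. The sketch below assumes the overall figure is the unweighted mean of the four subjects, which follows from each subject contributing 200 of the 800 questions; the variable and function names are illustrative.

```python
# Accuracy (%) per subject, copied from the table: (baseline, standard SFT, verified-reward RL).
SCORES = {
    "Physics":     (51.0, 34.5, 62.0),
    "Chemistry":   (61.5, 50.0, 71.5),
    "Mathematics": (56.0, 40.5, 56.0),
    "Biology":     (73.5, 51.5, 77.5),
}
COLUMNS = ("Baseline", "Standard SFT", "Verified-reward RL")


def overall(column: int) -> float:
    """Unweighted mean over subjects; valid because every subject has the same question count."""
    return sum(row[column] for row in SCORES.values()) / len(SCORES)


def rl_gain_by_subject() -> dict:
    """Verified-reward RL minus baseline, per subject, in points."""
    return {subject: row[2] - row[0] for subject, row in SCORES.items()}


for i, name in enumerate(COLUMNS):
    print(f"{name}: {overall(i)}")
print(rl_gain_by_subject())
```

The unrounded means are 60.5, 44.125 and 66.75, which round to the table's 60.5, 44.1 and 66.8. The per-subject gains are +11.0 for Physics, +10.0 for Chemistry, 0.0 for Mathematics and +4.0 for Biology.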
General capability
What it did to general reasoning.
Mostly preserved or improved — with two regressions we report in full rather than hide.
| Benchmark | Baseline | RL model | Δ |
|---|---|---|---|
| GSM8K (math reasoning) | 94.7 | 96.0 | +1.3 |
| ARC-Challenge (science) | 90.0 | 90.0 | 0.0 |
| MMLU-Physics | 81.1 | 83.8 | +2.7 |
| MMLU-Chemistry | 62.0 | 68.0 | +6.0 |
| MMLU-Mathematics | 63.5 | 53.8 | −9.7 |
| MMLU-Biology | 88.5 | 82.0 | −6.5 |
Honest read. General reasoning held or improved on four of six. The two regressions — MMLU-Mathematics and MMLU-Biology — come from the subject-balanced variant, where pushing Physics trades against other subjects. We show the whole table.
The outcome
The model now reasons through problems it used to get wrong.
Before fine-tuning, the model failed in specific, recognisable ways. Each is a place where a chain of reasoning breaks:
- Wrong principle, wrong from step one. It reaches for the wrong law to begin with (kinematics where energy conservation is needed), and nothing downstream can recover.
- A dropped step. It skips a link in a multi-step derivation and lands on an answer that looks complete but isn’t.
- An arithmetic or algebra slip. One manipulation error mid-chain carries straight through to the wrong final number.
- Elimination instead of solving. It picks the most plausible-looking option rather than working the problem, which holds up on easy items but collapses on multi-concept ones.
- Solution-shaped, still wrong. It produces a fluent, well-structured derivation that still lands on the wrong answer.
Verified-reward RL targets every one of these directly, because its main reward is for one thing: reaching the correct answer. On the held-out set the model improved by 6.3 points overall, and the gains are largest where the reasoning is most multi-step and calculation-heavy: physics up 11 points and chemistry up 10, with biology up 4 and mathematics unchanged at 56.0. The same base model that used to produce confident wrong answers now works through more of these problems and lands on the right one.
The model is public, with a live demo to put your own problem to it. Because the gain came from using verified answers as the reward, the same approach should carry to any domain where the answers can be checked: a curriculum, a body of case law, a set of internal manuals.
Reproduce it
Run the model. Or try it in your browser.
The model and the held-out set are public, so the result can be checked on your own inputs.
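```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Nalandadata/nalanda-qwen-7b-grpo"
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")
tok = AutoTokenizer.from_pretrained(repo)

# Replace the <JEE/NEET MCQ> placeholder with the full question and its four options.
msg = [{"role": "user", "content": "<JEE/NEET MCQ> Think step by step, then answer A-D."}]
inputs = tok.apply_chat_template(msg, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```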
The released checkpoint is the clean-data GRPO model — +6.3 overall on the held-out set, the number on this page.
Questions
The detail behind the result.
Why did standard fine-tuning fail?
What was the single biggest lever?
Which number does the downloadable model give?
Did specialising hurt general ability?
Can we get the data?
Related research