Case study · Science & maths reasoning

Teaching a small model to reason through advanced science and maths problems.

We wanted a small, cheap model (Qwen 2.5 7B) to solve advanced science and maths problems, the kind that take several steps of reasoning rather than a single recalled fact. Our material was 116,831 expert solutions from JEE and NEET, India’s toughest university-entrance exams, each with a verified correct answer. With standard supervised fine-tuning on those solutions, the model got worse, losing 16.4 points on the held-out set. So we used the verified answers as a reward instead: scoring the model on whether it reached the right answer, which pushed it to reason its way there rather than memorise. That lifted it 6.3 points, from 60.5% to 66.8%, on a held-out set of 800 questions. The model, the test set and a sample of the data are public.

Method

Verified-reward RL (GRPO)

Domain

Science & maths reasoning · JEE/NEET

Base model

Qwen 2.5 7B Instruct

Evaluation

800 held-out MCQs (NalandaBench)

−16.4 pts

What standard supervised fine-tuning did to overall accuracy on the held-out set. The same data, the wrong method.

+6.3 pts

The two-stage verified-reward model — one downloadable checkpoint, 60.5% → 66.8% on the held-out set.

2.5×

The lift from correcting mislabelled data alone, with no change to the method.

Reproducibility

Everything needed to check the result is public.

The fine-tuned model and a data sample are public, including the held-out set the +6.3 is measured on.

Open the model on Hugging Face → Download a data sample →

The problem

What it takes to solve one of these problems.

Maths and science are among the clearest tests of whether a model can reason. Every problem has one correct answer, reached through a chain of steps, and the model either gets it or doesn’t. That makes a hard exam a clean way to measure what a model can really do, and what it takes to improve it.

We chose JEE and NEET because they are about as hard as competitive reasoning gets. JEE is the entrance exam for the IITs, and JEE Advanced is widely regarded as one of the toughest exams in the world. Around 1.5 million students sit JEE every year; only the top 250,000 even qualify to attempt the Advanced paper, about 54,000 clear it, and the IITs have roughly 18,000 seats. Of everyone who sits the exam, only about one in eighty ends up in an IIT. NEET, the gateway to medical school, draws over two million more each year. These exams are built so that memorisation and pattern-matching don’t pass: a single question folds two or three concepts into one chain. A model that does well on them is reasoning, not recalling.

What a lab training a reasoning model wants is straightforward: a model that can read a hard science or maths problem and work through to the answer. That means setting up the right equation, carrying a multi-step calculation without slipping, combining two or three concepts in one question, and landing on the single correct answer. This is where models fail. They know the individual facts, but on a multi-step problem they break somewhere in the chain and produce a clean, confident solution that reaches the wrong number. There is no partial credit; the answer is right or wrong. Here is the kind of problem we mean.

Physics · mechanics

A block slides down a frictionless incline of height h and sticks to an identical block resting at the bottom. Find their speed just after the collision.

What it takes: energy conservation to find the speed at the foot of the incline, then momentum conservation for the inelastic collision. Two principles, in order, in one question.
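Worked through, the two steps look like this. The symbols are introduced here for illustration: m is the mass of each block, g the gravitational acceleration, v the speed at the foot of the incline and v′ the speed just after the collision. The first block is assumed to start from rest.

```latex
\begin{align*}
mgh &= \tfrac{1}{2} m v^{2} \;\Rightarrow\; v = \sqrt{2gh} \\
m v &= (m + m)\, v' \;\Rightarrow\; v' = \frac{v}{2} = \sqrt{\frac{gh}{2}}
\end{align*}
```

Applying momentum conservation before energy conservation, or assuming kinetic energy survives the collision, gives a different and wrong speed.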

Chemistry · stoichiometry

What mass of oxygen is needed to completely combust 16 g of methane?

What it takes: balance CH4 + 2O2 → CO2 + 2H2O, convert grams to moles, apply the 1:2 ratio, convert back to grams. Mis-balance the first line and everything after it is wrong.
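The same chain can be written as a few lines of arithmetic. This is a minimal sketch that assumes rounded atomic masses (C = 12, H = 1, O = 16); the function name is made up for illustration.

```python
# Approximate molar masses in g/mol, using rounded atomic masses (an assumption).
M_CH4 = 12 + 4 * 1   # 16 g/mol
M_O2 = 2 * 16        # 32 g/mol


def oxygen_mass_for_methane(methane_grams: float) -> float:
    """Mass of O2 in grams needed to completely combust the given mass of CH4."""
    moles_ch4 = methane_grams / M_CH4   # grams -> moles
    moles_o2 = 2 * moles_ch4            # CH4 + 2 O2 -> CO2 + 2 H2O, so a 1:2 ratio
    return moles_o2 * M_O2              # moles -> grams


print(oxygen_mass_for_methane(16))  # 64.0
```

With the equation mis-balanced as 1:1, the same code would return 32 g, which is the kind of clean but wrong answer described above.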

Mathematics · calculus

Find the local maximum value of f(x) = x³ − 3x.

What it takes: differentiate, solve f′(x) = 0 for the critical points, use the second derivative to tell a maximum from a minimum, then evaluate. Skip the test and you report the wrong point.
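Written out, the steps for this function are:

```latex
\begin{align*}
f'(x) &= 3x^{2} - 3 = 0 \;\Rightarrow\; x = \pm 1 \\
f''(x) &= 6x, \qquad f''(-1) = -6 < 0 \ (\text{maximum}), \qquad f''(1) = 6 > 0 \ (\text{minimum}) \\
f(-1) &= (-1)^{3} - 3(-1) = 2
\end{align*}
```

The local maximum value is 2, at x = −1. Evaluating at x = 1 instead gives −2, the local minimum.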

What makes these exams useful to us is that every question has one verified answer, with an expert worked solution behind it. That is what turns a hard test into training data, and it is where our method begins.

The approach

What we did, in four steps.

01 · Audit

Clean the archive before training

128,832 raw questions. A keyword audit found 39.4% carried the wrong subject label — Biology was 73.7% mislabelled — and 12,001 were non-STEM. Cleaning left 116,831 correctly labelled questions. This step alone, with no change to the method, lifted the eventual gain from +2.5 to +6.3 points.
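A keyword audit of this kind can be sketched as follows. The keyword lists, field names and the rule that a question with no keyword hit counts as non-STEM are all assumptions made for illustration; the write-up does not describe the actual audit rules.

```python
from collections import Counter
from typing import Optional

# Hypothetical keyword lists; the real audit vocabulary is not given in the write-up.
KEYWORDS = {
    "Physics": ["velocity", "friction", "momentum", "capacitor"],
    "Chemistry": ["mole", "oxidation", "titration", "isomer"],
    "Mathematics": ["integral", "derivative", "matrix", "polynomial"],
    "Biology": ["enzyme", "mitochondria", "chromosome", "photosynthesis"],
}


def guess_subject(question: str) -> Optional[str]:
    """Return the subject with the most keyword hits, or None if nothing matches."""
    text = question.lower()
    hits = Counter({s: sum(k in text for k in kws) for s, kws in KEYWORDS.items()})
    subject, count = hits.most_common(1)[0]
    return subject if count > 0 else None


def audit(rows: list) -> dict:
    """Count rows with no STEM keyword, and rows whose label disagrees with the guess."""
    guesses = [(r, guess_subject(r["question"])) for r in rows]
    non_stem = sum(1 for _, g in guesses if g is None)
    mislabelled = sum(1 for r, g in guesses if g is not None and g != r["label"])
    return {"total": len(rows), "non_stem": non_stem, "mislabelled": mislabelled}


rows = [
    {"question": "Which enzyme breaks down starch?", "label": "Chemistry"},
    {"question": "Find the derivative of x^2.", "label": "Mathematics"},
    {"question": "Who wrote the national anthem?", "label": "Biology"},
]
print(audit(rows))  # {'total': 3, 'non_stem': 1, 'mislabelled': 1}

# The counts reported above are consistent: raw minus non-STEM equals the cleaned set.
assert 128_832 - 12_001 == 116_831
```

A production audit would need richer rules than substring matching, but the shape is the same: guess a subject from the text, compare it with the stored label, and count the disagreements.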

02 · Light SFT

A short pass, with general data mixed in

200 steps on a 70/30 mix of domain and general data (SlimOrca), with NEFTune noise and a conservative LoRA setup (rank 8, attention-only). An aggressive setup — rank 32, all-linear targets — was what caused the 16-point collapse.
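The 70/30 mix can be pictured as a sampler that draws each training example from the domain pool with probability 0.7 and from the general pool otherwise. This is a sketch under stated assumptions, not the training code: the class and function names, the batch size and the seed are hypothetical, and values the write-up does not give (LoRA alpha, NEFTune noise level, learning rate) are deliberately left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LightSFTConfig:
    """Only values stated in the write-up; unstated hyperparameters are omitted."""
    steps: int = 200
    domain_fraction: float = 0.7          # JEE/NEET solutions; the rest is general data
    lora_rank: int = 8
    lora_targets: str = "attention-only"
    neftune: bool = True


def mixed_batches(domain, general, cfg, batch_size=4, seed=0):
    """Yield cfg.steps batches, each example drawn from the domain pool with
    probability cfg.domain_fraction and from the general pool otherwise."""
    rng = random.Random(seed)
    for _ in range(cfg.steps):
        yield [
            rng.choice(domain) if rng.random() < cfg.domain_fraction else rng.choice(general)
            for _ in range(batch_size)
        ]


cfg = LightSFTConfig()
batches = list(mixed_batches(["domain"], ["general"], cfg))
share = sum(x == "domain" for b in batches for x in b) / (cfg.steps * 4)
print(len(batches), round(share, 2))
```

Sampling per example rather than alternating whole batches keeps general data present at every step, which is the point of the mix.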

03 · Verified-reward RL

GRPO on correct answers

The model samples eight answers per question and is rewarded for reaching the verified-correct one, with smaller rewards for structure and reasoning. 600 steps on 10,000 multiple-choice questions.
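The reward and the group-relative scoring that GRPO uses can be sketched as below. The weights, the "Answer: X" output format and the word-count proxy for reasoning are assumptions chosen for illustration; the write-up says only that the correct answer earns the main reward and that structure and reasoning earn smaller ones.

```python
import re
from statistics import mean, pstdev

# Hypothetical weights: correctness dominates, the other two are small bonuses.
W_CORRECT, W_FORMAT, W_REASONING = 1.0, 0.2, 0.2


def reward(completion: str, gold: str) -> float:
    """Score one sampled completion against the verified answer letter."""
    match = re.search(r"Answer:\s*([A-D])\b", completion)
    has_format = match is not None
    correct = has_format and match.group(1) == gold
    has_reasoning = len(completion.split()) >= 20   # crude proxy for a worked solution
    return W_CORRECT * correct + W_FORMAT * has_format + W_REASONING * has_reasoning


def group_advantages(rewards: list) -> list:
    """GRPO-style advantage: each reward relative to its group's mean and spread."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


# Eight samples for one question whose verified answer is B: one correct, seven not.
samples = ["Answer: B"] + ["Answer: C"] * 7
rewards = [reward(s, "B") for s in samples]
print([round(a, 2) for a in group_advantages(rewards)])
```

Because advantages are computed within the group of eight, a question where every sample is right or every sample is wrong produces no learning signal. Only questions where the samples disagree move the model.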

04 · Evaluate

Held-out, plus public benchmarks

800 held-out MCQs, 200 per subject, released as NalandaBench. GSM8K, MMLU and ARC run alongside, to confirm general reasoning survived the specialisation.
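Scoring a multiple-choice set like this reduces to exact-match accuracy per subject. A minimal sketch, assuming each record already holds the model's chosen letter and the verified letter; the field names are hypothetical and not taken from the released evaluation set.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """records: dicts with 'subject', 'predicted' and 'gold' answer letters (A-D).
    Returns (per-subject accuracy in %, overall accuracy in %)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["subject"]] += 1
        correct[r["subject"]] += r["predicted"] == r["gold"]
    per_subject = {s: 100 * correct[s] / total[s] for s in total}
    overall = 100 * sum(correct.values()) / sum(total.values())
    return per_subject, overall


records = [
    {"subject": "Physics", "predicted": "A", "gold": "A"},
    {"subject": "Physics", "predicted": "B", "gold": "C"},
    {"subject": "Biology", "predicted": "D", "gold": "D"},
]
per_subject, overall = accuracy_by_subject(records)
print(per_subject, round(overall, 1))
```

With 200 questions per subject, the overall figure is the plain average of the four subject accuracies.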

The result

Standard fine-tuning made the model worse. Verified-reward RL improved it.

All figures are on the 800-question held-out set, against the Qwen 2.5 7B baseline. The verified-reward column is the single clean-data GRPO model — the one you can download.

Held-out evaluation set (800 MCQs) · accuracy %, higher is better
Subject	Baseline	Standard SFT	Verified-reward RL
Physics	51.0	34.5	62.0
Chemistry	61.5	50.0	71.5
Mathematics	56.0	40.5	56.0
Biology	73.5	51.5	77.5
Overall	60.5	44.1	66.8

How to read this. One representative model per column. Standard SFT (the aggressive variant) degraded every subject — overall −16.4. The verified-reward RL column is the clean-data GRPO model, +6.3 overall, and it is the checkpoint you can download and reproduce. A separate subject-balanced variant lifts Physics to 65.0 (+14.0) by trading some Mathematics; we report that as a per-subject ceiling, not the headline, because the headline is the single reproducible model.
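Because each subject contributes 200 questions, the overall row is the unweighted mean of the four subject rows. The short check below recomputes it from the per-subject figures in the table; variable names are illustrative.

```python
# Per-subject accuracy (%) from the table: (baseline, standard SFT, verified-reward RL).
TABLE = {
    "Physics": (51.0, 34.5, 62.0),
    "Chemistry": (61.5, 50.0, 71.5),
    "Mathematics": (56.0, 40.5, 56.0),
    "Biology": (73.5, 51.5, 77.5),
}


def column_mean(i: int) -> float:
    """Unweighted mean of column i across the four subjects."""
    return sum(row[i] for row in TABLE.values()) / len(TABLE)


baseline, sft, rl = (column_mean(i) for i in range(3))
print(baseline, sft, rl)  # 60.5 44.125 66.75
```

The table reports these to one decimal place (60.5, 44.1, 66.8), and the headline deltas of −16.4 and +6.3 are differences of the rounded figures.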

General capability

What it did to general reasoning.

Mostly preserved or improved — with two regressions we report in full rather than hide.

Public benchmarks · baseline vs the verified-reward model · accuracy %
Benchmark	Baseline	RL model	Δ
GSM8K (math reasoning)	94.7	96.0	+1.3
ARC-Challenge (science)	90.0	90.0	0.0
MMLU-Physics	81.1	83.8	+2.7
MMLU-Chemistry	62.0	68.0	+6.0
MMLU-Mathematics	63.5	53.8	−9.7
MMLU-Biology	88.5	82.0	−6.5

Honest read. General reasoning held or improved on four of six. The two regressions — MMLU-Mathematics and MMLU-Biology — come from the subject-balanced variant, where pushing Physics trades against other subjects. We show the whole table.

The outcome

The model now reasons through problems it used to get wrong.

Before fine-tuning, the model failed in specific, recognisable ways. Each is a place where a chain of reasoning breaks:

Wrong principle, wrong from step one. It reaches for the wrong law to begin with (kinematics where energy conservation is needed), and nothing downstream can recover.
A dropped step. It skips a link in a multi-step derivation and lands on an answer that looks complete but isn’t.
An arithmetic or algebra slip. One manipulation error mid-chain carries straight through to the wrong final number.
Elimination instead of solving. It picks the most plausible-looking option rather than working the problem, which holds up on easy items but collapses on multi-concept ones.
Solution-shaped, still wrong. It produces a fluent, well-structured derivation that still lands on the wrong answer.

Verified-reward RL targets every one of these directly, because its main reward is for one thing: reaching the correct answer. On the held-out set the model improved by 6.3 points overall, and the gains are largest in the most multi-step, calculation-heavy subjects: physics up 11 points and chemistry up 10, with biology up 4 and mathematics unchanged at 56.0. The same base model that used to produce confident wrong answers now works through more of these problems and lands on the right one.

The model is public, with a live demo to put your own problem to it. Because the gain came from verified answers and the right method, the same approach carries to any domain where the answers can be checked — a curriculum, a body of case law, a set of internal manuals.

Reproduce it

Run the model. Or try it in your browser.

The model and the held-out set are public, so the result can be checked on your own inputs.

Model · GRPO checkpoint: Open on Hugging Face →
NalandaBench · 800 held-out MCQs: Open the evaluation set →

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Nalandadata/nalanda-qwen-7b-grpo"
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")
tok = AutoTokenizer.from_pretrained(repo)

# Replace the placeholder with the full question text and its four options.
msg = [{"role": "user", "content": "<JEE/NEET MCQ> Think step by step, then answer A-D."}]
inputs = tok.apply_chat_template(msg, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The released checkpoint is the clean-data GRPO model — +6.3 overall on the held-out set, the number on this page.

Questions

The detail behind the result.

Why did standard fine-tuning fail?

Catastrophic forgetting. An aggressive LoRA setup (rank 32, all-linear targets) trained on 100% domain data overwrote the model’s general reasoning, dropping held-out accuracy 16.4 points below baseline. Three changes avoided it: a conservative LoRA setup, a 30% general-data mix, and rewarding correct answers with RL rather than imitating solution text.

What was the single biggest lever?

Data quality. Auditing and correcting the mislabelled subject tags — with no change to the model, hyperparameters or compute — more than doubled the gain, from +2.5 to +6.3 points. Biology, the most contaminated subject at 73.7%, swung from a regression to a genuine improvement.

Which number does the downloadable model give?

+6.3 overall (60.5% → 66.8%) on the held-out set, against the Qwen 2.5 7B baseline. That is the clean-data GRPO model. A subject-balanced variant reaches +14 in Physics by trading some Mathematics; we headline the single reproducible model because it is the one you can verify.

Did specialising hurt general ability?

Mostly no. GSM8K, MMLU-Physics and MMLU-Chemistry improved and ARC held flat. Two benchmarks regressed under the subject-balanced variant, MMLU-Mathematics and MMLU-Biology. The full table is on this page.

Can we get the data?

The 800-question NalandaBench evaluation set is public, and a dataset sample is available. The full verified-solution corpus is licensed — download a sample or talk to a researcher to scope a slice for your domain.

Try it

Want verified-reward results on your domain?

Download a sample Talk to a researcher
