Verified reasoning data that frontier models can’t synthesize.

Human-authored. Expert-verified. Reproducible. Difficulty-graded reasoning data with verified correct answers — the signal that makes a model measurably better, and that scraped or synthetic data can’t provide. Built for frontier labs, foundation-model teams and sovereign AI programmes.

Reasoning & CoT · RLVR reward data · Multimodal & vision-language · Indic & sovereign AI · See the proof →

Download a sample · Talk to a researcher · Or browse our published models & dataset samples on Hugging Face →

84.9%

TEDS · a public 8B model beats every cloud model we tested

+6.3 pts

JEE & NEET reasoning · from verified-reward RL

+9.3 pts

Science diagrams · a vision-language fine-tune

Latest
New · DrishtiTable: 8B model tops every cloud model on table extraction
New · Nalanda Image VL: a vision model that reads science diagrams
HF · Models & dataset samples on Hugging Face

Reproducibility

Reproduce every result.

Our models, sample datasets and held-out numbers are public. Download them, run them on your own inputs, and reproduce what we report.

Published models

Fine-tuned weights, public

The fine-tuned models behind our results are released openly, so you can run them on your own inputs.

View the fine-tuned models →

Sample datasets

Real data, before you commit

A representative sample of each dataset is available to download, mirroring the full structure and quality.

Download a sample →

Held-out numbers

Methods and metrics, in full

Every case study shows the held-out test set, the metric, and the exact comparison conditions.

Read the research →

No install · Try the GRPO-trained JEE/NEET solver live in your browser →

The fundamental difference

Data built to reason, not scraped to exist.

Synthetic data amplifies what a model already knows. Scraped web data is unverified and structurally incoherent. Neither gives a model the difficulty-graded, multi-step reasoning signal it needs, and neither carries the verified correct answers that reinforcement learning with verifiable rewards (RLVR) depends on.

Our human-authored worked solutions contain the expert reasoning explicitly: the thinking that produced the answer, not just the answer. The output is structured, pipeline-ready training data.
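To make the role of verified answers concrete, here is a minimal sketch of the kind of reward check that verifiable-reward training relies on. It is an illustration under stated assumptions, not the reward used in the experiments on this page: the `Answer:` output convention, the function names, and the normalisation rules are all hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional

# Assumption: the model is prompted to end its solution with "Answer: <value>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last 'Answer: ...' value in the completion, if any."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def normalize(answer: str) -> str:
    """Canonicalise an answer so that '0.5' and '1/2' compare equal."""
    text = answer.strip().rstrip(".").replace(" ", "").lower()
    try:
        return str(Fraction(text))  # numeric answers become exact fractions
    except (ValueError, ZeroDivisionError):
        return text  # non-numeric answers (e.g. option letters) stay as text


def verifiable_reward(completion: str, verified_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches the verified one."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(verified_answer) else 0.0
```

A reward like this is only as trustworthy as the reference answer it compares against, which is why the verified answers matter more than the checking code.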

Source: S Chand Group archive · 11,000+ expert-authored titles · rights-cleared · zero scraping
Provenance: full copyright lineage with every dataset
Quality: difficulty graded by 2,000+ subject-matter experts
Contamination: deduplicated and checked against common benchmarks
Stages: SFT → RLVR → CoT → HITL → Multimodal
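
The deduplication and benchmark-contamination checks listed above can be approximated with token n-gram overlap. The sketch below shows one common approach, not necessarily the pipeline used for these datasets; the window size, tokenisation, and function names are assumptions chosen for illustration.

```python
import re
from typing import Iterable, List, Set, Tuple

NGRAM_SIZE = 8  # assumed window; real pipelines tune this per benchmark


def tokenize(text: str) -> List[str]:
    """Lowercase and split into words and single punctuation/operator tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[Tuple[str, ...]]:
    """All contiguous token n-grams in the text (empty if the text is too short)."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(items: Iterable[str], n: int = NGRAM_SIZE) -> Set[Tuple[str, ...]]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[Tuple[str, ...]] = set()
    for item in items:
        index |= ngrams(item, n)
    return index


def is_contaminated(example: str, index: Set[Tuple[str, ...]], n: int = NGRAM_SIZE) -> bool:
    """Flag an example that shares at least one n-gram with the benchmark index."""
    return not ngrams(example, n).isdisjoint(index)


def deduplicate(examples: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each example, comparing token sequences."""
    seen: Set[Tuple[str, ...]] = set()
    unique: List[str] = []
    for example in examples:
        key = tuple(tokenize(example))
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique
```

Exact n-gram matching misses paraphrased benchmark items, so a check like this is a floor on contamination detection rather than a guarantee.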

Capability | Web | Us
Clean IP, rights-cleared | × | ✓
Verified correct answers | × | ✓
Multi-step reasoning chains | × | ✓
Difficulty stratification | × | ✓
RLVR-ready reward signal | × | ✓
Authentic Indic diversity | × | ✓
Reproducible, public artifacts | × | ✓

The scale behind it

2B+

Tokens of human reasoning data

12M+

Verified Q&A instruction pairs

28M+

Multimodal image-text pairs

Languages · 8 Indic scripts

Research

The experiments behind the claims.

Each experiment shows what verified, curriculum-grade data does to a real model — method, held-out numbers, and a public mirror on Hugging Face.

View all research →

84.9% TEDS

DrishtiTable

Table structure recognition

A fine-tuned 8B model tops every frontier cloud model on table extraction.

Read the case study →
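
For readers unfamiliar with the metric: assuming the figure above uses the usual definition of TEDS (Tree-Edit-Distance-based Similarity), each table is represented as a tree of its HTML structure and cell contents, and the score compares a predicted tree $T_a$ with the ground-truth tree $T_b$:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|,\, |T_b|\right)}
```

Here $\mathrm{EditDist}$ is the tree edit distance and $|T|$ is the number of nodes in $T$. A score of 1 means the predicted table matches the reference exactly, so 84.9% would correspond to an average score of 0.849 over the test set.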

+6.3 pts

NalandaBench

STEM reasoning · RLVR (GRPO)

Plain fine-tuning lost 16 points on the same data; verified-reward RL turned it into a 6.3-point gain.

Read the case study →
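
As background on the GRPO step named above: GRPO samples several completions per prompt, scores each one, and normalises the rewards within that group instead of training a separate value model. The sketch below shows only that normalisation, under the common formulation; the function name and `eps` value are illustrative, and the case study's actual training configuration is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalise rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is used here; implementations differ
    on population versus sample standard deviation.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled solutions to one problem, two with the correct answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason difficulty-graded problems matter for this kind of training.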

+9.3 pts

Nalanda Image VL

Multimodal science · vision-language

Fine-tuning a vision model on diagram data lifted held-out accuracy by 9.3 points.

Read the case study →

Who we build for

Built for every team training the next model.

Frontier & foundation-model labs

Post-training that web data can’t support

Verified-answer reasoning corpora for RLVR, SFT instruction pairs, and multimodal data, at a quality synthetic generation cannot reach.

RLVR reasoning · JEE/NEET CoT · Academic QA SFT

Enterprise AI teams

Domain fine-tuning for production

Domain corpora with the conceptual depth and structured reasoning chains production models require, grounded in verified expert knowledge.

Commerce · Engineering · Economics

Government & sovereign AI

Bharat-first AI infrastructure

Curriculum-verified, Indic-script data across 8 scripts, aligned to national AI infrastructure goals.

Indic multilingual · History & civics · Indic benchmarks

Dataset catalogue

Every training stage, one source of truth.

Reasoning + CoT

India-STEM Reasoning Corpus

Mathematics and science reasoning, foundational to advanced, with explicit step-by-step worked solutions and verified answers.

Size: 450M tokens
Language: English, Hindi
Difficulty: Easy → Advanced

Download sample →

Pretraining

Indic Multilingual Education Corpus

Structured corpora across 8 Indic languages, curriculum-graded and expert-authored.

Size: 520M tokens
Language: 8 Indic scripts
Format: Raw text corpus

Download sample →

Multimodal

Scientific Diagram QA

Labelled diagrams with grounded question-answer pairs for vision-language training and evaluation.

Size: 28M+ pairs
Type: Image + text
Domains: STEM

Download sample →

Stay close to the work

New experiments and dataset releases, as they ship.

A low-volume research update for ML teams — new results, new datasets, and what we are learning. No marketing.

Two ways in

Start with a sample. Or start a conversation.

Most teams begin by trying a sample on their own inputs. When you are ready, you talk to the people who built the data, not a sales desk.

Fastest · self-serve

Download a sample

Tell us where to send it and what you are working on. The sample arrives by email.

Higher intent

Talk to a researcher

For scoping a licence or a custom dataset. A few more details so we can come prepared.