Training

Pre-training

The first and most expensive phase of building a model: learning language and world knowledge by predicting the next token over a corpus of trillions of tokens.

01 ——

In plain English

Pre-training is the foundational stage where a base model learns to predict the next token across an enormous corpus of text, code, and (increasingly) images, audio, and video. It's where the bulk of a model's knowledge comes from — and where most of the cost goes.
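As a rough illustration of what "predicting the next token" means as a training objective, the sketch below scores a sequence by the average cross-entropy of each token given the tokens before it. The names next_token_loss and uniform_model are hypothetical, and the "model" is a stand-in that guesses uniformly over a four-token vocabulary. A real model would be a neural network whose weights are adjusted to push this number down.

```python
import math

def next_token_loss(token_ids, predict_probs):
    """Average cross-entropy of predicting each token from the tokens before it."""
    total = 0.0
    for t in range(1, len(token_ids)):
        context = token_ids[:t]               # everything seen so far
        probs = predict_probs(context)        # one probability per vocabulary id
        total += -math.log(probs[token_ids[t]])  # penalty for the true next token
    return total / (len(token_ids) - 1)

def uniform_model(context, vocab_size=4):
    """Hypothetical stand-in for a model: ignores context, guesses uniformly."""
    return [1.0 / vocab_size] * vocab_size

# Toy sequence of token ids; an untrained guesser scores log(vocab_size).
loss = next_token_loss([0, 2, 1, 3], uniform_model)
print(f"loss = {loss:.4f}")
```

Pre-training amounts to lowering this loss over the whole corpus: the better the model's probabilities match the text that actually follows, the lower the number.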

Scale of pre-training:

  • Data — 10–30+ trillion tokens for frontier models
  • Compute — thousands of GPUs running for weeks to months
  • Cost — tens to hundreds of millions of dollars per training run
  • Output — a "base" model that's fluent but not yet aligned to be helpful
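To see how figures like these fit together, here is a back-of-the-envelope sketch. It assumes the commonly used rule of thumb that training a dense transformer takes about 6 floating-point operations per parameter per token. Every input below (model size, token count, GPU count, peak throughput, utilization) is a hypothetical round number chosen for illustration, not a figure for any real model or chip.

```python
def training_days(params, tokens, num_gpus, peak_flops_per_gpu, utilization):
    """Estimate wall-clock days for one training run (rough approximation)."""
    total_flops = 6 * params * tokens  # rule of thumb: ~6 FLOPs per parameter per token
    sustained_flops_per_sec = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_sec / 86_400  # 86,400 seconds per day

# All values are illustrative assumptions.
days = training_days(
    params=70e9,              # hypothetical 70B-parameter model
    tokens=15e12,             # hypothetical 15 trillion training tokens
    num_gpus=4096,            # hypothetical cluster size
    peak_flops_per_gpu=1e15,  # round placeholder, not a spec for any chip
    utilization=0.4,          # assumed fraction of peak actually achieved
)
print(f"about {days:.0f} days")
```

Under these assumptions the run takes around six weeks, which is consistent with the "weeks to months" range above. Larger models, more tokens, or lower utilization stretch the estimate quickly.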

What goes into the data mix: Web crawl (filtered), books, Wikipedia, academic papers, code (GitHub, Stack Overflow), curated educational content, and increasingly synthetic data. The exact mix is one of the most closely guarded competitive secrets at frontier labs.
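One simple way to picture a data mix is as a set of sampling weights over sources. The sketch below draws training documents according to such weights. The source names follow the list above, but the weights and the helper sample_sources are invented for illustration, since real mixes are not public.

```python
import random

# Hypothetical weights for illustration only; they sum to 1.
MIX = {
    "web": 0.60,
    "code": 0.15,
    "books": 0.10,
    "papers": 0.08,
    "wikipedia": 0.04,
    "synthetic": 0.03,
}

def sample_sources(mix, n, seed=0):
    """Pick the source of each of n training documents according to the mix."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mix)
    return rng.choices(names, weights=[mix[s] for s in names], k=n)

batch = sample_sources(MIX, 10_000)
counts = {source: batch.count(source) for source in MIX}
print(counts)
```

Changing a weight changes how often the model sees that kind of text, which is why tuning the mix is treated as a serious design decision.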

Why it matters: Pre-training sets the ceiling. Post-training (fine-tuning, RLHF) shapes the model's behaviour but rarely adds new knowledge. If a fact wasn't in the pre-training data, the model probably doesn't know it unless that fact is supplied in the prompt at inference time.

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
