Training

Pre-training

The first and most expensive phase of building a model: learning language and world knowledge by predicting the next token across trillions of tokens of text.

01 ——

In plain English

Pre-training is the foundational stage where a base model learns to predict the next token across an enormous corpus of text, code, and (increasingly) images, audio, and video. It's where the bulk of a model's knowledge comes from — and where most of the cost goes.
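
To make "predict the next token" concrete, here is a minimal sketch of the quantity that pre-training drives down: the average cross-entropy between the model's predicted distribution and the token that actually came next. The names next_token_loss and uniform_model and the four-token vocabulary are hypothetical, chosen for illustration only; a real model is a neural network trained over batches, not a Python function.

```python
import math

def next_token_loss(token_ids, predict_probs):
    """Average cross-entropy (in nats) over every next-token prediction."""
    total = 0.0
    for i in range(1, len(token_ids)):
        # The model sees only the prefix and returns one probability per vocabulary entry.
        probs = predict_probs(token_ids[:i])
        # Penalize it by -log of the probability it assigned to the true next token.
        total -= math.log(probs[token_ids[i]])
    return total / (len(token_ids) - 1)

VOCAB_SIZE = 4  # hypothetical toy vocabulary

def uniform_model(prefix):
    """Hypothetical stand-in for a network: knows nothing, so it guesses uniformly."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

loss = next_token_loss([0, 2, 1, 3], uniform_model)
print(f"loss = {loss:.3f} nats")
```

A model that guesses uniformly over four tokens scores log(4), about 1.386 nats. Training adjusts the model's weights so that this number falls across the whole corpus.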

Scale of pre-training:

Data — 10–30+ trillion tokens for frontier models
Compute — thousands of GPUs running for weeks to months
Cost — tens to hundreds of millions of dollars per training run
Output — a "base" model that's fluent but not yet aligned to be helpful
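
The figures above can be sanity-checked with a common rule of thumb: training a dense transformer takes roughly 6 floating-point operations per parameter per token. The sketch below applies that approximation. Every input (parameter count, token count, GPU count, per-GPU peak throughput, utilization) is a hypothetical value picked for illustration, not a figure from any specific model or chip.

```python
def training_flops(n_params, n_tokens):
    # Rule of thumb: about 6 FLOPs per parameter per token (forward plus backward pass).
    return 6 * n_params * n_tokens

def training_days(total_flops, n_gpus, peak_flops_per_gpu, utilization):
    # Sustained cluster throughput is peak throughput scaled by achieved utilization.
    sustained_flops_per_second = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_second / 86_400  # seconds per day

# All inputs below are hypothetical.
flops = training_flops(n_params=70e9, n_tokens=15e12)
days = training_days(flops, n_gpus=8192, peak_flops_per_gpu=1e15, utilization=0.4)
print(f"{flops:.2e} FLOPs, about {days:.0f} days")
```

With these assumed inputs the run takes roughly three weeks on about eight thousand GPUs, which is consistent with the "weeks to months" range above. Halving the utilization or doubling the token count doubles the duration.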

What goes into the data mix: Web crawl (filtered), books, Wikipedia, academic papers, code (GitHub, Stack Overflow), curated educational content, and increasingly synthetic data. The exact mix is one of the most closely guarded competitive secrets at frontier labs.
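
One simple way to picture a data mix is as a set of sampling weights: each training document is drawn from a source with probability proportional to that source's weight. The sketch below assumes exactly that. The source names and weights in MIX are made up for illustration, and real pipelines may mix data quite differently.

```python
import random
from collections import Counter

# Hypothetical mixture weights; real mixes are not public.
MIX = {"web": 0.55, "code": 0.20, "papers": 0.10, "books": 0.10, "synthetic": 0.05}

def sample_sources(mix, n_docs, seed=0):
    """Pick a source for each of n_docs documents, proportional to its weight."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n_docs)

counts = Counter(sample_sources(MIX, n_docs=10_000))
for name, count in counts.most_common():
    print(f"{name:10s} {count / 10_000:.1%}")
```

The observed shares land close to the configured weights, so changing one weight shifts how much of that source the model sees during training.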

Why it matters: Pre-training sets the ceiling on what a model knows. Post-training (fine-tuning, RLHF) shapes the model's behaviour but rarely adds new knowledge. If a fact wasn't in the pre-training data, the model probably doesn't know it.

02 ——