Pre-training
The first and most expensive phase of building a model — learning language and world knowledge by predicting the next token across trillions of words.
In plain English
Pre-training is the foundational stage where a base model learns to predict the next token across an enormous corpus of text, code, and (increasingly) images, audio, and video. It's where the bulk of a model's knowledge comes from — and where most of the cost goes.
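To make "predict the next token" concrete, here is a minimal sketch of the quantity pre-training tries to minimise: the average negative log-probability the model assigns to each actual next token. The names `next_token_loss` and `uniform_model` are made up for illustration, and the "model" is a stand-in that guesses uniformly over a four-token vocabulary, not a real network.

```python
import math

def next_token_loss(token_ids, probs_fn):
    """Average negative log-likelihood of each token given the tokens before it."""
    total = 0.0
    for t in range(1, len(token_ids)):
        prefix = token_ids[:t]      # everything the model has seen so far
        target = token_ids[t]       # the token it should predict next
        probs = probs_fn(prefix)    # probability for every token in the vocabulary
        total += -math.log(probs[target])
    return total / (len(token_ids) - 1)

# Hypothetical stand-in for a model: always guesses uniformly over 4 tokens.
VOCAB_SIZE = 4

def uniform_model(prefix):
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

loss = next_token_loss([0, 2, 1, 3], uniform_model)
print(round(loss, 3))  # 1.386, i.e. log(4): the loss of pure guessing
```

A model that has learned something assigns more than 1/4 probability to the right token, so its loss falls below this guessing baseline. Pre-training is, roughly, pushing that number down across the whole corpus.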
Scale of pre-training:
- Data — 10–30+ trillion tokens for frontier models
- Compute — thousands of GPUs running for weeks to months
- Cost — tens to hundreds of millions of dollars per training run
- Output — a "base" model that's fluent but not yet aligned to be helpful
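A back-of-the-envelope calculation shows how the data and compute figures above fit together. This sketch assumes the common rule of thumb that training a dense transformer takes roughly 6 floating-point operations per parameter per token. That rule, and every number below (model size, token count, GPU count, per-GPU throughput, utilisation), is an illustrative assumption rather than a figure from any particular lab or chip.

```python
def training_flops(n_params, n_tokens):
    # Assumed rule of thumb: about 6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

def wall_clock_days(total_flops, n_gpus, peak_flops_per_gpu, utilization):
    # Effective cluster throughput is peak throughput scaled by utilisation.
    effective_flops_per_sec = n_gpus * peak_flops_per_gpu * utilization
    seconds = total_flops / effective_flops_per_sec
    return seconds / 86_400  # seconds in a day

# Hypothetical values, chosen only to illustrate the arithmetic.
N_PARAMS = 70e9        # 70 billion parameters
N_TOKENS = 15e12       # 15 trillion tokens
N_GPUS = 8192
PEAK_FLOPS = 1e15      # assumed 1 PFLOP/s peak per GPU
UTILIZATION = 0.4      # assumed fraction of peak actually achieved

flops = training_flops(N_PARAMS, N_TOKENS)
days = wall_clock_days(flops, N_GPUS, PEAK_FLOPS, UTILIZATION)
print(f"{flops:.2e} FLOPs, about {days:.1f} days")  # 6.30e+24 FLOPs, about 22.3 days
```

Under these assumptions a single run ties up thousands of GPUs for about three weeks, which is consistent with the "weeks to months" range above. Doubling the token count or halving the utilisation doubles the time.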
What goes into the data mix: Web crawl (filtered), books, Wikipedia, academic papers, code (GitHub, Stack Overflow), curated educational content, and increasingly synthetic data. The exact mix is one of the most closely guarded competitive secrets at frontier labs.
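In practice, a data mix usually means deciding how often each source is sampled during training. The sketch below shows one simple way that could work: weighted random sampling over source names. The weights are invented for illustration only (real mixes are not public, as noted above), and `MIX` and `sample_sources` are hypothetical names.

```python
import random

# Hypothetical sampling weights, for illustration only. They sum to 1.0.
MIX = {
    "web_crawl": 0.60,
    "code": 0.15,
    "books": 0.10,
    "academic_papers": 0.08,
    "wikipedia": 0.04,
    "synthetic": 0.03,
}

def sample_sources(mix, n, seed=0):
    """Pick the source of each of n training documents according to the mix."""
    rng = random.Random(seed)   # seeded so the draw is repeatable
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

batch = sample_sources(MIX, 10_000)
counts = {name: batch.count(name) for name in MIX}
print(counts)  # counts land close to weight * 10,000 for each source
```

Raising a source's weight means the model sees it more often, which is one reason small, high-quality sources can be given more influence than their raw size would suggest.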
Why it matters: Pre-training sets the ceiling. Post-training (fine-tuning, RLHF) shapes the model's behaviour but rarely adds new knowledge. If a fact wasn't in the pre-training data, the model probably doesn't know it unless that fact is supplied in the prompt.