Training

Training Data

The dataset an AI model learns from — its quality, diversity, and biases directly shape what the model can do and how well it does it.

01 ——

In plain English

Training data is the set of examples used to teach an AI model. For an LLM, that's hundreds of billions to trillions of words from books, websites, code, and more. For an image generator, it's billions of images with captions. The model's capabilities, blind spots, and biases all trace back to this data.

What makes training data matter:

  • Quality: clean, well-labelled data generally produces better models than noisy data
  • Diversity: covering more languages, topics, and demographics broadens capability
  • Recency: the model knows nothing about events after the date its data ends, known as its "knowledge cutoff"
  • Licensing: training on copyrighted data without permission is the subject of active litigation
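
To make the "quality" point concrete, here is a minimal sketch of the kind of cleaning a text dataset might get before training. It assumes only two simple rules: drop very short documents, and drop exact duplicates after lowercasing and collapsing whitespace. The function name clean_corpus, the min_words threshold, and the sample documents are all hypothetical; real pipelines use far more elaborate filters.

```python
import hashlib


def clean_corpus(docs, min_words=5):
    """Drop very short documents and exact duplicates (after normalisation).

    Illustrative only: `min_words` is an arbitrary, assumed threshold.
    """
    seen = set()
    kept = []
    for doc in docs:
        # Lowercase and collapse whitespace so trivial variants hash the same.
        normalised = " ".join(doc.lower().split())
        if len(normalised.split()) < min_words:
            continue  # too short to be a useful training example
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept


# Hypothetical sample data: one near-identical duplicate and one junk snippet.
raw = [
    "The cat sat on the mat today.",
    "the cat  sat on the mat today.",
    "Click here",
    "Training data shapes what a model can do.",
]
cleaned = clean_corpus(raw)
print(len(cleaned))
```

Only the first and last sample documents survive: the second is a duplicate once case and spacing are normalised, and the third is too short.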

Common training data sources:

  • Web crawls (Common Crawl, etc.)
  • Books and academic papers
  • Code repositories (GitHub)
  • Image datasets (LAION, ImageNet)
  • Synthetic data (generated by other models)

Exactly what data a model was trained on is increasingly a competitive secret and a legal flashpoint.

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
