
Training

Training Data

The dataset an AI model learns from — its quality, diversity, and biases directly shape what the model can do and how well it does it.

01 ——

In plain English

Training data is the set of examples used to teach an AI model. For a large language model (LLM), that's hundreds of billions to trillions of words from books, websites, code, and more. For an image generator, it's billions of images with captions. The model's capabilities, blind spots, and biases all trace back to this data.

What makes training data matter:

Quality — clean, well-labelled data produces better models than noisy data
Diversity — covering more languages, topics, and demographics broadens capability
Recency — the model knows nothing that happened after its data was collected; that date is its "knowledge cutoff"
Licensing — training on copyrighted data without permission is the subject of active litigation
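
To make the quality and recency points concrete, here is a minimal Python sketch of the kind of cleaning pass a training set might go through. The records, the `CUTOFF` date, and the `MIN_WORDS` threshold are all made up for illustration; real pipelines are far larger and use more sophisticated filters.

```python
from datetime import date

# Hypothetical toy records; a real corpus would hold billions of these.
records = [
    {"text": "The cat sat on the mat.", "collected": date(2023, 5, 1)},
    {"text": "The cat sat on the mat.", "collected": date(2023, 6, 1)},  # exact duplicate
    {"text": "buy now!!!", "collected": date(2023, 7, 1)},               # too short to be useful
    {"text": "Transformers process tokens in parallel.", "collected": date(2024, 9, 1)},  # after cutoff
]

CUTOFF = date(2024, 1, 1)  # assumed knowledge cutoff, for illustration only
MIN_WORDS = 4              # assumed quality threshold, for illustration only


def clean(records, cutoff=CUTOFF, min_words=MIN_WORDS):
    """Drop short, duplicated, or post-cutoff records, preserving order."""
    seen = set()
    kept = []
    for rec in records:
        # Normalise case and whitespace so trivial variants count as duplicates.
        normalized = " ".join(rec["text"].lower().split())
        if len(normalized.split()) < min_words:
            continue  # quality: too little content
        if rec["collected"] >= cutoff:
            continue  # recency: collected on or after the cutoff
        if normalized in seen:
            continue  # quality: exact duplicate of an earlier record
        seen.add(normalized)
        kept.append(rec)
    return kept


cleaned = clean(records)
print([r["text"] for r in cleaned])
```

Only the first record survives here: the second is a duplicate, the third is too short, and the fourth falls after the assumed cutoff, so a model trained on the result would know nothing about it.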

Common training data sources:

Web crawls (Common Crawl, etc.)
Books and academic papers
Code repositories (GitHub)
Image datasets (LAION, ImageNet)
Synthetic data (generated by other models)
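
Training sets for text models usually blend several of these sources in chosen proportions. The sketch below shows one simple way to draw examples from a weighted mix. The source names and weights are hypothetical, not the recipe of any real model.

```python
import random
from collections import Counter

# Hypothetical mixture weights; real proportions are rarely published.
SOURCE_WEIGHTS = {
    "web_crawl": 0.60,
    "books_and_papers": 0.15,
    "code": 0.15,
    "synthetic": 0.10,
}


def sample_sources(n, weights=SOURCE_WEIGHTS, seed=0):
    """Pick which source each of n training examples is drawn from."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    names = list(weights)
    return rng.choices(names, weights=[weights[s] for s in names], k=n)


counts = Counter(sample_sources(10_000))
for name, count in counts.most_common():
    print(f"{name}: {count / 10_000:.1%}")
```

Raising or lowering a weight changes how much of the model's training comes from that source, which is one lever builders have over a model's strengths and blind spots.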

How a model was trained is increasingly a competitive secret — and a legal flashpoint.

02 ——

Related terms

Machine Learning

A type of AI where systems learn patterns from data rather than being explicitly programmed with rules.

Fine-tuning

Further training a pre-trained AI model on your own data to specialise it for a specific task or style.

Large Language Model

The type of AI behind tools like ChatGPT and Claude, trained to understand and generate text.

Foundation Model

A large, general-purpose AI model trained on broad data that can be adapted (via prompting or fine-tuning) to many downstream tasks.

Bias

When an AI model's outputs systematically reflect unfair patterns from its training data, such as patterns about gender, race, age, or other groups.

Last reviewed June 2026

Vol. 4 · Issue 21 · Last reviewed 2026-06-27
