Training

Transformer

The neural network architecture, introduced in 2017, that underpins nearly every modern LLM and many image generators.

01 ——

In plain English

The Transformer is a neural network architecture introduced by Google researchers in the 2017 paper "Attention Is All You Need." It became the foundation for most major AI systems since: GPT, BERT, Claude, and Gemini are all transformer models, and image generators such as Stable Diffusion rely on transformer components, including attention layers and a transformer text encoder.

Why transformers won:

  • Attention mechanism — for each token, the model weighs how relevant every other token in the input is, then builds its output from the most relevant ones
  • Parallelisation — earlier architectures (RNNs) processed text one token at a time; transformers process all positions of the input at once during training, which makes training much faster on modern hardware
  • Scalability — quality has so far kept improving as data, parameters, and compute are scaled up
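To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not a production implementation: the toy vectors are made up, and the same vectors are used as queries, keys, and values, whereas a real transformer derives each from learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three tokens with hypothetical 2-dimensional vectors.
# Assumption: the raw vectors stand in for learned Q, K, V projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token attends to every token in the input. The loop iterations do not depend on one another, which is why real implementations can compute all positions at once as a single matrix operation.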

Variants:

  • Decoder-only — GPT, Claude, Llama (used for chat and generation)
  • Encoder-only — BERT (used for classification and search)
  • Encoder-decoder — T5, BART (used for translation and summarisation)
  • Vision Transformers (ViT) — apply the architecture to images split into patches (used in computer vision)
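One practical difference between the decoder-only and encoder-only variants is which tokens each position may attend to. Decoder-only models typically apply a causal mask so a token sees only itself and earlier tokens, while encoder-only models such as BERT attend in both directions. The sketch below illustrates this with made-up uniform scores; the function name `attention_weights` is hypothetical and not taken from any library.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix, optionally causally masked."""
    rows = []
    for i, row in enumerate(scores):
        # With a causal mask, position i may only attend to positions 0..i.
        visible = row[: i + 1] if causal else row
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Masked (future) positions receive zero weight.
        rows.append(weights + [0.0] * (len(row) - len(weights)))
    return rows

# Hypothetical uniform scores for a 3-token input.
scores = [[0.0] * 3 for _ in range(3)]
encoder_style = attention_weights(scores, causal=False)  # bidirectional
decoder_style = attention_weights(scores, causal=True)   # left-to-right
```

In `decoder_style`, the first token attends only to itself and the second only to the first two, which is what lets such a model generate text one token at a time. In `encoder_style`, every token attends to the full input.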

The transformer is arguably the most influential AI architecture of the last decade.

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
