Infra & cost

Quantization

Shrinking an AI model by storing its weights in lower-precision numbers — making it smaller, faster, and cheaper with minimal quality loss.

01 ——

In plain English

Quantization compresses an AI model by reducing the precision of its numbers, for example by converting 32-bit floating-point weights to 8-bit or 4-bit integers. Going from 32 bits to 8 bits per weight cuts weight storage to roughly a quarter. The model becomes much smaller and faster to run, usually with only a small drop in quality.
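To make the idea concrete, here is a minimal sketch of one common scheme, symmetric per-tensor INT8 quantization, written in plain Python. The function names and sample weights are made up for illustration. Production toolchains typically use finer-grained scales (per channel or per group) and calibration data, so treat this as a sketch rather than how any particular library does it.

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers using one shared scale (illustrative)."""
    max_abs = max(abs(w) for w in weights)
    # The scale maps the largest magnitude to 127; use 1.0 if every weight is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for the example.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is now stored as one small integer plus a single shared scale, and the gap between the original and restored values is the "small drop in quality" described above.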

Why it matters:

  • Run on smaller hardware: quantized models can fit on a laptop, phone, or single GPU
  • Lower inference cost: smaller weights mean less memory traffic and cheaper math, so each chip can serve more requests per second
  • Edge AI: quantization is essential for on-device models like Apple Intelligence

Common levels:

  • FP16 / BF16: 16-bit half precision, near-original quality
  • INT8: 8-bit integers, about a quarter the size of FP32, usually negligible quality loss
  • INT4: 4-bit integers, about an eighth the size of FP32, small but noticeable quality drop
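The size difference between these levels is simple arithmetic: parameters times bits per weight, divided by eight to get bytes. The sketch below works it through for a hypothetical 7-billion-parameter model. The parameter count is an assumption picked for illustration, and the figures cover weights only (no per-group scales, activations, or KV cache).

```python
# Bits used to store one weight at each precision level.
BITS_PER_WEIGHT = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4}


def weight_size_gb(num_params, bits):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # hypothetical model size, for illustration only

sizes = {name: weight_size_gb(NUM_PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}
for name, gb in sizes.items():
    print(f"{name:>9}: {gb:.1f} GB")
```

Under these assumptions, the same model needs about 28 GB at FP32, 14 GB at FP16/BF16, 7 GB at INT8, and 3.5 GB at INT4.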

Most open-weight models you can download (Llama, Mistral, Qwen) are available in multiple quantized variants, often published by the community, so you can pick the size that fits your hardware.
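As a rough way to reason about that choice, the sketch below picks the highest-precision variant whose weights fit in a memory budget. The variant list, the `pick_variant` helper, and the 80% headroom figure are all assumptions for illustration. Real requirements also depend on context length, batch size, and the runtime you use.

```python
# Candidate variants, ordered from highest to lowest precision (illustrative list).
VARIANT_BITS = [("FP16", 16), ("INT8", 8), ("INT4", 4)]


def pick_variant(num_params, memory_gb, headroom=0.8):
    """Return the highest-precision variant whose weights fit, or None.

    headroom is an assumed fraction of memory left for weights; the rest is
    reserved for activations, the KV cache, and the runtime itself.
    """
    budget_bytes = memory_gb * 1e9 * headroom
    for name, bits in VARIANT_BITS:
        if num_params * bits / 8 <= budget_bytes:
            return name
    return None


# Hypothetical 7-billion-parameter model on three different memory sizes.
choice_24gb = pick_variant(7_000_000_000, 24)
choice_8gb = pick_variant(7_000_000_000, 8)
choice_4gb = pick_variant(7_000_000_000, 4)
print(choice_24gb, choice_8gb, choice_4gb)
```

With these assumed numbers, a 24 GB device can hold the FP16 weights, an 8 GB device needs the INT4 variant, and a 4 GB device cannot fit any of the three.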

Last reviewed May 2026
Vol. 4 · Issue 19 · Last reviewed 2026-05-30
