Quantization
Shrinking an AI model by storing its weights in lower-precision numbers — making it smaller, faster, and cheaper with minimal quality loss.
In plain English
Quantization compresses an AI model by reducing the precision of its numbers, for example by converting 32-bit floating-point weights to 8-bit or 4-bit integers. At 8 bits the weights take roughly a quarter of their original storage, and at 4 bits roughly an eighth. The model also tends to run faster, usually with only a small drop in quality.
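To make the idea concrete, here is a minimal sketch of one simple scheme: symmetric 8-bit quantization with a single shared scale for the whole list of weights. The function names `quantize_int8` and `dequantize` and the sample weights are made up for illustration. Real toolkits typically use more refined variants, such as one scale per channel or per small group of weights.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # The scale is the float value represented by one integer step.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the 8-bit range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


# Illustrative weights only; not taken from any real model.
weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("integers:", q)
print("scale:", scale)
print("largest error:", max_error)
```

Only the integers and the one scale need to be stored, which is where the size reduction comes from. The small value 0.003 rounds to zero here, which is the kind of precision loss that shows up as a quality drop.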
Why it matters:
- Run on smaller hardware — quantized models can fit on a laptop, phone, or single GPU
- Lower inference cost — smaller weights mean less memory traffic and cheaper arithmetic, so each chip can serve more requests per second
- Edge AI — quantization is essential for on-device models such as those behind Apple Intelligence
Common levels:
- FP16 / BF16 — half-precision, near-original quality
- INT8 — usually negligible quality loss
- INT4 — significant compression, small but noticeable quality drop
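As a rough illustration of what these levels mean for storage, the sketch below estimates the size of the weights alone at each precision. The 7-billion-parameter count is a hypothetical example, and the estimate ignores the extra scale values that quantized formats store, as well as the memory needed for activations at run time.

```python
# Bits used to store one weight at each precision level.
BITS_PER_WEIGHT = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4}


def weight_size_gb(num_params, bits):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
num_params = 7_000_000_000

sizes = {
    name: weight_size_gb(num_params, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gb in sizes.items():
    print(f"{name:>10}: {gb:.1f} GB")
```

Under these assumptions, each halving of the bit width halves the storage, which is why a model that needs a data-center GPU at FP32 can fit on a laptop at INT4.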
Most open-weight models you can download (Llama, Mistral, Qwen) are available in multiple quantized variants, published by the original developers or by the community, so you can pick the size that fits your hardware.