Prompting

Top-p Sampling

A decoding strategy (also called nucleus sampling) that samples the next token from the smallest set of most-likely candidates whose cumulative probability reaches P.

01 ——

In plain English

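Top-p sampling (also called nucleus sampling) is one of the most common randomness controls in production LLM apps, alongside temperature. Instead of capping the candidates at a fixed count (top-k), it caps them at a cumulative probability: tokens are ranked from most to least likely, and the model keeps as many as it takes to add up to P (typically 0.9 or 0.95). The next token is then sampled from that kept set, with its probabilities renormalized to sum to 1.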
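The selection step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the function name nucleus, the toy distribution, and the choice of ">=" for the threshold comparison are all assumptions made for the example.

```python
def nucleus(probs, p):
    """Keep the smallest set of top-ranked tokens whose total probability reaches p.

    `probs` maps token -> probability. Returns the kept tokens with their
    probabilities renormalized to sum to 1. (Hypothetical helper for illustration.)
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:  # assumption: stop as soon as the mass reaches p
            break
    return {token: prob / total for token, prob in kept}


# Toy next-token distribution (made-up numbers).
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.04, "zebra": 0.01}

# With p = 0.9 the nucleus is "the", "a", "an" (0.5 + 0.3 + 0.15 = 0.95);
# "this" and "zebra" are dropped before sampling.
print(nucleus(probs, 0.9))
```

Real inference stacks do the same thing on tensors over the full vocabulary, and may differ in how they treat the token that crosses the threshold.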

Why it's the default:

  • Adapts to confidence — when the model is very sure, only 1–2 tokens make the cut; when it's unsure, more tokens are eligible
  • Avoids the "K too small for some contexts, too big for others" problem of top-k
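
To make the adaptivity concrete, the sketch below counts how many tokens fall inside the nucleus for a peaked distribution and for a flat one. The helper nucleus_size and both distributions are invented for illustration.

```python
def nucleus_size(probs, p=0.9):
    """Count how many top-ranked tokens are needed for the mass to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)


confident = [0.93, 0.04, 0.02, 0.01]            # model is very sure
unsure = [0.22, 0.20, 0.18, 0.15, 0.13, 0.12]   # model is unsure

print(nucleus_size(confident))  # 1
print(nucleus_size(unsure))     # 6
```

A fixed top-k of 3 would keep three candidates in both cases: two unlikely tokens too many in the first, three plausible tokens too few in the second.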

Typical settings:

  • top_p = 0.9: a common setting (API defaults vary by provider); conservative but not robotic
  • top_p = 0.95: slightly more diverse
  • top_p = 1.0: no nucleus filtering; sampling is controlled by temperature alone

How it pairs with temperature:

  • Temperature reshapes the probability distribution (peak it or flatten it)
  • Top-p then trims the tail
  • Common pattern: leave top_p at a fixed value such as 0.9 and tune temperature for creative vs. precise tasks
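
The pairing described above can be sketched end to end: scale the logits by temperature, convert to probabilities, trim the tail with top-p, then sample. This is a simplified illustration using only the Python standard library. The function name sample_next and the toy logits are assumptions, and real inference stacks may apply these steps in a different order or alongside other filters.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token: temperature first, then top-p. Requires temperature > 0."""
    # 1. Temperature reshapes the distribution (softmax over scaled logits).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}

    # 2. Top-p trims the tail: keep top-ranked tokens until the mass reaches top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        total += prob
        if total >= top_p:
            break

    # 3. Sample from the kept tokens; random.choices normalizes the weights.
    tokens = [tok for tok, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy logits (made-up numbers): "the" dominates at temperature 1.0.
logits = {"the": 5.0, "a": 1.0, "zebra": -3.0}
print(sample_next(logits, temperature=1.0, top_p=0.9))
```

Raising the temperature flattens the distribution before the cut, so the same top_p admits more tokens; lowering it sharpens the peak and shrinks the nucleus.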

Almost every chat completion API supports top-p; for most production use the default is fine and temperature is the dial you actually move.

Vol. 4 · Issue 19 · Last reviewed 2026-05-30