Top-p Sampling
A decoding strategy (also called nucleus sampling) that samples the next token from the smallest set of highest-probability candidates whose cumulative probability reaches a threshold P.
In plain English
Top-p sampling is one of the most common randomness controls in production LLM apps. Instead of capping the candidates at a fixed count (top-k), it caps them at a cumulative probability: tokens are ranked from most to least likely, and the model keeps as many as it takes to add up to P (typically 0.9 or 0.95), then samples only from that set.
Why it's the default:
- Adapts to confidence — when the model is very sure, only 1–2 tokens make the cut; when it's unsure, more tokens are eligible
- Avoids the "K too small for some contexts, too big for others" problem of top-k
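The adaptive behavior can be illustrated with a minimal sketch in plain Python. The function name `nucleus` and the two toy distributions are made up for illustration; real inference stacks work on full-vocabulary tensors, so treat this as a picture of the idea rather than any library's implementation.

```python
def nucleus(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept}

# Hypothetical next-token distributions (illustrative numbers only).
confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.02, "Rome": 0.02}
unsure = {"blue": 0.28, "red": 0.24, "green": 0.20, "black": 0.16, "pink": 0.12}

print(len(nucleus(confident, 0.9)))  # 1: the top token alone covers 0.9
print(len(nucleus(unsure, 0.9)))     # 5: four tokens only reach 0.88
```

With the same threshold, the peaked distribution yields a single candidate while the flat one keeps every token, which is the behavior a fixed K cannot reproduce.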
Typical settings:
- top_p = 0.9: a widely used setting; conservative but not robotic
- top_p = 0.95: slightly more diverse
- top_p = 1.0: no nucleus filtering; falls back to temperature only (some APIs ship with this as the default, so check your provider's documentation)
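To get a feel for what these settings do, the sketch below counts how many tokens survive at each threshold for one made-up distribution. The helper `nucleus_size` and the probabilities are assumptions chosen for illustration, not values from any real model.

```python
def nucleus_size(probs, top_p):
    """Count the tokens kept by top-p filtering for a list of probabilities."""
    if top_p >= 1.0:
        return len(probs)  # no filtering: every token stays eligible
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= top_p:
            return count
    return len(probs)

# Hypothetical 7-token distribution; cumulative sums are
# 0.50, 0.72, 0.85, 0.92, 0.96, 0.99, 1.00.
probs = [0.50, 0.22, 0.13, 0.07, 0.04, 0.03, 0.01]

for top_p in (0.9, 0.95, 1.0):
    print(top_p, nucleus_size(probs, top_p))
# 0.9 4
# 0.95 5
# 1.0 7
```

Moving from 0.9 to 0.95 admits only one extra low-probability token here, which is why the difference between the two settings tends to be subtle.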
How it pairs with temperature:
- Temperature reshapes the probability distribution (peak it or flatten it)
- Top-p then trims the tail
- Common pattern: leave top_p fixed at 0.9 and tune temperature for creative vs precise tasks
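The ordering described above (temperature first, then top-p) can be sketched as follows. This is a toy pipeline over a four-token vocabulary using only the standard library; the function names and logits are hypothetical, and production implementations may differ in details such as tie handling.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the
    distribution, temperature > 1 flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def nucleus_indices(probs, top_p):
    """Indices of the most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept

def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Step 1: reshape with temperature. Step 2: trim the tail with top-p.
    Step 3: sample from what is left (choices() renormalizes the weights)."""
    probs = softmax(logits, temperature)
    kept = nucleus_indices(probs, top_p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for a 4-token vocabulary

print(nucleus_indices(softmax(logits, 0.5), 0.9))  # [0]
print(nucleus_indices(softmax(logits, 1.5), 0.9))  # [0, 1, 2]
```

Because top-p is applied after the temperature step, the same top_p = 0.9 keeps one token at a low temperature and three at a higher one. That interaction is one reason a fixed top_p combined with a tuned temperature is usually enough.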
Almost every chat completion API supports top-p; for most production use the default is fine and temperature is the dial you actually move.