Prompting

Top-p Sampling

A decoding strategy (also called nucleus sampling) that samples the next token from the smallest set of most-probable candidates whose cumulative probability reaches a threshold P.

01 ——

In plain English

Top-p sampling (also called nucleus sampling) is one of the most common randomness controls in production LLM apps. Instead of capping the candidates at a fixed count (top-k), it caps them at a cumulative probability: the model ranks tokens from most to least likely, keeps as many as it takes to add up to P (typically 0.9 or 0.95), and samples the next token from that set.

Why it's the default:

Adapts to confidence — when the model is very sure, only 1–2 tokens make the cut; when it's unsure, more tokens are eligible
Avoids the "K too small for some contexts, too big for others" problem of top-k
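The adaptive behavior is easier to see with numbers. The following is a minimal sketch, not any particular library's implementation: the function name nucleus and the two toy distributions are made up for illustration, and it assumes the cutoff is inclusive (a token is kept as soon as the running total reaches top_p), which real implementations may handle slightly differently.

```python
def nucleus(probs, top_p):
    """Return the tokens kept by top-p filtering, with renormalized probabilities."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Rank candidates from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        # Stop as soon as the kept set's total mass reaches top_p.
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distributions, invented for this example.
confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.02, "Rome": 0.01, "Oslo": 0.01}
unsure = {"red": 0.25, "blue": 0.22, "green": 0.20, "gold": 0.18, "pink": 0.15}

print(len(nucleus(confident, 0.9)))  # 1: one token already covers 0.9
print(len(nucleus(unsure, 0.9)))     # 5: all five are needed to reach 0.9
```

With the same top_p, the candidate set shrinks to a single token when the model is confident and widens to all five when it is not. A fixed top-k of 3 would have kept three tokens in both cases.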

Typical settings:

top_p = 0.9: a widely used setting; conservative but not robotic
top_p = 0.95: slightly more diverse
top_p = 1.0: no nucleus filtering; every token stays eligible and temperature alone controls randomness (this is the out-of-the-box default in some APIs)

How it pairs with temperature:

Temperature reshapes the probability distribution first (sharpening or flattening it)
Top-p then trims the tail
Common pattern: leave top_p fixed at around 0.9 and tune temperature, lower for precise tasks and higher for creative ones
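The ordering described above (temperature first, then top-p) can be sketched as follows. This is an illustrative toy under stated assumptions, not how any specific inference engine is written: the function names and the logits are hypothetical, and the cutoff is assumed to be inclusive.

```python
import math
import random


def nucleus_after_temperature(logits, temperature=1.0, top_p=0.9):
    """Apply temperature to logits, then return the top-p nucleus as {token: prob}."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Step 1: temperature scaling followed by a numerically stable softmax.
    scaled = {tok: z / temperature for tok, z in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(z - peak) for tok, z in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}
    # Step 2: top-p trims the tail of the reshaped distribution.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token from the nucleus; random.choices normalizes the weights."""
    kept = nucleus_after_temperature(logits, temperature, top_p)
    return rng.choices(list(kept), weights=list(kept.values()), k=1)[0]


# Hypothetical logits for five candidate tokens.
logits = {"the": 4.0, "a": 3.0, "this": 2.0, "that": 1.0, "zebra": -2.0}

for t in (0.5, 1.0, 2.0):
    print(t, list(nucleus_after_temperature(logits, temperature=t)))
# 0.5 keeps 2 tokens, 1.0 keeps 3, 2.0 keeps 4

print(sample_next_token(logits, temperature=1.0, rng=random.Random(0)))
```

Because temperature runs first, raising it flattens the distribution and lets more tokens into the nucleus even though top_p never changes. That is why a fixed top_p with a variable temperature works as a single practical dial.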

Almost every chat completion API supports top-p; for most production use the default is fine and temperature is the dial you actually move.

02 ——