Prompting

Top-k Sampling

A decoding strategy that samples the next token only from the K most likely candidates, trading diversity for focus.

01 ——

In plain English

Top-k sampling is one of the basic levers for controlling LLM output randomness. At each step, the model produces a probability for every possible next token. Top-k discards all but the K highest-probability tokens, renormalizes the survivors so their probabilities sum to 1, and samples from those.
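The step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular library's implementation: the function name `top_k_sample` and the five-token `logits` list are made up for the example, and real decoders work on tensors over vocabularies of tens of thousands of tokens.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the survivors only, which renormalizes their probabilities.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical scores for a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
```

With `k=2`, only token indices 0 and 1 can ever appear in `samples`; the other three tokens are excluded no matter how many draws are made.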

How it works:

  • K=1 is equivalent to greedy decoding (always pick the most likely token)
  • K=40 is a common default — enough diversity, still focused
  • K=∞ (in practice, any K at or above the vocabulary size) means no restriction; you're back to sampling from the full distribution
  • Lower K = more deterministic, higher K = more diverse
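To make the effect of K concrete, the sketch below applies three settings to one small distribution. The helper `top_k_probs` and the four-token `probs` list are illustrative assumptions, chosen so the arithmetic is easy to follow.

```python
def top_k_probs(probs, k):
    """Zero out all but the k most likely tokens, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


# Hypothetical next-token distribution over a 4-token vocabulary.
probs = [0.5, 0.25, 0.125, 0.125]

greedy = top_k_probs(probs, 1)          # all mass on the single most likely token
focused = top_k_probs(probs, 2)         # two candidates, rescaled to sum to 1
full = top_k_probs(probs, len(probs))   # K = vocabulary size: nothing removed
```

Here `greedy` puts probability 1 on token 0, `focused` splits the mass roughly 2/3 to 1/3 between tokens 0 and 1, and `full` is identical to the input distribution.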

Where you see it: Several chat APIs (Anthropic, Google, OpenRouter) expose a top-k parameter, though defaults are usually fine. OpenAI's API is a notable exception: it offers temperature and top_p but no top-k setting. Worth tuning if you're getting repetitive output (raise K) or too-creative drift (lower K).

Top-k vs top-p:

  • Top-k is a fixed count (always K candidates)
  • Top-p is a fixed cumulative probability (varies by how peaked the distribution is)

In practice, top-p is more commonly tuned in production because it adapts to the model's confidence.
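The difference shows up when the same settings meet a confident distribution and an uncertain one. The following is a rough sketch under assumed inputs: the helper names and the two five-token distributions (`peaked`, `flat`) are invented for illustration.

```python
def top_k_candidates(probs, k):
    """Indices of the k most likely tokens."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def top_p_candidates(probs, p):
    """Smallest set of most likely tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # model is confident
flat = [0.25, 0.25, 0.20, 0.15, 0.15]    # model is unsure

k_sizes = (len(top_k_candidates(peaked, 3)), len(top_k_candidates(flat, 3)))
p_sizes = (len(top_p_candidates(peaked, 0.8)), len(top_p_candidates(flat, 0.8)))
```

With K=3, both distributions keep three candidates. With p=0.8, the peaked distribution keeps one candidate and the flat one keeps four, which is the adaptive behavior described above.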

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
