Top-k Sampling
A decoding strategy that picks the next token only from the top K most likely candidates — trading diversity for focus.
In plain English
Top-k sampling is one of the basic levers for controlling the randomness of LLM output. At each step, the model produces a probability for every possible next token. Top-k discards all but the K highest-probability tokens, renormalizes the remaining probabilities so they sum to 1, and samples from those.
How it works:
- K=1 is equivalent to greedy decoding (always pick the most likely token)
- K=40 is a common default — enough diversity, still focused
- K=∞ means no restriction; you're back to sampling from the full distribution
- Lower K = more deterministic, higher K = more diverse
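The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular library's implementation: the function names (softmax, top_k_filter, top_k_sample) and the toy logits are hypothetical, and a real decoder would typically do this on tensors rather than lists.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely token ids and renormalize their probabilities."""
    k = max(1, min(k, len(probs)))  # clamp k to the range [1, vocabulary size]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def top_k_sample(logits, k, rng=random):
    """Sample one token id from the k most likely candidates."""
    kept = top_k_filter(softmax(logits), k)
    ids = list(kept)
    weights = [kept[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Toy example: a 4-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]
greedy_token = top_k_sample(logits, k=1)   # with k=1 only the top token survives
focused_token = top_k_sample(logits, k=2)  # drawn from token ids 0 and 1 only
```

With k=1 the filter leaves a single candidate, which is why it behaves like greedy decoding; with k at or above the vocabulary size nothing is removed and the call samples from the full distribution.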
Where you see it:
Several chat APIs (Anthropic, Google, OpenRouter) expose top_k as a parameter, though defaults are usually fine. OpenAI's API does not expose top_k; it offers temperature and top_p instead. Worth tuning if you're getting repetitive output (raise K) or output that drifts off topic (lower K).
Top-k vs top-p:
- Top-k is a fixed count (always K candidates)
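- Top-p (nucleus sampling) is a fixed cumulative probability: it keeps the smallest set of top tokens whose probabilities add up to at least p, so the number of candidates varies with how peaked the distribution is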
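In practice, top-p is more commonly tuned in production because it adapts to the model's confidence: it keeps only a few candidates when one token dominates and more when the probability is spread out.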
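To make the difference concrete, here is a small sketch that counts how many candidates each method keeps for a peaked and a flat distribution. The helper names (top_k_count, top_p_count) and the two example distributions are made up for illustration.

```python
def top_k_count(probs, k):
    """Top-k keeps a fixed number of candidates (capped at the vocabulary size)."""
    return min(k, len(probs))


def top_p_count(probs, p):
    """Size of the smallest set of most-likely tokens whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)  # fallback if rounding keeps the total just below p


# Illustrative distributions over a 5-token vocabulary.
peaked = [0.92, 0.04, 0.02, 0.01, 0.01]  # the model is confident
flat = [0.2, 0.2, 0.2, 0.2, 0.2]         # the model is unsure

print(top_k_count(peaked, 3), top_k_count(flat, 3))      # 3 3
print(top_p_count(peaked, 0.9), top_p_count(flat, 0.9))  # 1 5
```

Top-k keeps three candidates in both cases, while top-p keeps one token when the model is confident and all five when it is not.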