title: "Temperature, Top-P, Top-K: AI Sampling Parameters Explained (2026)" slug: "48-temperature-top-p-top-k-explained" description: "Temperature, top-p, top-k explained for prompt engineers. What each does, when to tune, default values per task. With concrete examples." publishedAt: "2026-07-06" updatedAt: "2026-07-06" postNum: 48 pillar: 5 targetKeyword: "ai temperature top p" keywords:
- "ai temperature top p"
- "llm sampling parameters"
- "temperature top-p top-k"
- "openai temperature"
- "claude temperature" ogImage: "https://prompt-architects.com/og/48-temperature-top-p-top-k-explained.png" author: name: "Nafiul Hasan" role: "Founder, Prompt Architects" url: "https://prompt-architects.com/about" ctaFeature: "json" related: [42, 41, 47] faq:
- q: "What's the difference between temperature and top-p?" a: "Temperature scales the probability distribution before sampling — higher temp flattens it (more randomness), lower temp sharpens it (more deterministic). Top-p restricts which tokens are considered — only tokens whose cumulative probability reaches P. They control different things and interact: in practice tune temperature, leave top-p at default (0.95-1.0)."
- q: "What temperature should I use for code generation?" a: "0 (deterministic) for tasks where there's a single correct answer (extraction, classification, structured output). 0.2-0.4 for code where you want some variation but mostly correctness. 0.7+ for brainstorming, creative writing, and exploration. Most production LLM apps default to 0.2-0.7."
- q: "Does temperature affect cost?" a: "No. Temperature only affects which token the model picks at each step — not how many tokens it generates or which model it uses. Token count and model tier drive cost; temperature is free to tune."
- q: "Why don't all models support top-k?" a: "Top-k is more common in older or open-source models (Llama, older Mistral). OpenAI's chat completions API doesn't expose top-k. Anthropic's API does. Most users tune temperature + top-p; top-k is a less common knob in 2026."
- q: "How do I pick the right temperature for production?" a: "Start at 0.7 (default in most APIs). Run your prompt 5 times; if outputs vary too much, lower to 0.3. If outputs are too samey and miss edge cases, raise to 1.0. Production AI apps often pick per-task: 0 for extraction, 0.7 for support replies, 1.0 for creative variation."
TL;DR: Three sampling parameters control LLM output randomness. Temperature is the dial you tune. Top-p and top-k are tighter controls most users leave at default. Per-task settings below.
## What sampling actually is
When an LLM generates text, it doesn't pick the next token deterministically. At each step, the model produces a probability distribution over all possible next tokens (50K+ for most models). Sampling parameters control which token is selected from that distribution.
The same prompt to the same model produces different output across runs because of randomness in this sampling step. Sampling parameters control how much randomness you get.
## Temperature
What it does: scales the probability distribution before sampling.
- Temperature 0: deterministic. Always picks the most probable token. Same input → same output.
- Temperature 0.5: moderate randomness. Some variation, mostly safe choices.
- Temperature 1.0: the distribution exactly as the model produced it. Default in most APIs.
- Temperature 1.5+: amplified randomness. Considers low-probability tokens. More creative, less reliable.
- Temperature 2.0: chaos. Often produces broken or off-topic output.
Range: 0 to 2 (OpenAI's API caps at 2; Anthropic's caps at 1).
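To see what the scaling actually does, here's a minimal sketch in TypeScript: a made-up three-token vocabulary run through a temperature-scaled softmax. Illustrative only, not any provider's implementation.

```ts
// Softmax with temperature: divide logits by T before exponentiating.
// Higher T flattens the distribution; lower T sharpens it.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature === 0) {
    // Greedy: all probability mass on the argmax token.
    const best = logits.indexOf(Math.max(...logits));
    return logits.map((_, i) => (i === best ? 1 : 0));
  }
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Toy three-token vocabulary:
const logits = [2.0, 1.0, 0.5];
console.log(softmaxWithTemperature(logits, 0.5)); // sharp:   ~[0.84, 0.11, 0.04]
console.log(softmaxWithTemperature(logits, 1.0)); // neutral: ~[0.63, 0.23, 0.14]
console.log(softmaxWithTemperature(logits, 2.0)); // flat:    ~[0.48, 0.29, 0.23]
```

Notice the most-probable token never changes; what changes is how often the others get picked.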
When to tune:
| Task | Recommended temperature |
|---|---|
| Structured extraction (JSON from text) | 0 |
| Classification (sentiment, category) | 0 - 0.2 |
| Code generation | 0.2 - 0.4 |
| Factual Q&A | 0.2 - 0.4 |
| Translation | 0.3 - 0.5 |
| Customer support replies | 0.5 - 0.7 |
| Marketing copy | 0.7 - 1.0 |
| Brainstorming | 1.0 - 1.3 |
| Creative writing | 1.0 - 1.5 |
| Bulk variant generation | 1.0 - 1.5 |
Practical rule: when there's one right answer, set temperature 0. When variety helps, raise it.
## Top-P (Nucleus Sampling)

What it does: restricts which tokens are considered before sampling. The smallest set of tokens whose cumulative probability reaches P stays eligible; the rest are discarded.
- Top-p 0.1: only the most-probable handful of tokens. Very narrow.
- Top-p 0.5: considers tokens that together account for 50% of probability mass.
- Top-p 0.95: a common production setting. Considers all but the long tail of unlikely tokens.
- Top-p 1.0: full distribution. No truncation.
Range: 0 to 1.
When to tune: usually leave at default (0.95-1.0). Tuning top-p is less impactful than tuning temperature for most users.
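For intuition, here's the truncation step as a standalone sketch, using toy probabilities rather than a real model's output:

```ts
// Nucleus (top-p) truncation: keep the smallest set of tokens whose
// cumulative probability reaches p, then renormalize the survivors.
function topPFilter(probs: number[], p: number): number[] {
  // Token indices sorted by descending probability.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept = new Set<number>();
  let cumulative = 0;
  for (const i of order) {
    kept.add(i);
    cumulative += probs[i];
    if (cumulative >= p) break; // nucleus complete
  }
  const total = [...kept].reduce((sum, i) => sum + probs[i], 0);
  return probs.map((prob, i) => (kept.has(i) ? prob / total : 0));
}

console.log(topPFilter([0.5, 0.3, 0.15, 0.05], 0.9));
// → [~0.53, ~0.32, ~0.16, 0]: the 0.05 tail is cut, the rest renormalized
```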
Useful pairings:
- Temperature 0 + top-p 1.0 = pure greedy (most-probable each step)
- Temperature 0.7 + top-p 0.9 = balanced default
- Temperature 1.0 + top-p 0.5 = creative but bounded (won't go off the rails)
OpenAI's recommendation: tune one or the other, not both. In practice most users tune temperature.
## Top-K Sampling
What it does: restricts which tokens are considered to the K most-probable, regardless of probability mass.
- Top-k 1: greedy — always pick the single most-probable token. Equivalent to temperature 0.
- Top-k 50: consider only the top 50 tokens.
- Top-k 1000+: effectively no truncation in practice (vocabularies run 50K+ tokens, but nearly all probability mass sits in far fewer).
Range: 1 to vocab size.
When to tune: rarely. Top-k is a coarser version of top-p. Most modern APIs (including OpenAI's chat completions) don't expose top-k; Anthropic's does.
When you might use top-k:
- Open-source models where top-p isn't well-tuned for your use case
- Reproducing specific behavior from a research paper that specified top-k
- Working with very small vocabularies where top-p isn't granular enough
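Since Anthropic's API is one of the hosted ones that exposes the knob, here's what setting it looks like there. A sketch: the model string and prompt are placeholders, and note that max_tokens is required by the Messages API.

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: "claude-opus-4", // placeholder; use your deployed model
  max_tokens: 1024,       // required by the Messages API
  messages: [{ role: "user", content: "Give me five taglines for a coffee brand." }],
  temperature: 0.8,
  top_k: 40, // only the 40 most-probable tokens are eligible at each step
});
```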
## How they interact

The sampling pipeline runs in order:

1. The model produces a probability distribution over all tokens.
2. Temperature scales the distribution (higher = flatter).
3. Top-p truncates to the nucleus (the smallest set with cumulative probability ≥ P).
4. Top-k truncates to the K most-probable tokens.
5. A token is sampled from the remaining distribution.
If both top-p and top-k are set, both apply. Usually one is enough.
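To make the ordering concrete, here's a toy end-to-end sampler in TypeScript. It's a sketch of the textbook pipeline, not any vendor's implementation (real inference engines differ, e.g. in whether top-k or top-p is applied first):

```ts
// Toy sampler: temperature, then top-k, then top-p, then a weighted random pick.
function sampleToken(
  logits: number[],
  { temperature = 1.0, topK = Infinity, topP = 1.0 } = {}
): number {
  if (temperature === 0) {
    return logits.indexOf(Math.max(...logits)); // greedy: argmax token
  }

  // 1. Temperature-scaled softmax (subtract the max for numerical stability).
  const scaled = logits.map((l) => l / temperature);
  const maxLogit = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  let probs = exps.map((e) => e / sum);

  // Token indices sorted by descending probability.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);

  // 2. Top-k: keep only the K most-probable tokens.
  const kSet = new Set(order.slice(0, topK));

  // 3. Top-p: keep the smallest prefix whose cumulative probability reaches P.
  const pSet = new Set<number>();
  let cumulative = 0;
  for (const i of order) {
    if (cumulative >= topP) break;
    pSet.add(i);
    cumulative += probs[i];
  }

  // Zero out everything that failed either filter, then renormalize.
  probs = probs.map((p, i) => (kSet.has(i) && pSet.has(i) ? p : 0));
  const total = probs.reduce((a, b) => a + b, 0);

  // 4. Weighted random sample from what survives.
  let r = Math.random() * total;
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return order[0]; // floating-point fallback: most-probable token
}
```

Setting topK to 1 reproduces greedy decoding, and topP 1.0 with topK Infinity samples the full temperature-scaled distribution, matching the pairings listed above.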
## Practical defaults

### For everyday tasks

```
temperature: 0.7
top_p: 0.95   # or leave at the default
top_k: 50     # if the model supports it; otherwise omit
```

### For deterministic output (extraction, classification)

```
temperature: 0
top_p: 1.0
```

### For creative variation

```
temperature: 1.0
top_p: 0.95
```

### For maximum randomness (don't really use in production)

```
temperature: 1.5
top_p: 1.0
```
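One way to enforce these per task is a preset map you spread into every call. A sketch, with task names and values of our choosing rather than any library's API:

```ts
// Hypothetical per-task sampling presets; tune the values to your own tasks.
type SamplingPreset = { temperature: number; top_p?: number };

const PRESETS: Record<string, SamplingPreset> = {
  extraction:     { temperature: 0, top_p: 1.0 },
  classification: { temperature: 0 },
  code:           { temperature: 0.3 },
  support:        { temperature: 0.5 },
  marketing:      { temperature: 0.9 },
  brainstorm:     { temperature: 1.2 },
};

// Spread a preset into any call:
// await openai.chat.completions.create({ model, messages, ...PRESETS.support });
```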
## Use case examples

### Use case 1: Extracting entities from email

```ts
// Production extraction — want deterministic shape
const response = await openai.chat.completions.create({
  model: "gpt-5",
  messages: [...],
  response_format: { type: "json_schema", json_schema: { ... } },
  temperature: 0, // determinism critical
});
```
### Use case 2: Generating 10 ad variants

```ts
// Creative variation desired
const response = await openai.chat.completions.create({
  model: "gpt-5",
  messages: [...],
  temperature: 1.0, // variety wanted
  n: 10, // 10 completions in one request
});
```
### Use case 3: Customer support reply

```ts
// Balanced — reliable but not robotic
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [...],
  temperature: 0.5,
});
```
### Use case 4: Code refactor

```ts
// Code wants correctness; some variation OK for style
const response = await anthropic.messages.create({
  model: "claude-opus-4",
  max_tokens: 4096, // required by Anthropic's Messages API
  messages: [...],
  temperature: 0.3,
  top_p: 0.95,
});
```
## Self-consistency: temperature 0.7 × N votes

For high-stakes reasoning tasks (math, code logic), run the same prompt 5+ times at temperature 0.7-1.0 and take the majority answer across runs. This washes out one-off reasoning errors that any single chain might make.
```ts
const responses = await Promise.all(
  Array.from({ length: 5 }).map(() =>
    openai.chat.completions.create({
      model: "gpt-5",
      messages: [{ role: "user", content: cotPrompt }],
      temperature: 0.7,
    })
  )
);
const answers = responses.map((r) => extractAnswer(r));
const consensus = mode(answers);
```
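extractAnswer and mode in the snippet above are your own helpers, not SDK functions. Minimal sketches might look like this (the last-line heuristic is an assumption; adapt it to your prompt format):

```ts
// Naive answer extraction: take the completion's last line.
// Adapt to your prompt format (e.g., an explicit "Answer: ..." marker).
function extractAnswer(r: { choices: { message: { content: string | null } }[] }): string {
  const text = r.choices[0]?.message?.content ?? "";
  return text.trim().split("\n").at(-1) ?? "";
}

// Majority vote: the most frequent answer wins.
function mode(answers: string[]): string {
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);
  return [...counts.entries()].sort((x, y) => y[1] - x[1])[0][0];
}
```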
This trick (called self-consistency prompting) is standard for high-accuracy math/logic in production.
## Common mistakes
- Setting temperature 0 for creative tasks. Output is deterministic but bland. Use 0.7-1.0 for marketing, 1.0+ for ideation.
- Setting temperature 1.5+ in production. The flattened distribution lets low-probability tokens through, and output gets weird. Cap at 1.2 for stability.
- Tuning both top-p and top-k. Pick one. Most users tune temperature only.
- Forgetting temperature is free. Tune per-task, not per-app. Cost is the same.
- Production code that assumes repeat runs match. The same prompt at temperature 0.7 produces different output across runs. Critical operations should set a seed where supported (sketch below) or use temperature 0.
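On OpenAI's chat completions, the seed parameter gives best-effort reproducibility: the same seed with the same parameters makes repeated runs match as closely as the backend allows. A sketch, assuming an initialized openai client as in the earlier snippets; the model and message are placeholders:

```ts
const response = await openai.chat.completions.create({
  model: "gpt-5", // placeholder; match your production model
  messages: [{ role: "user", content: "Classify this ticket: ..." }],
  temperature: 0.7,
  seed: 42, // best-effort determinism, not a hard guarantee
});
console.log(response.system_fingerprint); // changes when the backend config changes
```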
## What changed in 2025-2026
- Reasoning modes (GPT-5 reasoning, Claude Opus 4 thinking): temperature applies to the visible output, not the internal reasoning. Internal reasoning is often more deterministic regardless.
- Structured output APIs (json_schema mode): temperature 0 + structured output guarantees parseable output. Gold standard for production extraction.
- Open-source models (Llama 4, Qwen 3) often expose top-k. Cloud-hosted frontier models mostly hide it.
- Multi-sample APIs (n parameter on OpenAI) make self-consistency easy: one request, 5 outputs at temperature 0.7.
## Quick reference card
| Goal | Temperature | Top-P | Notes |
|---|---|---|---|
| Deterministic extraction | 0 | 1.0 | Pair with structured output |
| Code (correctness-focused) | 0.2-0.4 | default | |
| Translation | 0.3-0.5 | default | |
| Customer support | 0.5-0.7 | default | |
| Marketing copy | 0.7-1.0 | default | |
| Brainstorming | 1.0-1.3 | 0.9-1.0 | Raise both for variety |
| Bulk variant generation | 1.0+ | 1.0 | Set n=10 to batch |
| Creative writing | 1.0-1.5 | 1.0 | Cap at 1.5 |
| Self-consistency reasoning | 0.7 | default | Generate 5x, vote |
Save this card. 90% of sampling decisions covered.
## What to do next

- Audit your production prompts. Are you using temperature 0.7 for everything? You're probably giving up accuracy on deterministic tasks and variety on creative ones.
- A/B by temperature. Same prompt at 0, 0.5, 1.0 (a loop sketch follows this list). Note which suits each task.
- For high-stakes math/logic, implement self-consistency (5x at 0.7, vote).
- For production extraction, set temperature 0 + structured output mode.
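The temperature A/B from the list above can be a short loop. A sketch, assuming an initialized OpenAI client; the prompt and model are placeholders:

```ts
// Run the same prompt at three temperatures and eyeball the difference.
const prompt = "Summarize this support ticket in two sentences: ...";

for (const temperature of [0, 0.5, 1.0]) {
  const response = await openai.chat.completions.create({
    model: "gpt-5", // placeholder; match your production model
    messages: [{ role: "user", content: prompt }],
    temperature,
  });
  console.log(`--- temperature ${temperature} ---`);
  console.log(response.choices[0].message.content);
}
```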
Tools that ship per-task temperature presets (Prompt Architects) save the manual setting work — but understanding what each parameter does matters more than the tool you use.