Back to blog
EngineeringUpdated June 10, 202621 min read

Temperature, Top-P, Top-K: AI Sampling Parameters Explained (2026)

Temperature, top-p, top-k explained for prompt engineers. What each does, when to tune, default values per task. With concrete examples.

NH
Nafiul Hasan
Founder, Prompt Architects

TL;DR: Three sampling parameters control how random an LLM's output is. Temperature is the main dial most people tune. Top-p (nucleus sampling) and top-k are tighter controls you usually leave at default. And in 2026, the newest reasoning models — GPT-5 and Claude Opus 4.8 — are removing these knobs entirely in favor of reasoning-effort controls. Per-task settings, code examples, and the 2026 changes are all below.

What are temperature, top-p, and top-k in AI?

Temperature, top-p, and top-k are the three sampling parameters that control randomness when a large language model picks each next token. Temperature rescales the whole probability distribution (higher means more random). Top-p (nucleus sampling) keeps only the smallest set of tokens whose cumulative probability reaches a threshold. Top-k keeps only the K most-probable tokens. Tune temperature first; leave the others at default.

If you have ever wondered why the same prompt produces a slightly different answer every time you hit send, the answer lives in these parameters. The AI temperature and top-p settings are the difference between a model that returns the exact same JSON on every call and one that writes ten genuinely different ad headlines. Understanding them is one of the highest-leverage skills in prompt engineering, because the right setting can swing accuracy, reliability, and creativity without changing a single word of your prompt.

This guide explains what each parameter does, gives you copy-pasteable defaults per task, walks through real API examples for OpenAI and Anthropic, and covers the big 2026 shift: frontier reasoning models that no longer let you set temperature at all.

What is sampling, and why does it make output random?

When an LLM generates text, it does not simply "know" the next word. At each step, the model produces a probability distribution over its entire vocabulary — every possible next token gets a score. Modern tokenizers are large: GPT-2 and GPT-3 used roughly 50,000 tokens, GPT-4's cl100k tokenizer expanded to around 100,000, and recent models like Llama 3 (128k) and Gemma 3 (262k) go even higher, according to vocabulary-size breakdowns of modern LLMs.

So at every single step, the model is choosing one token out of tens of thousands of candidates. Sampling is the process of picking from that distribution. If the model always picked the single highest-probability token, output would be deterministic — and often dull and repetitive. Instead, it samples, and sampling parameters control how it samples.

Here is the mental model. Imagine the model is about to finish the sentence "The capital of France is ___." The distribution might look like this:

Candidate tokenRaw probability
Paris0.92
the0.03
a0.02
located0.01
... 50,000 more ...tiny

With a factual prompt like this, you almost always want "Paris." But with a creative prompt — "Write the opening line of a noir novel" — a flat distribution where many tokens are plausible is exactly what you want. Sampling parameters let you reshape and trim that distribution to match the job.

What does temperature do in an LLM?

Temperature scales the probability distribution before the model samples from it. A low temperature sharpens the distribution so the model strongly favors the top tokens. A high temperature flattens it, giving unlikely tokens a real chance. Mathematically, the model divides each token's logit by the temperature value before applying softmax — small divisor, sharper peaks; large divisor, flatter curve.

Here is how the common values behave:

  • Temperature 0 — Deterministic. Always picks the most probable token. Same input produces the same output (in theory; see the FAQ on floating-point caveats).
  • Temperature 0.3 — Mostly safe choices with a little variation. Good for code and factual answers.
  • Temperature 0.7 — Balanced randomness. The traditional everyday default.
  • Temperature 1.0 — The full distribution as the model was trained. The default in OpenAI's and Anthropic's classic APIs.
  • Temperature 1.5+ — Amplified randomness. The model reaches into the long tail of unlikely tokens. More creative, less reliable.
  • Temperature 2.0 — Often produces broken, off-topic, or garbled text.

The range differs by provider. OpenAI's Chat Completions API accepts temperature from 0 to 2, with a default of 1.0. Anthropic's classic Messages API accepts 0.0 to 1.0, also defaulting to 1.0, and recommends values closer to 0 for analytical tasks and closer to 1 for creative ones.

Temperature settings by task

Here is the cheat sheet most teams converge on. Treat it as a starting point, not gospel — always test against your own prompts.

TaskRecommended temperature
Structured extraction (JSON from text)0
Classification (sentiment, category, routing)0 – 0.2
Code generation0.2 – 0.4
Factual Q&A0.2 – 0.4
Translation0.3 – 0.5
Summarization0.3 – 0.5
Customer support replies0.5 – 0.7
Marketing copy0.7 – 1.0
Brainstorming1.0 – 1.3
Creative writing / fiction1.0 – 1.5
Bulk variant generation1.0 – 1.5

The practical rule: when there is one right answer, set temperature 0. When variety helps the outcome, raise it. If you are building structured JSON output, temperature 0 paired with a strict schema is the gold standard for parseable, repeatable results.

What is top-p (nucleus sampling)?

Top-p, also called nucleus sampling, restricts which tokens are eligible before sampling. The model sorts tokens by probability and keeps the smallest set whose cumulative probability reaches the threshold P, discarding the long tail. A top-p of 0.9 keeps just enough of the top tokens to cover 90% of the probability mass.

The key difference from top-k: top-p adapts the candidate pool to each step. When the model is confident (one token at 0.95), top-p might keep only that one token. When the model is uncertain (the top 30 tokens each around 3%), top-p keeps a wide set. It is "dynamic" where top-k is "fixed."

How the common values behave:

  • Top-p 0.1 — Only the most probable handful of tokens. Very narrow, near-greedy.
  • Top-p 0.5 — Considers tokens that together account for 50% of probability mass.
  • Top-p 0.9 — A common balanced setting. Trims the unlikely tail.
  • Top-p 0.95 — A frequent default. Keeps almost everything except the most improbable tokens.
  • Top-p 1.0 — Full distribution, no truncation. The default value on OpenAI's API.

Range: 0 to 1.

When to tune top-p

For most users, the honest answer is: rarely. Tuning top-p moves the needle less than tuning temperature. Microsoft's Azure OpenAI guidance and the wider community both echo the provider rule that you should alter temperature or top_p, but not both, since they interact in ways that are hard to reason about together.

Some useful pairings, if you do experiment:

Temperature 0   + top_p 1.0  = pure greedy (most-probable token each step)
Temperature 0.7 + top_p 0.9  = balanced, slightly trimmed tail
Temperature 1.0 + top_p 0.5  = creative but bounded (won't go off the rails)

That last pairing is genuinely useful: a high temperature gives you exploration, while a tight top-p prevents the model from sampling truly bizarre low-probability tokens. It is one of the few cases where tuning both can be justified — you are using top-p as a safety rail on a high-temperature setup.

What is top-k sampling?

Top-k sampling restricts the candidate pool to the K most-probable tokens, regardless of how much probability mass they cover. Where top-p keeps "as many tokens as it takes to reach 90%," top-k keeps "exactly the top 50, no matter what."

How the common values behave:

  • Top-k 1 — Greedy. Always pick the single most-probable token. Equivalent to temperature 0.
  • Top-k 40 — Consider only the top 40 tokens. A common default in open-source models.
  • Top-k 50 — Slightly wider; another frequent default.
  • Top-k 1000+ — Effectively no truncation for most models, since the useful probability mass is concentrated in far fewer tokens.

Range: 1 to vocabulary size.

When to use top-k

Top-k is the parameter you are least likely to touch. It is a coarser, less adaptive version of top-p, and OpenAI's Chat Completions API never exposed it. You will encounter it mainly in:

  • Open-source models (Llama, Qwen, Mistral) where the default top-p is not well tuned for your use case.
  • Reproducing research that specified an exact top-k value.
  • Anthropic's classic API, where top_k is available as a way to remove long-tail, low-probability responses.

For most people in 2026, the entire top-k conversation is academic. Tune temperature, occasionally touch top-p, and leave top-k alone.

How do temperature, top-p, and top-k interact?

They apply in a specific sequence during token selection. Knowing the order helps you predict what happens when more than one is set.

The LLM sampling pipeline runs like this:

  1. The model produces a raw probability distribution over all vocabulary tokens.
  2. Temperature scales the distribution — higher flattens, lower sharpens.
  3. Top-k truncates to the K most-probable tokens. (In Anthropic's pipeline, the top_k filter runs first, discarding all but the K highest-probability tokens and renormalizing.)
  4. Top-p truncates to the nucleus — the smallest set whose cumulative probability reaches P.
  5. The model samples one token from whatever survives.

Here is the same idea as a table, showing what each stage does:

StageParameterEffect on the distribution
1(raw logits)Tens of thousands of candidate tokens
2TemperatureReshapes — flatter or sharper
3Top-kHard cut to K tokens
4Top-pHard cut to the nucleus (cumulative P)
5(sampling)One token chosen at random from survivors

If both top-p and top-k are set, both filters apply and you get whichever is more restrictive at each step. Usually one is enough — and on the newest frontier models, as you will see, you can set none of them.

What are the best default settings for each use case?

Here are the configurations that cover the overwhelming majority of real-world needs. Copy whichever matches your task.

For everyday balanced tasks:

temperature: 0.7
top_p: 0.95        (or just omit — leave at default)
top_k: 50          (only if your model supports it; otherwise omit)

For deterministic output (extraction, classification, structured data):

temperature: 0
top_p: 1.0

For creative variation (marketing, ideation, fiction):

temperature: 1.0
top_p: 0.95

For maximum randomness (rarely used in production):

temperature: 1.5
top_p: 1.0

Notice that you are only ever really moving temperature. That is intentional. For the vast majority of applications, temperature is the only sampling parameter worth tuning, and everything else stays at default.

Real API examples: OpenAI and Anthropic

Concrete code beats abstract advice. Here are four common scenarios with working request shapes. (These use the classic, sampling-enabled models; the next section covers the newest models that reject these parameters.)

Example 1: Extracting entities from an email

You want a deterministic, parseable shape every single time. Temperature 0 plus structured output is the right call.

// Production extraction — determinism is critical
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "Extract contact details as JSON." },
    { role: "user", content: emailBody },
  ],
  response_format: { type: "json_schema", json_schema: contactSchema },
  temperature: 0, // same input -> same output
});

Example 2: Generating 10 ad variants

Here you want diversity. Crank temperature up and request multiple completions in one call.

// Creative variation desired
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "user", content: "Write a punchy headline for a running shoe." },
  ],
  temperature: 1.0, // variety wanted
  n: 10, // 10 distinct completions in a single request
});

Example 3: Customer support reply

Reliable but not robotic. A middle temperature keeps replies on-brand without sounding copy-pasted.

// Balanced — accurate but human
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: supportPersona },
    { role: "user", content: customerMessage },
  ],
  temperature: 0.5,
});

Example 4: Code refactor on Anthropic's classic API

Code wants correctness with a touch of stylistic freedom. Anthropic's classic models accept temperature and top_p.

// Correctness-first, small stylistic room
const response = await anthropic.messages.create({
  model: "claude-3-7-sonnet",
  max_tokens: 4096,
  messages: [{ role: "user", content: refactorPrompt }],
  temperature: 0.3,
  top_p: 0.95,
});

If you find yourself hand-setting temperature per task every time, that is exactly the kind of repetitive setup a prompt library with per-task presets is meant to eliminate.

The big 2026 shift: models that won't let you set temperature

This is the most important update since this guide was first written, and it changes how you should think about sampling. The newest frontier reasoning models are removing temperature, top-p, and top-k entirely.

OpenAI's GPT-5 reasoning models

GPT-5 reasoning models reject custom sampling values. Try to set temperature to anything other than 1 and the API returns an error like "Unsupported value: 'temperature' does not support 0.2 with this model. Only the default (1) value is supported," as documented across multiple developer reports. The reason is architectural: reasoning models run multiple internal rounds of reasoning, verification, and selection, and forcing a deterministic sampling path breaks that machinery. To steer output, OpenAI introduced reasoning_effort and verbosity as replacements for the old temperature dial.

Anthropic's Claude Opus 4.7 and 4.8

Anthropic went the same direction. Per Anthropic's own documentation, setting temperature, top_p, or top_k to a non-default value returns a 400 error on Claude Opus 4.8, the same as on Claude Opus 4.7. The official guidance is blunt: "Omit these parameters and use prompting to guide the model's behavior." Importantly, the SDK still defines these fields for backward compatibility, so your code type-checks — but the request is rejected server-side at runtime. Instead, Claude Opus 4.8 uses an effort parameter (defaulting to high) and adaptive thinking to control reasoning depth.

Here is what that looks like in practice:

# Old way (Claude Opus 4.6 and earlier) — now a 400 error on 4.8
response = client.messages.create(
    model="claude-opus-4-8",
    messages=[...],
    temperature=0.3,   # rejected
    top_p=0.95,        # rejected
)

# New way (Claude Opus 4.7 and later)
response = client.messages.create(
    model="claude-opus-4-8",
    messages=[...],
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
)

What this means for your mental model

The takeaway is not "temperature is dead." Plenty of models — GPT-4o, Claude's classic Sonnet and Haiku tiers, and essentially the entire open-source ecosystem — still expose temperature, top-p, and top-k. But the frontier is clearly moving toward reasoning-effort controls instead of sampling controls. When you pick a model, check whether it accepts sampling parameters before you build logic that depends on them, and have a fallback path for models that reject them.

Model classTemperature/top-p/top-k?What to use instead
GPT-4o and earlier chat modelsYesTune temperature
GPT-5 reasoning modelsNo (fixed at 1)reasoning_effort, verbosity
Claude 3.x / classic Sonnet, HaikuYesTune temperature
Claude Opus 4.7 / 4.8No (400 error)effort, adaptive thinking, prompting
Open-source (Llama, Qwen, Mistral)Yes, including top-kTune temperature first

Self-consistency: turning temperature into accuracy

One of the best reasons to raise temperature on purpose is a technique called self-consistency, and the numbers behind it are striking.

Self-consistency was introduced by Wang et al. at Google Research (ICLR 2023). Instead of taking a single greedy answer, you sample several diverse reasoning paths at a non-zero temperature, then take the majority vote across them. The diversity comes from temperature — typically around 0.7 — which lets the model explore different chains of reasoning that often converge on the correct answer even when any single chain might slip.

The accuracy gains reported in the original work are large. According to a breakdown of the paper's benchmarks, self-consistency improved:

  • GSM8K (grade-school math) by 17.9%
  • SVAMP (arithmetic word problems) by 11.0%
  • AQuA (algebraic reasoning) by 12.2%
  • StrategyQA (multi-hop reasoning) by 6.4%
  • ARC-challenge (science questions) by 3.9%

Here is the pattern in code — five samples at temperature 0.7, then a majority vote:

// Self-consistency: sample N reasoning paths, take the consensus
const responses = await Promise.all(
  Array.from({ length: 5 }).map(() =>
    openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: chainOfThoughtPrompt }],
      temperature: 0.7, // diversity drives the gains
    })
  )
);

const answers = responses.map((r) => extractAnswer(r));
const consensus = mode(answers); // majority vote

This pairs naturally with chain-of-thought prompting: you elicit explicit reasoning, sample it several times at moderate temperature, and let the majority correct one-off mistakes. For high-stakes math and logic, it is one of the most reliable accuracy boosts you can get without changing models. The obvious trade-off is cost and latency — you are paying for N completions instead of one — so reserve it for the tasks where correctness genuinely matters.

Common mistakes with sampling parameters

Even experienced builders trip over the same handful of issues. Here are the ones worth memorizing.

  1. Setting temperature 0 for creative tasks. Deterministic output is repeatable but bland. Use 0.7-1.0 for marketing copy and 1.0+ for ideation. If your "creative" outputs feel samey, your temperature is probably too low.

  2. Setting temperature 1.5+ in production. Very high temperatures reach into the long tail of unlikely tokens, where output degrades into broken or off-topic text. Cap production temperatures around 1.2 for stability.

  3. Tuning both top-p and temperature. Providers explicitly recommend changing one or the other, not both, because their combined effect is hard to predict. Pick temperature and leave top-p alone unless you have a specific reason.

  4. Forgetting that temperature is free. Temperature does not change cost or model tier — only which token is chosen. Tune it per task, not once per app. There is no budget reason to use the same temperature everywhere.

  5. Shipping random temperature into critical paths. The same prompt at temperature 0.7 returns different output on every call. For operations that must be reproducible — audits, idempotent pipelines, regression tests — use temperature 0 plus a fixed seed where the API supports one.

  6. Assuming temperature works on every model. As covered above, GPT-5 reasoning models and Claude Opus 4.7/4.8 reject non-default sampling values. Code that hardcodes temperature=0.3 will throw a 400 on those models. Branch by model capability.

  7. Treating temperature 0 as perfectly deterministic. It is the most deterministic setting available, but floating-point math on GPUs, mixture-of-experts routing, and hardware load-balancing can still introduce rare variation. Add a seed when you truly need bit-for-bit reproducibility.

What changed across 2025-2026

A quick timeline of how the sampling landscape evolved:

  • Reasoning modes arrived. GPT-5 and Claude Opus reasoning apply (or used to apply) temperature to the visible output, while the internal reasoning trace runs more deterministically regardless of the setting.
  • Frontier models dropped sampling controls. First GPT-5 reasoning models fixed temperature at 1; then Claude Opus 4.7 and 4.8 began rejecting temperature, top_p, and top_k outright. Reasoning-effort and verbosity parameters replaced them.
  • Structured output matured. JSON-schema modes plus temperature 0 became the gold standard for production extraction — guaranteed-parseable output with zero variance.
  • Open-source kept the full toolkit. Llama, Qwen, and Mistral families still expose temperature, top-p, and top-k, so the classic knowledge remains essential when you self-host.
  • Multi-sample APIs made self-consistency cheap to implement. The n parameter lets you request several completions at temperature 0.7 in a single call, making majority-vote reasoning a one-request pattern.

Quick reference card

Save this. It covers roughly 90% of sampling decisions you will ever make on models that still support these parameters.

GoalTemperatureTop-pNotes
Deterministic extraction01.0Pair with structured output
Classification / routing0 – 0.2default
Code (correctness-focused)0.2 – 0.4default
Translation0.3 – 0.5default
Summarization0.3 – 0.5default
Customer support0.5 – 0.7default
Marketing copy0.7 – 1.0default
Brainstorming1.0 – 1.30.9 – 1.0Raise both for variety
Bulk variant generation1.0+1.0Set n=10 to batch
Creative writing1.0 – 1.51.0Cap at 1.5
Self-consistency reasoning0.7defaultGenerate 5x, vote
Reasoning model (GPT-5, Opus 4.8)n/an/aUse effort/verbosity instead

What to do next

You do not need to memorize the math behind softmax to use these parameters well. You need a small set of habits.

  1. Audit your production prompts. If you are using temperature 0.7 for everything, you are probably leaking accuracy on deterministic tasks and starving creative tasks of variety. Right-size per task.
  2. A/B test by temperature. Run the same prompt at 0, 0.5, and 1.0. Note which value suits each job, then standardize.
  3. Implement self-consistency for high-stakes reasoning. Five samples at 0.7 plus a majority vote is a proven accuracy boost for math and logic.
  4. Pair temperature 0 with structured output for extraction. It is the most reliable production pattern for parseable data.
  5. Check model capabilities before you ship. Branch your code so reasoning models that reject sampling parameters fall back to effort and verbosity controls.

Tools that ship per-task temperature presets — like the Prompt Architects Chrome extension and prompt library — save you the manual setup on every call. But the understanding matters more than the tool: once you know what each parameter does, you can get the right behavior from any model, whether it exposes a temperature dial or a reasoning-effort knob.

Frequently asked questions

What's the difference between temperature and top-p? Temperature scales the entire probability distribution before sampling — higher flattens it for more randomness, lower sharpens it toward determinism. Top-p restricts which tokens are eligible by keeping only the smallest set whose cumulative probability reaches P. They control different things and interact unpredictably, so providers recommend tuning one or the other, not both. In practice, tune temperature and leave top-p at default.

What temperature should I use for code generation? Use 0 for tasks with one correct answer (extraction, classification, structured output), 0.2-0.4 for code where you want correctness with some stylistic variation, and 0.7+ for brainstorming and creative work. Most production apps set temperature per task rather than picking one value for the whole application.

Does temperature affect cost? No. Temperature only changes which token the model selects at each step, not how many tokens it generates or which model tier you call. Token count and model pricing drive cost. The one indirect effect is that very high temperatures can produce longer, more rambling output, which adds output tokens.

Why don't all models support top-k? Top-k is most common in open-source models and Anthropic's classic API. OpenAI's Chat Completions API never exposed it. Most users tune temperature plus top-p, so top-k is niche — and newer frontier models are removing all three sampling parameters anyway.

Why can't I set temperature on GPT-5 or Claude Opus 4.8? Modern reasoning models disable sampling. GPT-5 reasoning models only accept temperature=1 and reject other values because their multi-pass internal reasoning breaks under forced determinism. Claude Opus 4.7 and 4.8 return a 400 error if you set temperature, top_p, or top_k to a non-default value. Use reasoning-effort and verbosity controls plus prompting instead.

What is nucleus sampling? Nucleus sampling is another name for top-p sampling. The model sorts tokens by probability, then keeps the smallest set (the nucleus) whose cumulative probability reaches the top-p threshold, discarding the long tail. A top-p of 0.9 keeps just enough top tokens to cover 90% of the probability mass, adapting the candidate pool to each step.

How do I pick the right temperature for production? Start at the default, run your prompt 5 times, and observe. If outputs vary too much, lower toward 0.2-0.3. If they are too repetitive, raise toward 1.0. Production teams usually pick per task: 0 for extraction, 0.5-0.7 for support, 1.0 for creative variation, plus a seed where reproducibility matters.

Is temperature 0 truly deterministic? Almost, but not always. Temperature 0 picks the most-probable token at each step, which is deterministic in theory. In practice, GPU floating-point non-determinism, mixture-of-experts routing, and hardware load-balancing can still cause occasional variation. For strict reproducibility, set temperature 0 plus a fixed seed where the API supports it.


By Nafiul Hasan — Founder of Prompt Architects, builder of prompt-engineering tooling used across ChatGPT, Claude, and Gemini. Last updated: June 10, 2026.

Frequently asked questions

Free Chrome Extension

Stop rewriting prompts. Start shipping.

Works with ChatGPT, Claude, Gemini, Grok, Midjourney, Ideogram, Veo3 & Kling. 5.0★ on the Chrome Web Store.

Create An Account