TL;DR: Zero-shot prompting uses no examples and is faster to write. Few-shot prompting adds 2-5 examples and is usually more accurate for nuanced or custom-format tasks. Pick by task type, not by habit. Use the decision tree, tables, and copy-paste templates below to choose in seconds.
What is the difference between few-shot and zero-shot prompting?
Few-shot vs zero-shot prompting comes down to one thing: examples. Zero-shot prompting gives the model only an instruction and asks it to perform the task with no demonstrations. Few-shot prompting includes a handful of input-output examples (typically 2-5) before your real input, so the model infers the pattern from those examples. Few-shot usually wins on accuracy and output consistency; zero-shot is faster and cheaper.
That single distinction — examples or no examples — drives nearly every practical decision about which to use. The rest of this guide shows you exactly when each one pays off, backed by the original research that established these techniques, plus templates you can paste into ChatGPT, Claude, or Gemini right now.
Both techniques are forms of in-context learning: the model adapts its behavior based on what is in the prompt, without any retraining or fine-tuning. This idea was formalized in the GPT-3 paper, Language Models are Few-Shot Learners (Brown et al., 2020), which showed that a large enough model could perform new tasks just from instructions and examples placed in the context window. That paper is a large part of why few-shot prompting became a standard technique for working with large language models.
What is zero-shot prompting?
Zero-shot prompting asks a model to perform a task using only the instruction itself — no examples. You describe what you want, and the model produces it using knowledge baked in during pretraining.
Here is a clean zero-shot example:
Classify this review as positive, negative, or neutral:
"Decent product but shipping was slow."
The model already understands "positive / negative / neutral classification" from its training data, so it maps your instruction onto that knowledge and returns:
neutral
Zero-shot works because frontier models have seen millions of examples of common tasks during pretraining. Translation, summarization, sentiment analysis, basic question answering, code explanation — these are all so well represented in training data that the model rarely needs to be shown what "good" looks like. You just ask.
When zero-shot shines
- Well-known, generic tasks. Translate this paragraph. Summarize this article in three bullets. Fix the grammar.
- Exploratory and creative work. When you do not yet know the right format, examples would only box the model in.
- One-off questions. Lookups, explanations, or rewrites where consistency across runs does not matter.
- Speed of authoring. Zero-shot prompts are faster to write — there is no example-gathering step.
The trade-off is predictability. Zero-shot output can drift in format, tone, or edge-case handling between runs, because nothing in the prompt pins those down.
What is few-shot prompting?
Few-shot prompting includes a small set of input-output examples (the "shots") in the prompt before your real input. The model reads the examples, recognizes the input-output relationship, and applies that same pattern to your input. No weights change — this is pure in-context learning.
Here is the same sentiment task, rewritten as a few-shot prompt:
Q: "I love this product!"
A: positive
Q: "Worst purchase ever."
A: negative
Q: "It's okay, nothing special."
A: neutral
Q: "Decent product but shipping was slow."
A:
Output: neutral — the same answer, but now far more reliable on borderline cases, because the examples have taught the model exactly how you want ambiguity resolved and exactly what the output should look like (a single lowercase word, not a paragraph).
That last point is underrated. Few-shot examples do not just teach the answer — they teach the format, the label space, and the input style the model should expect. A landmark study, Rethinking the Role of Demonstrations (Min et al., 2022), found that even when researchers randomly replaced the labels in few-shot examples, classification performance barely dropped across 12 models including GPT-3. The examples were doing their work by showing the label space, the input distribution, and the output format — not by spelling out the "right" answers.
The practical takeaway: in few-shot prompting, the shape and variety of your examples often matter more than getting every single label perfect. (For production work, you should still use correct labels — see the caveats later.)
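Because the examples carry the format as much as the answers, it can help to generate the prompt from data instead of typing it by hand. Here is a minimal Python sketch of the Q:/A: prompt shown earlier. The function name `build_few_shot_prompt` and the `EXAMPLES` list are illustrative assumptions, not part of any library.

```python
def build_few_shot_prompt(examples, query):
    """Render (input, label) pairs in the Q:/A: layout, ending with an open 'A:'."""
    lines = []
    for text, label in examples:
        lines.append(f'Q: "{text}"')
        lines.append(f"A: {label}")
    # The real input goes last, with the answer slot left empty for the model.
    lines.append(f'Q: "{query}"')
    lines.append("A:")
    return "\n".join(lines)


# Hypothetical example set, copied from the sentiment prompt in this guide.
EXAMPLES = [
    ("I love this product!", "positive"),
    ("Worst purchase ever.", "negative"),
    ("It's okay, nothing special.", "neutral"),
]

prompt = build_few_shot_prompt(EXAMPLES, "Decent product but shipping was slow.")
print(prompt)
```

Every example passes through the same two f-strings, so the label style and delimiters cannot drift between shots.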
Few-shot, one-shot, and zero-shot in one table
| Technique | Examples in prompt | Best for | Main cost |
|---|---|---|---|
| Zero-shot | 0 | Generic tasks, exploration, speed | Format/output can drift |
| One-shot | 1 | Locking format without over-constraining | One example may not cover edge cases |
| Few-shot | 2-5 | Custom formats, nuanced labels, consistency | More tokens, time to build examples |
| Many-shot | 6+ | Rare; very nuanced tasks with long context | Token bloat; harder for the model to pick out the relevant example |
Few-shot vs zero-shot: which is more accurate?
In most measured comparisons, few-shot prompting is more accurate than zero-shot, and the gap widens on harder or more custom tasks. The original GPT-3 work showed that while both zero-shot and few-shot performance improved as models scaled, few-shot performance climbed faster — larger models are simply better at learning from in-context examples (Brown et al., 2020).
The most dramatic evidence comes from reasoning tasks. In Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022), researchers gave PaLM 540B, a 540B-parameter model, just eight few-shot examples that included step-by-step reasoning. That single change pushed the model to state-of-the-art accuracy on the GSM8K grade-school math benchmark, surpassing a fine-tuned GPT-3 with a verifier. The examples did not just show answers; they showed the reasoning process, and the model copied it.
But accuracy is not a free lunch. Few-shot examples consume tokens, and prompt cost grows roughly linearly with the number of examples you add. The pattern most often reported is that the biggest accuracy jump comes from the first one or two examples, with diminishing returns after four or five. That is why "more examples is always better" is a myth: quality and diversity beat raw count.
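To make the token side of that trade-off concrete, here is a rough Python sketch of how prompt size grows with example count. The 4-characters-per-token figure is a crude assumption rather than a real tokenizer, and the instruction and example strings are placeholders. For real numbers, use your provider's own token counter.

```python
def approx_tokens(text):
    """Crude size estimate: assumes about 4 characters per token (not a real tokenizer)."""
    return max(1, len(text) // 4)


# Placeholder instruction and a single representative example.
instruction = "Classify the review as positive, negative, or neutral."
example = 'Q: "Works, but the app keeps crashing."\nA: neutral\n'


def prompt_cost(n_examples):
    """Estimated prompt size when every example is about the same length."""
    return approx_tokens(instruction) + n_examples * approx_tokens(example)


for n in range(6):
    print(n, prompt_cost(n))
```

Each added example costs the same fixed increment, while the accuracy gain per example tends to shrink. That is the argument for stopping at a handful.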
| Task | Zero-shot quality | Few-shot quality | Verdict |
|---|---|---|---|
| Standard summarization | High | Marginally higher | Zero-shot fine |
| Common-language translation | High | Marginally higher | Zero-shot fine |
| Custom classification (your own labels) | Low-medium | High | Few-shot |
| Brand-voice writing | Medium (drifts) | High | Few-shot |
| Structured extraction (custom JSON) | Medium | High | Few-shot |
| Math / multi-step reasoning | Medium | High (with CoT) | Few-shot CoT |
| Open-ended creative writing | High | Lower (constrained) | Zero-shot |
The pattern is consistent: zero-shot is enough when the task lives squarely inside the model's pretraining; few-shot earns its keep when the task is yours — your categories, your format, your voice.
When should you use few-shot prompting?
Reach for few-shot when one or more of these is true:
- The output style is easier to show than to describe. Brand voice is the classic case. You can write a 200-word style guide, or you can show three before/after rewrites and let the model absorb the voice instantly.
- The task has a custom format the model has never seen. Your internal ticket schema, a specific JSON shape, a particular table layout. Examples lock it in.
- Classification uses nuanced or proprietary categories. "Billing-dispute vs refund-request vs chargeback" is not in pretraining the way "positive vs negative" is. Examples teach your taxonomy.
- You need the same output shape across many runs. Bulk content generation, batch extraction, and agent steps all depend on consistency. Few-shot is the cheapest way to buy it.
- Reasoning quality matters and the path is non-obvious. Show worked examples (chain-of-thought few-shot) so the model imitates the reasoning, not just the format.
If your task hits two or more of these, write the examples. The five minutes you spend usually pay for themselves many times over in reduced rework.
When is zero-shot prompting enough?
Default to zero-shot when:
- The task is well known and generic (translate, summarize, rephrase, explain).
- You are exploring and do not want examples to narrow the model's range.
- You are doing one-off work where run-to-run consistency does not matter.
- You are using a frontier model (GPT-5, Claude Opus 4.x, Gemini 2.x) for an everyday chat task that those models handle confidently without help.
On the strongest 2026 models, the zero-shot floor has risen a lot. Many tasks that needed examples in the GPT-3 era now work fine with a clear instruction. So a good rule of thumb is: try zero-shot first, and only add examples when the output drifts, breaks format, or misses your edge cases. Do not pay the example tax until you need to.
Decision tree: few-shot or zero-shot?
Use this quick decision flow. Stop at the first "yes."
- Is the task generic and well-known (translate, summarize, simple Q&A)? → Zero-shot.
- Is it open-ended creative work where you want range? → Zero-shot (or one example as a soft style anchor).
- Does the output need a specific custom format or schema? → Few-shot.
- Are the categories or labels your own (not universal)? → Few-shot.
- Do you need consistent output shape across many runs? → Few-shot.
- Is it multi-step reasoning where the path matters? → Few-shot with chain-of-thought.
- None of the above, and you just want speed? → Zero-shot first, escalate to few-shot only if it fails.
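The decision flow above can be written as a small function that stops at the first "yes". This Python sketch is one possible encoding. The flag names are made up for illustration, and the three questions that all lead to few-shot are folded into one check.

```python
def choose_technique(
    generic_task=False,
    open_ended_creative=False,
    custom_format=False,
    own_labels=False,
    needs_consistency=False,
    multi_step_reasoning=False,
):
    """Return the first matching recommendation, checked in the order of the list."""
    if generic_task:
        return "zero-shot"
    if open_ended_creative:
        return "zero-shot (optionally one style-anchor example)"
    if custom_format or own_labels or needs_consistency:
        return "few-shot"
    if multi_step_reasoning:
        return "few-shot with chain-of-thought"
    # Nothing matched: start cheap and escalate only on failure.
    return "zero-shot first, escalate to few-shot if it fails"


print(choose_technique(own_labels=True))
print(choose_technique(multi_step_reasoning=True))
```

Order matters here just as it does in the list: a task flagged as both generic and custom-format returns zero-shot, because the generic check comes first.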
| Task type | Few-shot wins? | Why |
|---|---|---|
| Custom classification (your own categories) | Yes | Categories aren't in pretraining; examples teach them |
| Brand-voice content generation | Yes | Voice is easier to show than describe |
| Structured extraction (custom format) | Yes | Examples lock the output shape |
| Translation between specific tones | Yes | Tone variations rarely have universal labels |
| Standard summarization | No (zero-shot) | Pretraining covers summarization patterns well |
| Simple positive/negative sentiment | Marginal | Pretraining handles binary cases; few-shot helps on nuance |
| Code generation from spec | Optional | Frontier models do well zero-shot; few-shot helps with house style |
| Open-ended creative writing | No (zero-shot) | Examples constrain creative range |
| Math word problems | Yes (CoT few-shot) | Showing reasoning chains lifts accuracy substantially |
| Translation (common language pair) | No (zero-shot) | Pretraining covers it |
| Translation (specific glossary/terminology) | Yes | Examples teach the glossary |
If you want a deeper framework for structuring any prompt — examples or not — see our guide to prompt engineering best practices.
How do you write good few-shot examples?
Bad examples can be worse than no examples at all. A messy or unrepresentative example set actively pulls the model in the wrong direction. Follow these rules and your few-shot prompts will outperform almost anything you write off the cuff.
Rule 1: Cover the variation space
If your real input could appear in several shapes, your examples should span those shapes. A sentiment classifier shown only strongly positive and strongly negative examples will fumble neutral, sarcastic, or mixed input. Pick examples that map the boundaries of the task, not just the easy middle.
Review: "Absolutely perfect, exceeded expectations." → positive
Review: "Broke on day one. Avoid." → negative
Review: "Works, but the app keeps crashing." → mixed
Review: "Arrived. Haven't tried it yet." → neutral
Those four examples teach the model that "mixed" and "neutral" are distinct — something a two-example prompt would never convey.
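A quick way to catch coverage gaps is to compare the labels your examples demonstrate against the full label space. Here is a minimal Python sketch. The helper name `missing_labels` is an assumption for illustration.

```python
from collections import Counter


def missing_labels(examples, label_space):
    """Return the labels in label_space that no example demonstrates."""
    seen = Counter(label for _, label in examples)
    return [label for label in label_space if seen[label] == 0]


# A two-example set that only covers the easy extremes.
examples = [
    ("Absolutely perfect, exceeded expectations.", "positive"),
    ("Broke on day one. Avoid.", "negative"),
]

gaps = missing_labels(examples, ["positive", "negative", "mixed", "neutral"])
print(gaps)  # ['mixed', 'neutral']
```

This only checks that each label appears at least once. Whether the examples span the realistic variety of inputs still takes human judgment.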
Rule 2: Match your real input's register
If your real prompts will be casual, your examples should be casual. If your inputs are messy customer emails full of typos, do not feed the model clean, perfectly formatted examples. Min et al. found that the input distribution shown in demonstrations is one of the things models actually rely on (Min et al., 2022). A register mismatch — clean examples, messy real input — quietly degrades quality.
Rule 3: Keep the format identical across examples
If example one ends with → positive and example two ends with Output: positive, you have just taught the model that two formats are acceptable. It may pick either, or blend them. Use one delimiter, one structure, one label style — everywhere.
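Format consistency is easy to check mechanically before a prompt ships. This Python sketch assumes the `Review: "..." → label` layout from Rule 1 with single-word labels. The pattern and the `inconsistent_lines` helper are illustrative, so adapt the regex to your own layout.

```python
import re

# One allowed shape: Review: "<text>" → <single-word label>
ARROW_FORMAT = re.compile(r'^Review: ".+" → \w+$')


def inconsistent_lines(example_lines, pattern=ARROW_FORMAT):
    """Return every example line that does not match the single agreed format."""
    return [line for line in example_lines if not pattern.match(line)]


example_lines = [
    'Review: "Broke on day one. Avoid." → negative',
    'Review: "Works great." Output: positive',  # wrong delimiter
]

print(inconsistent_lines(example_lines))
```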
Rule 4: Order matters (recency bias)
Models tend to weight examples nearer the end of the prompt more heavily. Put your most representative or most important example last, immediately before the real input. If one class is being under-predicted, moving an example of that class to the end often nudges the balance.
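If you keep examples as data, the reordering described here is a one-liner to experiment with. A small Python sketch follows. `move_class_last` is a hypothetical helper, and whether the reorder actually helps is something to verify on your own inputs.

```python
def move_class_last(examples, label):
    """Move the last example carrying `label` to the end; other examples keep their order."""
    idx = max(
        (i for i, (_, lab) in enumerate(examples) if lab == label),
        default=None,
    )
    if idx is None:
        # No example of that class exists, so return an unchanged copy.
        return list(examples)
    reordered = list(examples)
    reordered.append(reordered.pop(idx))
    return reordered


examples = [("a", "positive"), ("b", "neutral"), ("c", "negative")]
print(move_class_last(examples, "neutral"))
```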
Rule 5: Use correct labels for production work
Yes, Min et al. showed random labels barely hurt classification accuracy. But that finding is about a specific setup, and on harder generative or reasoning tasks, wrong labels can genuinely mislead the model. For anything shipping to users, treat correct labels as the default and only relax that for quick experiments. Showing the label space is necessary; showing it correctly is the safe choice.
Copy-paste few-shot templates
Here are battle-tested templates for the four most common few-shot use cases. Swap the bracketed parts for your content.
Custom classification
Classify each [input type] into exactly one of: [category 1], [category 2], [category 3].
Reply with only the category name.
[input 1] → [category 1]
[input 2] → [category 2]
[input 3] → [category 3]
[input 4] → [category 1]
[your real input] →
Brand-voice generation
Voice attributes: [list 5-7 voice traits, e.g. warm, direct, no jargon, short sentences].
Input: [generic prompt 1]
Output (in our voice): [brand-voice rewrite 1]
Input: [generic prompt 2]
Output (in our voice): [brand-voice rewrite 2]
Input: [generic prompt 3]
Output (in our voice): [brand-voice rewrite 3]
Input: [your real input]
Output (in our voice):
Structured extraction (custom JSON)
Extract entities from each text.
Output JSON matching exactly: { "name": "string", "company": "string", "topic": "string" }.
Text: "Sarah Chen, CTO at Acme, called about the Q3 launch."
Output: { "name": "Sarah Chen", "company": "Acme", "topic": "Q3 launch" }
Text: "Meeting with Mike from Globex re: pricing."
Output: { "name": "Mike", "company": "Globex", "topic": "pricing" }
Text: "[your real input]"
Output:
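In a pipeline, it is worth checking that the model's reply really matches the shape the examples taught. Here is a minimal Python sketch, assuming the three-field schema from the template above. `parse_extraction` is an illustrative helper and `sample` stands in for a model reply.

```python
import json

EXPECTED_KEYS = {"name", "company", "topic"}


def parse_extraction(raw_output):
    """Parse a model reply and confirm it has exactly the expected string fields."""
    data = json.loads(raw_output)  # raises ValueError on invalid JSON
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError(f"unexpected shape: {data!r}")
    if not all(isinstance(value, str) for value in data.values()):
        raise ValueError("all fields must be strings")
    return data


# Stand-in for a model reply; taken from the first example in the template.
sample = '{ "name": "Sarah Chen", "company": "Acme", "topic": "Q3 launch" }'
print(parse_extraction(sample)["company"])  # Acme
```

A failed check is a useful signal in its own right: if replies start breaking the schema, the example set probably needs another case.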
Chain-of-thought few-shot (reasoning / math)
Q: A store had 30 apples. Sold 12. Received 20. Sold 15. End-of-day count?
A: Start: 30. After selling 12: 30 - 12 = 18. After receiving 20: 18 + 20 = 38.
After selling 15: 38 - 15 = 23. Answer: 23.
Q: A store had 23 apples. Sold 15. Received 38. Sold 27. End-of-day count?
A:
The chain-of-thought version is the one that moved benchmarks in Wei et al., 2022. Notice the example does not just give the answer — it walks the arithmetic out loud, which is exactly what you want the model to imitate. If you build these patterns often, our prompt template library guide shows how to save and reuse them so you are not rewriting examples every time.
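Because the worked example ends with "Answer: 23.", completions that imitate it are easy to parse. This Python sketch pulls out the final number. The `completion` string is a hand-written stand-in for a model reply to the second question, not real model output.

```python
import re


def extract_final_answer(completion):
    """Return the number after the last 'Answer:' marker as a string, or None if absent."""
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None


# Hand-written stand-in for a reply that follows the example's reasoning style.
completion = (
    "Start: 23. After selling 15: 23 - 15 = 8. After receiving 38: 8 + 38 = 46.\n"
    "After selling 27: 46 - 27 = 19. Answer: 19."
)

print(extract_final_answer(completion))  # 19
```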
What is one-shot prompting, and when should you use it?
One-shot prompting sits between zero-shot and few-shot: you give the model exactly one example. It is the right call when:
- You have one strong, representative example and adding more would only dilute it.
- The task is straightforward but the output format needs locking — one example pins the shape.
- You want to anchor tone or style without over-constraining the model's content range.
One-shot is especially good for creative-adjacent work. A single "style anchor" example gives the model a target for voice or structure while still leaving it room to be inventive. Two or three examples in the same scenario would start to flatten the output toward imitation.
Write a product tagline in this style:
Example — Product: noise-cancelling headphones
Tagline: "Silence, on demand."
Now — Product: [your product]
Tagline:
Common few-shot mistakes to avoid
- Too many examples (more than 5). Beyond about five, extra examples usually add token cost and noise without a matching accuracy gain, and the model can have a harder time telling which example is most relevant to your input. Three well-chosen examples carry more signal than eight generic ones.
- Examples that don't match the real input's shape. Clean, polished examples followed by a messy real input cause a pattern mismatch; the model adjusts toward the cleaner examples and mishandles the mess.
- Inconsistent format across examples. Mixed delimiters and label styles confuse the model about what you actually want. Pick one structure and repeat it exactly.
- Skipping few-shot when accuracy matters. Production pipelines almost always benefit from examples. Zero-shot saves authoring time but pays it back in rework and edge-case failures.
- Few-shot for open-ended creative writing. Examples kill the creative range that makes the task valuable. Use zero-shot, or a single soft style anchor, instead.
- Assuming labels don't matter because of one study. The Min et al. result is real but narrow. For shipping work, use correct labels.
Few-shot vs zero-shot on GPT-5 and Claude Opus 4.x
A fair question in 2026: with models this strong, does few-shot still matter? For casual chat, often no — for production AI, absolutely yes.
Frontier models have raised the zero-shot floor dramatically. Tasks that demanded examples in 2020 now work from a plain instruction. If you are typing into a chat window to draft an email or summarize a doc, zero-shot is almost always enough, and reaching for examples just slows you down.
But production systems are a different story. The places few-shot still wins, even on the best models:
- Consistent output shape across thousands of runs. Examples remain the cheapest, most reliable way to enforce a format at scale — more robust than instructions alone.
- Brand-voice matching. Voice is still easier to show than to specify, no matter how capable the model.
- Custom and proprietary classifications. Your taxonomy is not in pretraining; examples teach it.
- Structured-output and tool-use pipelines. In agents and function-calling workflows, examples reinforce schema adherence and teach the model when to call which tool.
- Reasoning under specific house conventions. Chain-of-thought examples align the model's reasoning style with how your team works the problem.
A useful mental model: zero-shot is for the chat window; few-shot is for the pipeline. The more times a prompt will run unattended, the more a few well-chosen examples are worth.
If you are deciding between examples and a fundamentally different approach, our comparison of prompting vs fine-tuning covers when in-context learning stops being enough and retraining starts to pay off.
How few-shot fits into production AI workflows
For real systems — retrieval-augmented generation, agents, batch extraction — few-shot is the workhorse, and it combines well with other techniques:
- Tool use / function calling. A couple of examples showing when and how to call a tool dramatically improves reliability over instructions alone.
- Structured-output mode. Even when you constrain output with a schema, examples reinforce edge-case handling the schema cannot express.
- Self-consistency. Run the same few-shot prompt several times at a moderate temperature and take the majority answer. This is the Self-Consistency method (Wang et al., 2022), which improved chain-of-thought reasoning accuracy by sampling multiple reasoning paths and voting on the most common answer.
These stack. A production extraction step might use few-shot examples and structured output and self-consistency together, each covering a different failure mode.
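The voting step of self-consistency is simple to sketch. In the Python below, `sample_fn` is a hypothetical callable that sends the prompt to a model at a nonzero temperature and returns the extracted final answer. The demo uses canned answers instead of a real model call.

```python
from collections import Counter


def self_consistent_answer(sample_fn, prompt, n_samples=5):
    """Sample several answers and return (majority answer, share of votes it received)."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples


# Canned answers standing in for five sampled completions; not real model output.
canned = iter(["19", "19", "21", "19", "18"])
answer, agreement = self_consistent_answer(lambda p: next(canned), "Q: ...", n_samples=5)
print(answer, agreement)  # 19 0.6
```

The agreement share doubles as a rough confidence signal: a low value suggests the prompt or its examples leave too much room for divergent reasoning.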
A quick comparison of the techniques in this guide
| Technique | Examples needed | Primary benefit | Where it lives |
|---|---|---|---|
| Zero-shot | None | Speed, flexibility | Chat, exploration |
| One-shot | One | Format anchor, light tone control | Creative, quick locking |
| Few-shot | 2-5 | Accuracy, consistency, custom formats | Production, classification, extraction |
| Few-shot + CoT | 2-8 with reasoning | Reasoning accuracy | Math, logic, multi-step tasks |
| Self-consistency | Few-shot + sampling | Robustness via voting | High-stakes reasoning |
A practical workflow to choose between them
You do not need to memorize the theory. Run this loop on your real prompts:
- Take your top 3 daily or production prompts. Run each one zero-shot, then again with 2-3 examples. Compare the outputs side by side.
- Note where examples produced a real lift — usually custom formats, your own categories, or voice. Those prompts become permanent few-shot templates.
- Leave the rest zero-shot. If examples did not improve anything, do not pay the token and maintenance cost.
- Save the winning few-shot patterns somewhere reusable. Any prompt manager handles this; Prompt Architects ships them as one-click presets and lets you store reusable example sets with Global Variables, so you are not re-typing demonstrations every session.
- For classification and structured tasks, default to few-shot. The handful of minutes spent building examples typically pays back many times over in reduced rework.
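The side-by-side comparison in this loop can be given a simple score. This Python sketch measures how often outputs land exactly on an allowed label. The two output lists are placeholder strings made up for illustration, not measured results, and `format_adherence` is a hypothetical helper.

```python
def format_adherence(outputs, allowed_labels):
    """Fraction of outputs that are exactly one allowed label after trimming whitespace."""
    if not outputs:
        return 0.0
    hits = sum(1 for out in outputs if out.strip() in allowed_labels)
    return hits / len(outputs)


LABELS = {"positive", "negative", "neutral"}

# Placeholder strings standing in for saved model outputs; not real results.
zero_shot_outputs = ["Neutral.", "positive", "The review is negative overall."]
few_shot_outputs = ["neutral", "positive", "negative"]

print(format_adherence(zero_shot_outputs, LABELS))
print(format_adherence(few_shot_outputs, LABELS))
```

Run it over your own saved outputs for each variant. A clear gap on your data is the evidence that a prompt deserves a permanent few-shot template.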
The real skill is not "few-shot vs zero-shot" in the abstract — it is knowing which three examples to pick. That judgment comes from running prompts both ways and watching where the demonstrations actually moved the needle. After fifty prompts, you will feel it.
Frequently asked questions
What's the difference between few-shot and zero-shot prompting? Zero-shot prompting asks the model to perform a task using only the instruction — no examples. Few-shot prompting includes 2-5 input-output examples before your real input, letting the model infer the pattern from demonstrations. Few-shot generally produces higher accuracy on nuanced or custom-format tasks; zero-shot is faster to write and uses fewer tokens.
When should I use few-shot prompting? Use few-shot when output style is hard to describe but easy to show (brand voice), the task has a specific custom format (your internal data shape), classification has nuanced categories (your own support-ticket labels), or you need consistent output shape across many runs (bulk content, structured extraction, agents).
When is zero-shot prompting enough? Zero-shot works for well-known tasks like translation, summarization, and simple classification; for exploratory work where examples would constrain creativity; for one-off Q&A; and for most everyday prompting on frontier models like GPT-5 and Claude Opus 4.x, which handle generic instructions well without demonstrations.
How many examples is optimal for few-shot prompting? Two to five examples covers most tasks. The largest jump usually comes from the first one or two examples, with diminishing returns after four or five while token cost keeps rising. For nuanced tasks, 3-5 carefully chosen, diverse examples beat 8 generic ones.
Do the labels in few-shot examples have to be correct? Often, surprisingly, no. Min et al. (2022) found that randomly replacing the labels in demonstrations barely hurt performance across 12 models, including GPT-3 — what matters most is showing the label space, the input distribution, and the output format. Still, for production work you should use correct labels, since wrong labels can hurt on harder, generative tasks.
Does few-shot still matter on GPT-5 and Claude Opus 4.x? Yes, for production AI. Frontier models handle zero-shot well for general tasks, but few-shot still wins on consistent output shape across runs, brand-voice matching, custom classifications, and schema adherence in structured-output and agent pipelines. For chat-window everyday use, zero-shot is usually enough.
What is one-shot prompting? One-shot prompting gives the model exactly one example before your real input. It sits between zero-shot and few-shot. Use it when you have one strong representative example, the task is straightforward but the output format needs locking, or you want to anchor tone without over-constraining the model's range.
Is chain-of-thought few-shot or zero-shot? It can be either. Few-shot chain-of-thought includes worked examples that show step-by-step reasoning, which is how Wei et al. (2022) lifted reasoning benchmarks. Zero-shot chain-of-thought just appends an instruction like "Let's think step by step" with no examples. Reasoning-heavy tasks benefit from showing the reasoning chain, not just the answer.
By Nafiul Hasan — Founder of Prompt Architects, builder of prompt-optimization tooling used across ChatGPT, Claude, and Gemini, writing from hands-on testing of thousands of production prompts. Last updated: June 10, 2026.