Engineering · Updated June 10, 2026 · 23 min read

Chain-of-Thought Prompting: Examples and When to Use It (2026)

Chain-of-Thought (CoT) prompting forces step-by-step reasoning. Zero-shot vs few-shot CoT. When it lifts accuracy by as much as 61 points, when it doesn't. Examples included.

Nafiul Hasan
Founder, Prompt Architects

TL;DR: Chain-of-thought prompting makes a language model show its reasoning steps before answering. It lifts accuracy sharply on multi-step math, logic, and code tasks, but adds little to simple lookups. Two variants: zero-shot ("Let's think step by step") and few-shot (worked examples). Use it when the work has steps; skip it when it doesn't.

What is chain-of-thought prompting?

Chain-of-thought prompting is a prompt engineering technique that forces a large language model to generate intermediate reasoning steps before giving its final answer, which dramatically improves accuracy on multi-step problems. Instead of asking for an answer and getting a guess, you ask the model to reason through the problem out loud, then conclude. The visible reasoning acts as scaffolding the model builds on, and it lets you audit how the answer was reached.

The technique came from a 2022 Google Research paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", presented at NeurIPS 2022. The authors showed that supplying a handful of step-by-step worked examples made large models far better at arithmetic, commonsense, and symbolic reasoning. The headline result: prompting a 540-billion-parameter PaLM model with just eight chain-of-thought exemplars reached state-of-the-art accuracy on the GSM8K math benchmark, beating a fine-tuned GPT-3 with a verifier.

What makes chain-of-thought prompting so useful is that it is almost free. You do not fine-tune anything, you do not add tools, and you do not need a special model. You change the wording of the prompt, and a task the model used to fail becomes one it gets right. That is why it became one of the first techniques every prompt engineer learns, and why it still matters in 2026 even as reasoning-native models reshape the landscape.

This guide covers what chain-of-thought prompting is, the two main variants, exactly when it helps versus when it hurts, the mechanism behind it, copy-pasteable templates, advanced variants like self-consistency, and how the technique has changed now that models reason internally by default.

Why does chain-of-thought prompting work?

Large language models generate text one token at a time, and every new token is predicted based on everything written before it, including the model's own output. When you force the model to write out reasoning steps, those steps become part of the context the model reads when it predicts the final answer. In effect, the model uses its own generated reasoning as working memory.

Picture the difference. Asked "A store had 23 apples, sold 15, received 38, then sold 27 — how many remain?", a model answering directly has to compress all that arithmetic into a single prediction. It frequently lands on a plausible-but-wrong number. Asked the same question with room to reason, it writes 23 − 15 = 8, then 8 + 38 = 46, then 46 − 27 = 19, and each intermediate result anchors the next. The hard problem becomes a chain of easy ones.
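The arithmetic in that example can be traced in a few lines of Python. This is only an illustration of how each intermediate total feeds the next step; it makes no claim about what happens inside a model, and the variable names are invented for the sketch. It assumes Python 3.8 or later for the `initial` argument.

```python
from itertools import accumulate

# Signed changes from the apple example: sold 15, received 38, sold 27.
start = 23
changes = [-15, +38, -27]

# accumulate() yields every running total, so each step builds on the one
# before it, much like a written-out reasoning chain.
running_totals = list(accumulate(changes, initial=start))

for step, total in enumerate(running_totals):
    print(f"after step {step}: {total} apples")

final_answer = running_totals[-1]
```

The list of running totals plays the role of the visible chain: if any single step were wrong, you could see exactly where.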

Three consequences follow from this mechanism, and each one matters in practice.

  • Scale gates the benefit. Chain-of-thought is an emergent ability. The original research found it only produces meaningful gains on models around 100 billion parameters or larger. Below that, models generate fluent but illogical chains and can perform worse than plain prompting.
  • The reasoning need not be flawless. Even chains with small slips often land on the right answer, because partial scaffolding still beats none. This is exactly what self-consistency exploits, as we will see.
  • More steps cost more tokens. Every reasoning token is generated and billed, and adds latency. On genuinely multi-step problems that is a good trade. On trivial ones it is pure overhead.

That last point is the crux of using chain-of-thought prompting well. It is not a universal upgrade you bolt onto every prompt. It is a tool with a specific job, and the skill is knowing when the job is present.

What are the two types of chain-of-thought prompting?

Chain-of-thought prompting comes in two main flavors that differ in how much you hand the model up front: zero-shot CoT, which adds a single trigger phrase, and few-shot CoT, which supplies complete worked examples. Both produce reasoning chains; they differ in cost to write and in accuracy on the hardest tasks.

Zero-shot chain-of-thought

Zero-shot CoT is the simplest possible version. You add a trigger phrase — most famously "Let's think step by step" — and the model produces reasoning without any examples. This came from a second 2022 paper, "Large Language Models are Zero-Shot Reasoners" by Kojima and colleagues, also at NeurIPS 2022.

The result was striking. Adding "Let's think step by step" to InstructGPT (text-davinci-002) raised MultiArith accuracy from 17.7% to 78.7%, and GSM8K from 10.4% to 40.7%, with no examples and the same single phrase across every task. A five-word addition turned a model that mostly failed MultiArith's grade-school arithmetic into one that mostly passed it.

Here is zero-shot CoT in practice:

Question: A store had 23 apples. They sold 15 in the morning,
then received a shipment of 38, then sold 27 in the afternoon.
How many apples do they have at the end of the day?

Let's think step by step.

Typical output:

Step 1: Start with 23 apples.
Step 2: They sold 15, so 23 - 15 = 8 apples remain.
Step 3: They received 38, so 8 + 38 = 46 apples.
Step 4: They sold 27, so 46 - 27 = 19 apples.

Final answer: 19 apples.

Without the trigger phrase, the same model often jumps straight to an incorrect total.
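If you build prompts in code rather than typing them, the zero-shot pattern is a one-line string operation. The helper below is a minimal sketch; `build_zero_shot_cot` and `TRIGGER` are names made up for this example, not part of any library.

```python
TRIGGER = "Let's think step by step."


def build_zero_shot_cot(question: str, trigger: str = TRIGGER) -> str:
    """Append a reasoning trigger to a question, separated by a blank line."""
    return f"{question.strip()}\n\n{trigger}"


prompt = build_zero_shot_cot(
    "A store had 23 apples. They sold 15 in the morning, "
    "then received a shipment of 38, then sold 27 in the afternoon. "
    "How many apples do they have at the end of the day?"
)
print(prompt)
```

Keeping the trigger as a parameter makes it easy to try alternatives such as "Show your reasoning before answering" without touching the question text.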

Few-shot chain-of-thought

Few-shot CoT includes a handful of complete examples before your real question, each showing both the problem and the full reasoning chain. One to three is usually enough in everyday use; the original paper used eight. The examples teach the model the shape of reasoning you want: how detailed, in what order, in what format.

Q: A store had 30 apples. Sold 12. Received 20. Sold 15. End-of-day count?
A: Start with 30. After selling 12: 30 - 12 = 18.
   After receiving 20: 18 + 20 = 38.
   After selling 15: 38 - 15 = 23.
   Answer: 23.

Q: A store had 23 apples. They sold 15 in the morning,
   then received a shipment of 38, then sold 27 in the afternoon.
   How many apples at the end of the day?
A:

The model continues the pattern, mirroring the structure of your example. Few-shot CoT was the variant in the original Wei et al. paper, and it tends to be the more accurate choice on hard, highly structured tasks where you want the reasoning to follow a precise template.
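The same Q/A layout can be assembled programmatically from stored examples. This is a sketch under assumptions: `WorkedExample` and `build_few_shot_cot` are hypothetical names, and the layout simply mirrors the example above. It needs Python 3.9 or later for the `list[str]` annotations.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    question: str
    steps: list[str]   # one reasoning step per entry
    answer: str


def build_few_shot_cot(examples: list[WorkedExample], question: str) -> str:
    """Render worked examples followed by the real question with an open 'A:'."""
    blocks = []
    for ex in examples:
        reasoning = "\n   ".join(ex.steps)
        blocks.append(f"Q: {ex.question}\nA: {reasoning}\n   Answer: {ex.answer}")
    # The final block is left unanswered so the model continues the pattern.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    WorkedExample(
        question="A store had 30 apples. Sold 12. Received 20. Sold 15. End-of-day count?",
        steps=[
            "Start with 30. After selling 12: 30 - 12 = 18.",
            "After receiving 20: 18 + 20 = 38.",
            "After selling 15: 38 - 15 = 23.",
        ],
        answer="23.",
    )
]
prompt = build_few_shot_cot(
    examples,
    "A store had 23 apples. They sold 15 in the morning, then received a "
    "shipment of 38, then sold 27 in the afternoon. How many apples at the end of the day?",
)
print(prompt)
```

Storing steps as a list rather than one string makes it harder to ship an example whose reasoning is formatted differently from the others.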

Zero-shot vs few-shot: which should you use?

The traditional answer is "few-shot when accuracy matters, zero-shot for everyday speed." That still holds, but the picture has shifted. A 2025 study, "Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot", found that on the strongest current models, well-written zero-shot prompts can match or exceed few-shot ones — partly because poorly chosen examples can actually bias the model toward the wrong pattern.

Dimension | Zero-shot CoT | Few-shot CoT
--- | --- | ---
Effort to write | Minimal: one phrase | Higher: craft 1-3 examples
Accuracy on hard tasks | Strong | Often strongest, if examples are good
Format control | Loose | Tight: examples set the template
Risk | Low | Bad examples can mislead
Token cost per call | Lower | Higher (examples consume context)
Best for | Quick reasoning, modern models | High-stakes, format-sensitive tasks

The practical rule: start with zero-shot. If the output is right but messy, add a format instruction. If it is wrong or the structure keeps drifting, invest five minutes in two clean few-shot examples. For more on getting examples right, see our guide to few-shot prompting.

When should you use chain-of-thought prompting?

Use chain-of-thought prompting when a task genuinely decomposes into multiple reasoning steps — arithmetic, logical deduction, code tracing, or weighing several factors — and skip it when the task is a single lookup, a direct mapping, or open-ended creative work. The deciding question is simple: does solving this require working through intermediate steps? If yes, CoT helps. If the answer is a single retrieval or a vibe, CoT adds cost without benefit.

The table below maps common task types to whether chain-of-thought prompting earns its keep.

Task type | Does CoT help? | Why
--- | --- | ---
Math word problems | Yes, major | Multi-step arithmetic benefits from explicit chains
Logic puzzles & deduction | Yes, major | Premises → conclusion needs visible scaffolding
Code debugging | Yes, major | Tracing execution requires step-by-step thinking
Multi-criteria decisions | Yes | Weighing factors benefits from listing them explicitly
Symbolic / algebraic manipulation | Yes | Substitution chains need to be made visible
Multi-fact reading comprehension | Yes | Pulling and combining several facts before answering
Planning and scheduling | Yes | Ordering dependent steps benefits from a chain
Single-fact lookup | No | No steps to chain; output just gets verbose
Translation | No | Direct mapping; reasoning adds noise
Creative writing | No | Reasoning structure constrains useful variation
Simple binary classification | Marginal | Helps only when criteria are genuinely nuanced
Summarization | Sometimes | Helps analytical summaries, not pure compression

A useful mental model: chain-of-thought prompting trades tokens and latency for accuracy on hard reasoning. When the task is hard and multi-step, that trade is a bargain. When the task is easy, you are paying the cost and getting no accuracy back — and on the newest reasoning models, you may even trigger "overthinking," which we cover next. For a broader view of when each technique fits, our prompt engineering techniques overview maps CoT against the rest of the toolkit.

When does chain-of-thought prompting NOT help?

Chain-of-thought is genuinely the wrong tool in several situations, and forcing it can make results worse, not just slower. Knowing the failure modes is as valuable as knowing the wins.

Single-step factual questions. "What is the capital of France?" has no steps to chain. Adding "Let's think step by step" makes the model pad the answer with filler reasoning it did not need. The answer was never in doubt; you have only made it longer.

Creative writing and free-form prose. Reasoning structure constrains the very variation that makes creative output good. Asking a model to "reason step by step" before writing a poem or a story tends to produce stiff, mechanical text. Creativity wants room, not a logic ladder.

Translation and direct mapping. Translating a sentence is a direct transformation, not a deduction. Chain-of-thought injects noise and can introduce errors by "explaining" a mapping that should just happen.

Simple classification. If the categories are obvious, CoT is overhead. It only helps when the criteria are subtle enough that walking through them changes the call.

Small models. As noted, CoT can hurt models below roughly 100B parameters, which generate confident but broken reasoning.

There is also a modern failure mode that did not exist in 2022: overthinking on reasoning models. A 2024–2025 study, "Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs", documented reasoning models burning huge numbers of tokens on trivial problems like "2 + 3 = ?" with no benefit. On simple datasets, researchers found reasoning length could be compressed by more than 60% with no accuracy loss. In other words, on a reasoning-native model, piling on "think harder" instructions for an easy question is actively wasteful.

How do you write a good chain-of-thought prompt?

A strong chain-of-thought prompt does three things: it asks for reasoning explicitly, it separates the reasoning from the final answer, and it constrains the format just enough to be parseable without strangling the reasoning. Here is a checklist before you ship a CoT prompt.

  1. State the task clearly first. CoT cannot rescue an ambiguous question. Pin down the actual ask before adding the reasoning instruction.
  2. Add an explicit reasoning trigger. "Let's think step by step," "Show your reasoning before answering," or "Work through this carefully." Be direct.
  3. Separate reasoning from answer. Ask for the final answer on its own line, ideally with a label like Answer:, so you can extract it reliably.
  4. Match detail to difficulty. For genuinely hard problems, invite thorough reasoning. For medium ones, ask for "brief reasoning" to avoid bloat.
  5. Use few-shot when structure matters. If you need every answer in the same shape, show one or two examples rather than describing the shape in prose.
  6. Validate the chain, not just the answer. A right-sounding answer on top of broken reasoning is unreliable. If the steps are wrong, distrust the conclusion.
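Items 3 and 6 of the checklist are easier to act on when the reasoning and the labeled answer are pulled apart in code. The following is a minimal sketch, assuming the model was asked to end with a line starting with "Answer:" (or "Final answer:"); the function name and the regular expression are illustrative choices, not a standard.

```python
import re

# Matches a line such as "Answer: 19" or "Final answer: 19 apples."
ANSWER_LINE = re.compile(
    r"^[ \t]*(?:Final answer|Answer)[ \t]*:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_reasoning_and_answer(output: str):
    """Return (reasoning, answer); answer is None if no labeled line exists."""
    matches = list(ANSWER_LINE.finditer(output))
    if not matches:
        return output.strip(), None
    last = matches[-1]  # use the last labeled line in case the chain restates itself
    return output[: last.start()].strip(), last.group(1)


sample_output = (
    "Step 1: Start with 23 apples.\n"
    "Step 2: They sold 15, so 23 - 15 = 8 apples remain.\n"
    "Step 3: They received 38, so 8 + 38 = 46 apples.\n"
    "Step 4: They sold 27, so 46 - 27 = 19 apples.\n"
    "\n"
    "Final answer: 19 apples."
)
reasoning, answer = split_reasoning_and_answer(sample_output)
print(answer)
```

Returning `None` instead of guessing when the label is missing lets a pipeline retry or flag the response rather than parse the wrong number out of the chain.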

Copy-pasteable templates

These are ready to drop into ChatGPT, Claude, or Gemini. Adjust the bracketed parts.

Zero-shot CoT (general purpose):

[Your question]

Let's think step by step. After your reasoning, give the final answer
on a new line starting with "Answer:".

Brief CoT (medium-difficulty tasks, controlled length):

[Your question]

Reason through this in 3-5 short steps, then give the answer.
Keep each step to one sentence.

Few-shot CoT (math and structured problems):

Q: [example problem 1]
A: [step-by-step solution 1]
   Answer: [final answer 1]

Q: [example problem 2]
A: [step-by-step solution 2]
   Answer: [final answer 2]

Q: [your real problem]
A:

Code debugging CoT:

The following code produces [describe the bug]:

[paste code]

Walk through the execution step by step:
1. What does each relevant line do?
2. Where does actual behavior diverge from expected?
3. What is the root cause?
4. What is the fix?

Then provide the corrected code.

Multi-criteria decision CoT:

I need to decide between [option A] and [option B] for [goal].

Reason step by step:
1. List the criteria that matter for this decision.
2. Score each option against each criterion.
3. Weigh the criteria by importance.
4. State your recommendation and the single biggest reason.

If you find yourself rewriting these by hand every session, the Prompt Architects Chrome extension ships CoT structures as one-click presets and saves them to a reusable library so you stop retyping the boilerplate.

What are the advanced chain-of-thought variants?

Once basic CoT clicks, several extensions push accuracy further or handle harder problems. You will not need all of them, but knowing the toolbox lets you reach for the right one.

Self-consistency

Self-consistency is the highest-value upgrade for math and logic. Instead of trusting a single reasoning chain, you sample several chains for the same question (at a non-zero temperature so they differ), then take the majority answer. One-off reasoning slips get outvoted.

The numbers are compelling. The 2022 paper "Self-Consistency Improves Chain of Thought Reasoning in Language Models" reported that self-consistency raised PaLM-540B's GSM8K accuracy from 56.5% to 74.4%, a gain of nearly 18 points over plain CoT. The cost is obvious — you run the prompt several times — so reserve it for high-stakes answers where being wrong is expensive.

[Your problem]

Let's think step by step.

Run that five times, collect the five final answers, and take whichever appears most often.
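The voting step is small enough to sketch directly. In the code below, `sample_answer` stands in for whatever function calls your model at a non-zero temperature and returns the extracted final answer; it is a hypothetical callable, and the canned answers exist only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(sample_answer: Callable[[], Optional[str]], n: int = 5):
    """Sample n answers and return (majority answer, share of the n runs that agree)."""
    answers = [a for a in (sample_answer() for _ in range(n)) if a is not None]
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Stand-in for five model calls: four chains agree, one slips.
canned = iter(["19", "19", "21", "19", "19"])
winner, agreement = self_consistent_answer(lambda: next(canned), n=5)
print(winner, agreement)
```

The agreement share is a useful side product: a narrow majority is a hint that the question deserves a closer look. On a tie, `Counter.most_common` returns the answer that was seen first, so you may prefer to treat ties as "no consensus" in a real pipeline.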

Tree-of-thought

Tree-of-thought generalizes CoT from a single line of reasoning to a branching tree: the model explores several partial reasoning paths, evaluates them, and pursues the most promising. It shines on problems with search or backtracking, like puzzles and planning. It is powerful but expensive and fiddly, so it lives mostly in research and agent frameworks rather than chat-window use.

Least-to-most prompting

Least-to-most decomposes a complex problem into ordered sub-problems, then solves them in sequence, feeding each answer into the next. It is the cure when plain CoT produces one long, tangled chain that loses the thread halfway through. Break the problem; solve the pieces.
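The sequential half of least-to-most can be sketched as a loop that carries earlier answers forward. This sketch assumes the sub-problems are already written out (the full technique also asks the model to produce the decomposition), and `ask` is a hypothetical callable wrapping your model; the stub below only returns canned strings.

```python
from typing import Callable


def least_to_most(subproblems: list[str], ask: Callable[[str], str]) -> list[tuple[str, str]]:
    """Solve sub-problems in order, feeding every earlier Q/A pair into the next prompt."""
    solved: list[tuple[str, str]] = []
    for sub in subproblems:
        context = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in solved)
        answer = ask(f"{context}Q: {sub}\nA:")
        solved.append((sub, answer))
    return solved


# Stub model: records each prompt and returns canned answers.
canned = iter(["8", "46", "19"])
prompts_seen: list[str] = []


def fake_ask(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(canned)


steps = least_to_most(
    [
        "A store had 23 apples and sold 15. How many are left?",
        "It then received 38 more. How many now?",
        "It then sold 27. How many remain?",
    ],
    fake_ask,
)
print(steps[-1][1])
```

Because each prompt contains only short, already-answered pieces, no single call has to hold the whole tangled chain at once.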

Plan-and-solve

Plan-and-solve splits reasoning into two phases: first the model produces a plan, then it executes that plan. It works well for code generation and structured analysis, where having a stated plan keeps the execution honest and gives you a checkpoint to correct before the model commits.
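One way to get the checkpoint described above is to make the two phases separate calls. The sketch below is an assumption-laden illustration: `ask` is a hypothetical model wrapper, `review` is an optional hook where a person or a rule could edit the plan, and the instruction wording is my own rather than taken from any paper.

```python
from typing import Callable, Optional

PLAN_INSTRUCTION = "Write a short numbered plan for solving this. Do not solve it yet."
SOLVE_INSTRUCTION = (
    "Follow the plan above step by step, then give the final answer "
    'on a new line starting with "Answer:".'
)


def plan_and_solve(
    task: str,
    ask: Callable[[str], str],
    review: Optional[Callable[[str], str]] = None,
):
    """Phase 1 asks for a plan; phase 2 executes the (optionally reviewed) plan."""
    plan = ask(f"{task}\n\n{PLAN_INSTRUCTION}")
    if review is not None:
        plan = review(plan)  # checkpoint: correct the plan before execution
    solution = ask(f"{task}\n\nPlan:\n{plan}\n\n{SOLVE_INSTRUCTION}")
    return plan, solution


# Stub model so the example runs without an API.
calls: list[str] = []


def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    if len(calls) == 1:
        return "1. Subtract morning sales.\n2. Add the shipment.\n3. Subtract afternoon sales."
    return "23 - 15 = 8\n8 + 38 = 46\n46 - 27 = 19\nAnswer: 19"


plan, solution = plan_and_solve(
    "A store had 23 apples, sold 15, received 38, then sold 27. How many remain?",
    fake_ask,
)
print(solution)
```

Splitting the calls costs an extra round trip, which is the price of being able to inspect or fix the plan before the model commits to it.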

Variant | Best for | Cost | Complexity
--- | --- | --- | ---
Zero-shot CoT | Everyday reasoning | Low | Trivial
Few-shot CoT | Structured, format-sensitive tasks | Medium | Low
Self-consistency | High-stakes math & logic | High (N runs) | Low
Tree-of-thought | Search, puzzles, planning | Very high | High
Least-to-most | Complex decomposable problems | Medium | Medium
Plan-and-solve | Code generation, structured analysis | Medium | Medium

How does chain-of-thought work with reasoning models like GPT-5 and Claude Opus 4?

On reasoning-native models, much of chain-of-thought happens automatically inside the model, so explicit CoT prompting shifts from "make it reason" to "make the reasoning visible, structured, or auditable." The renewed wave of interest in CoT followed OpenAI's o1 preview, which baked step-by-step reasoning into the model itself, and the pattern has since spread across frontier models. IBM notes that interest in CoT surged again precisely because these reasoning-first models put the technique at their core.

So is explicit CoT obsolete? No — its job changed. Here is where it still earns its place on a reasoning model:

  • Auditability. When you need to see the steps — for debugging, teaching, compliance, or trust — asking for visible reasoning still matters, even if the model would have reasoned internally anyway.
  • Specific structure. If you need reasoning to follow a particular order or rubric, you have to say so. Internal reasoning follows the model's preferences, not yours.
  • Controlling effort. On reasoning models, the new skill is often telling the model to reason less on easy questions to avoid overthinking, not more. "Answer directly" is now a legitimate and useful instruction.

And here is where it now backfires: piling extra "think step by step" instructions onto a reasoning model handling a simple question. You pay for reasoning tokens you do not need and add latency for no accuracy. The 2025 overthinking research makes this concrete — on simple problems, the extra reasoning is largely compressible without losing accuracy.

The takeaway: on a classic model, default to adding CoT for hard tasks. On a reasoning model, default to letting it reason internally, and reach for explicit CoT only when you need the steps visible, structured, or deliberately constrained. For a deeper comparison of model behaviors, our guide to prompting different AI models breaks down how ChatGPT, Claude, and Gemini differ in practice.
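That takeaway can be written down as a small decision function, which is handy if you route prompts to different models in code. Treat it as a rough sketch of the rule of thumb, not a tuned policy; the function name, the three flags, and the exact instruction strings are all assumptions made for illustration.

```python
def choose_reasoning_instruction(
    reasoning_model: bool,
    multi_step: bool,
    need_visible_steps: bool = False,
) -> str:
    """Pick the reasoning instruction to append to a prompt ('' means add nothing)."""
    if need_visible_steps:
        # Auditing, teaching, compliance: ask for the steps on any model.
        return (
            "Show your reasoning step by step, then give the final answer "
            'on a new line starting with "Answer:".'
        )
    if reasoning_model:
        # Let the model reason internally; rein it in on easy questions.
        return "" if multi_step else "Answer directly."
    # Classic model: add CoT only when the task has steps.
    return "Let's think step by step." if multi_step else ""


print(choose_reasoning_instruction(reasoning_model=True, multi_step=False))
print(choose_reasoning_instruction(reasoning_model=False, multi_step=True))
```

Whether a task counts as multi-step is still a judgment call; the function only makes the consequences of that call explicit.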

How do you combine chain-of-thought with prompt frameworks?

Chain-of-thought layers cleanly on top of structured prompt frameworks — you add a reasoning instruction to the framework's action step and place any examples before the main block, getting reasoning quality and consistent tone in one prompt. CoT controls how the model thinks; a framework controls what role, format, and tone it uses. They are complementary, not competing.

Take the CRAFT framework (Context, Role, Action, Format, Tone). To add chain-of-thought, you extend the Action component with a reasoning instruction and use Format to keep reasoning and output cleanly separated:

[CONTEXT] You are reviewing a SaaS pricing change for a B2B product.
[ROLE] Act as a senior pricing analyst.
[ACTION] Recommend whether to move from per-seat to usage-based pricing.
         Walk through your reasoning step by step before the recommendation.
[FORMAT] Put reasoning under a "Reasoning" heading, then the recommendation
         under a "Decision" heading as 3 bullet points.
[TONE] Direct and quantitative.

The framework guarantees the role, structure, and tone; the CoT instruction guarantees the model reasons before it commits. If you need few-shot CoT inside a framework, place the worked examples before the framework block so they prime the pattern without interrupting the structure.
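If you reuse the same framework across many tasks, the five CRAFT fields and the optional reasoning instruction fit naturally into a small data class. This is a sketch only: `CraftPrompt`, `COT_INSTRUCTION`, and the rendering layout are hypothetical and simply mirror the bracketed example above.

```python
from dataclasses import dataclass

COT_INSTRUCTION = "Walk through your reasoning step by step before the recommendation."


@dataclass
class CraftPrompt:
    context: str
    role: str
    action: str
    output_format: str
    tone: str

    def render(self, with_cot: bool = True) -> str:
        """Render the five CRAFT fields, extending Action with a CoT instruction."""
        action = f"{self.action} {COT_INSTRUCTION}" if with_cot else self.action
        return "\n".join(
            [
                f"[CONTEXT] {self.context}",
                f"[ROLE] {self.role}",
                f"[ACTION] {action}",
                f"[FORMAT] {self.output_format}",
                f"[TONE] {self.tone}",
            ]
        )


craft = CraftPrompt(
    context="You are reviewing a SaaS pricing change for a B2B product.",
    role="Act as a senior pricing analyst.",
    action="Recommend whether to move from per-seat to usage-based pricing.",
    output_format='Put reasoning under a "Reasoning" heading, then the recommendation '
    'under a "Decision" heading as 3 bullet points.',
    tone="Direct and quantitative.",
)
print(craft.render())
```

Making the reasoning instruction a flag keeps one definition of the prompt while letting you switch CoT off for simple tasks or reasoning-native models.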

This combination is the workhorse for complex professional tasks: analysis with a required format, recommendations that need visible justification, technical reviews where the reasoning is part of the deliverable. Learn the framework first in our CRAFT prompt framework guide, then layer CoT on once the structure is second nature.

What are common chain-of-thought mistakes to avoid?

Even experienced prompters trip over the same handful of issues. Watch for these.

  1. Adding CoT to everything. The single most common mistake. CoT on simple lookups produces verbose output with zero accuracy gain, and on reasoning models it triggers overthinking. Reserve it for multi-step work.
  2. Using CoT on creative writing. Reasoning structure flattens prose. Let creative tasks breathe; do not make them reason first.
  3. Truncating the chain. If the model hits its output-token limit or runs out of context mid-reasoning, the cut-off output can be worse than no CoT at all. For long chains, raise the output limit or use a model with a large context window.
  4. Trusting the answer when the chain is wrong. A confident, correct-sounding conclusion on top of flawed steps is unreliable. Read the reasoning; if it is broken, distrust the answer regardless of how it sounds.
  5. Skipping few-shot when accuracy is critical. Zero-shot works; for high-stakes, format-sensitive tasks, two clean examples often work better. Spend the five minutes.
  6. Picking misleading few-shot examples. Bad examples bias the model toward bad patterns — which is exactly why 2025 research found zero-shot sometimes beats few-shot. If your examples are not clearly representative, zero-shot may be safer.
  7. Forgetting to label the final answer. Without an explicit "Answer:" line, extracting the conclusion from a long chain is brittle, especially in automated pipelines.

How is chain-of-thought used in production AI systems?

In production, chain-of-thought is usually hidden, logged, or validated rather than shown raw to end users, because exposing every reasoning token is verbose and occasionally reveals internal logic teams prefer to keep private. The pattern differs from chat-window prompting, where you usually want the reasoning visible.

Three common production patterns:

  • Hidden CoT. The model reasons internally and returns only the final answer. This is the default on reasoning models, which generate their chains behind the scenes and surface a clean response.
  • Validated CoT. The chain is generated, then parsed and checked against a rubric or a verifier before the answer is shown. If the reasoning fails the check, the system retries or escalates. Self-consistency is a lightweight form of this — generate several chains and vote.
  • Logged CoT. The chain is stored for debugging, auditing, or analytics but stripped from the user-facing response. This gives engineers a trail to inspect when an answer goes wrong, without cluttering the product.
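For arithmetic-heavy chains, the validated and logged patterns can be combined in a few lines. The sketch below is deliberately narrow: it only re-checks integer steps written as `a + b = c`, `a - b = c`, or `a * b = c` with ASCII operators, and `generate` is a hypothetical callable that returns one full chain per call. A real verifier would need to cover far more than this.

```python
import logging
import re
from typing import Callable, Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cot")

STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def find_bad_steps(chain: str) -> list[str]:
    """Return every 'a op b = c' step in the chain whose arithmetic does not hold."""
    bad = []
    for m in STEP.finditer(chain):
        a, op, b, claimed = m.groups()
        if OPS[op](int(a), int(b)) != int(claimed):
            bad.append(m.group(0))
    return bad


def first_valid_chain(generate: Callable[[], str], max_attempts: int = 3) -> Optional[str]:
    """Regenerate until a chain passes the check; None means escalate."""
    for attempt in range(1, max_attempts + 1):
        chain = generate()
        bad = find_bad_steps(chain)
        # Logged CoT: keep the chain for engineers, not for the end user.
        log.info("attempt %d: %d bad step(s): %s", attempt, len(bad), chain)
        if not bad:
            return chain
    return None


# Stub generator: the first chain contains a slip, the second is clean.
attempts = iter(
    [
        "23 - 15 = 8, 8 + 38 = 46, 46 - 27 = 21. Answer: 21",
        "23 - 15 = 8, 8 + 38 = 46, 46 - 27 = 19. Answer: 19",
    ]
)
chain = first_valid_chain(lambda: next(attempts))
print(chain)
```

The user-facing response would then carry only the extracted answer, while the log keeps the full trail for debugging.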

For everyday manual prompting in a chat window, explicit chain-of-thought remains the highest-leverage single technique for math, code, and any genuinely multi-step problem. You do not need infrastructure to benefit — you need to recognize when a task has steps and ask the model to walk through them.

Chain-of-thought prompting: the bottom line

Chain-of-thought prompting earned its place as a foundational technique because it does something rare in prompt engineering: it reliably converts a class of failures into successes with no fine-tuning and no tooling, only extra tokens. The 2022 research that introduced it showed zero-shot CoT lifting GSM8K accuracy from 10.4% to 40.7% and self-consistency pushing PaLM to 74.4% on the same benchmark, gains that required no change to the model itself.

In 2026, the technique is both essential and more nuanced. Reasoning-native models reason by default, so the question is no longer "should I add CoT to make it think?" but "do I need the thinking visible, structured, or restrained?" The judgment that separates good prompters from great ones is the same as it was in 2022: recognizing which tasks have steps worth chaining, and which do not.

The fastest way to build that intuition is to run the same set of prompts both ways — with chain-of-thought and without — and note where it pays off. Do that across fifty real prompts and you will internalize the boundary. Then chain it with structured output and a framework like CRAFT, and you have the two or three highest-leverage techniques in modern prompt engineering working together. Tools like Prompt Architects ship these structures as one-click presets so you spend your time on the judgment, not the boilerplate.
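A with-and-without comparison like the one described above takes only a short harness. Everything here is a sketch under assumptions: `ask` is a hypothetical wrapper around your model, correctness is judged by a crude substring match, and the stub below is a toy whose output is not evidence of anything.

```python
from typing import Callable


def compare_with_and_without_cot(
    cases: list[tuple[str, str]],
    ask: Callable[[str], str],
    trigger: str = "Let's think step by step.",
) -> list[dict]:
    """Run each (question, expected) pair with and without the CoT trigger."""
    rows = []
    for question, expected in cases:
        plain = ask(question)
        cot = ask(f"{question}\n\n{trigger}")
        rows.append(
            {
                "question": question,
                # Crude check: does the expected answer appear in the output?
                "plain_ok": expected in plain,
                "cot_ok": expected in cot,
            }
        )
    return rows


def accuracy(rows: list[dict], key: str) -> float:
    return sum(r[key] for r in rows) / len(rows)


# Toy stand-in for a model: answers correctly only when the trigger is present.
def fake_ask(prompt: str) -> str:
    if "step by step" in prompt:
        return "23 - 15 = 8, 8 + 38 = 46, 46 - 27 = 19. Answer: 19"
    return "Answer: 21"


rows = compare_with_and_without_cot(
    [("A store had 23 apples, sold 15, received 38, then sold 27. How many remain?", "19")],
    fake_ask,
)
print(accuracy(rows, "plain_ok"), accuracy(rows, "cot_ok"))
```

With real prompts, the rows where `plain_ok` and `cot_ok` disagree are the interesting ones: they show which task types sit on either side of the boundary.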

Frequently asked questions

What is chain-of-thought prompting? Chain-of-thought (CoT) prompting is a technique that forces a large language model to generate intermediate reasoning steps before its final answer. Two variants exist: zero-shot CoT adds a trigger phrase like "Let's think step by step," and few-shot CoT supplies 1-3 worked examples. Both raise accuracy on multi-step math, logic, and code tasks.

When does chain-of-thought help and when doesn't it? CoT helps on tasks with multiple reasoning steps: math word problems, logical deduction, code debugging, and multi-criteria decisions. It adds little to single-fact lookups, translation, simple classification, or creative writing. On very easy questions it can cause "overthinking," producing slower, longer answers with no accuracy gain.

Is zero-shot or few-shot chain-of-thought better? Few-shot CoT usually wins on hard, structured tasks when you have representative worked examples. Zero-shot CoT is faster to write and works well on modern models. A 2025 study found zero-shot can match or beat few-shot on the strongest current models, so test both for your workload.

Does chain-of-thought still work on reasoning models like GPT-5 and Claude Opus 4? Yes, but the gains are smaller. Reasoning models already produce internal chains of thought, so explicit CoT mainly helps when you need visible, auditable steps or a specific reasoning structure. On these models, prompting for endless reasoning on simple questions wastes tokens and latency.

How much does chain-of-thought improve accuracy? In the original 2022 research, zero-shot CoT raised InstructGPT's MultiArith accuracy from 17.7% to 78.7% and GSM8K from 10.4% to 40.7%. Adding self-consistency lifted PaLM-540B on GSM8K from 56.5% to 74.4%. Gains depend heavily on model size and task difficulty.

What is self-consistency in chain-of-thought prompting? Self-consistency samples several independent reasoning chains for the same question, then takes the majority answer instead of trusting a single chain. It cancels out one-off reasoning errors and reliably improves accuracy on math and logic, at the cost of running the prompt multiple times.

Does chain-of-thought work on small language models? Not well. The original Wei et al. study found CoT only produces meaningful gains on models around 100 billion parameters or larger. Smaller models generate fluent but illogical chains, which can make them perform worse than plain prompting.

Can I combine chain-of-thought with other prompt frameworks? Yes. CoT layers cleanly on top of structured frameworks like CRAFT. Add a reasoning instruction such as "Show your reasoning step by step" to the action component, and place any few-shot examples before the main block. This pairs reasoning quality with consistent tone and format.

By Nafiul Hasan — Founder of Prompt Architects, where he builds tooling that turns plain prompts into model-optimized instructions for ChatGPT, Claude, and Gemini. Last updated: June 10, 2026.
