title: "Chain-of-Thought Prompting: Examples and When to Use It (2026)"
slug: "43-chain-of-thought-prompting"
description: "Chain-of-Thought (CoT) prompting forces step-by-step reasoning. Zero-shot vs few-shot CoT. When it lifts accuracy 70%, when it doesn't. Examples included."
publishedAt: "2026-05-26"
updatedAt: "2026-05-26"
postNum: 43
pillar: 5
targetKeyword: "chain of thought prompting"
keywords:
- "chain of thought prompting"
- "chain-of-thought"
- "cot prompting"
- "step by step prompts"
- "reasoning prompts"
ogImage: "https://prompt-architects.com/og/43-chain-of-thought-prompting.png"
author:
  name: "Nafiul Hasan"
  role: "Founder, Prompt Architects"
  url: "https://prompt-architects.com/about"
ctaFeature: "generator"
related: [41, 1, 6]
faq:
- q: "What is chain-of-thought prompting?"
  a: "Chain-of-Thought (CoT) is a prompt engineering technique that forces an LLM to explain its reasoning steps before producing an answer. Two flavors exist. Zero-shot CoT adds 'Let's think step by step' to any prompt. Few-shot CoT includes 1-3 worked examples showing the reasoning chain. Both lift accuracy on multi-step problems significantly — typically 30-70% on math, logic, and code reasoning."
- q: "When does chain-of-thought help and when doesn't it?"
  a: "CoT helps when the task has multiple reasoning steps — math word problems, logical deduction, code debugging, multi-criteria decisions. CoT does NOT help much on simple factual lookups, single-step questions, creative writing, or summarization. Adding CoT to tasks where it doesn't help can produce verbose, slower output without quality gain."
- q: "Is zero-shot or few-shot chain-of-thought better?"
  a: "Few-shot CoT generally produces higher accuracy when you have 1-3 representative examples. Zero-shot CoT is faster to write and works on most modern LLMs (GPT-4o, Claude Opus 4, Gemini 2.5). Use few-shot when accuracy matters more than speed; zero-shot for everyday reasoning tasks."
- q: "Does chain-of-thought work on GPT-5 / Claude Opus 4?"
  a: "Yes, but the effect is smaller than on older models. GPT-5 and Claude Opus 4 internally reason step-by-step on complex tasks even without explicit CoT prompting. CoT still helps when you need the model to show the work (auditability, debugging, education) or when reasoning steps need specific structure."
- q: "Can I use chain-of-thought with other frameworks?"
  a: "Yes. CRAFT + CoT is the standard combination for complex reasoning with style requirements. Add 'Let's think step by step' or 'Show your reasoning' to the Action component of CRAFT. Few-shot examples can be inserted before the CRAFT block."
TL;DR: Chain-of-Thought (CoT) forces the LLM to show reasoning steps before answering. Lifts accuracy 30-70% on multi-step problems. Two variants: zero-shot ("Let's think step by step") and few-shot (worked examples). Wrong tool for simple lookups.
What is Chain-of-Thought prompting?
Chain-of-Thought (CoT) prompting is a technique that forces a language model to break down its reasoning into intermediate steps before producing the final answer. Instead of asking "what's the answer?" and getting a guess, you ask "walk me through the reasoning, then give the answer" and get verifiable logic.
The breakthrough finding (Wei et al., NeurIPS 2022): for multi-step problems, models perform dramatically better when prompted to reason step-by-step — not because they "think harder," but because the act of generating intermediate tokens creates a scaffold the model can build on.
The two CoT variants
Zero-shot CoT
Add a single phrase to any prompt: Let's think step by step.
Question: A store had 23 apples. They sold 15 in the morning,
then received a shipment of 38, then sold 27 in the afternoon.
How many apples do they have at end of day?
Let's think step by step.
Output:
Step 1: Start with 23 apples.
Step 2: Sold 15. Now: 23 - 15 = 8.
Step 3: Received 38. Now: 8 + 38 = 46.
Step 4: Sold 27. Now: 46 - 27 = 19.
Final answer: 19 apples.
Without CoT, models often skip directly to a wrong number.
Few-shot CoT
Include 1-3 worked examples showing the reasoning chain before asking your question.
Q: A store had 30 apples. Sold 12. Received 20. Sold 15. End-of-day count?
A: Start: 30. After selling 12: 30-12=18. After receiving 20: 18+20=38.
After selling 15: 38-15=23. Answer: 23.
Q: A store had 23 apples. They sold 15 in the morning,
then received a shipment of 38, then sold 27 in the afternoon.
How many apples at end of day?
A:
Few-shot CoT outperforms zero-shot when you can supply representative examples — typically 5-15% higher accuracy on hard problems.
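Assembling a few-shot CoT prompt is just string concatenation over (question, worked solution) pairs. A minimal sketch, assuming the Q:/A: layout shown above; the function name is illustrative, not a library API:

```python
def few_shot_cot(examples: list[tuple[str, str]], question: str) -> str:
    """Build a few-shot CoT prompt from (question, worked solution) pairs,
    ending with the real question and a bare 'A:' for the model to complete."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

Keeping examples as data (rather than hard-coding them into one template string) makes it easy to swap in domain-specific examples per task.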
When CoT helps (and when it doesn't)
| Task | Category | CoT helps? | Why |
|---|---|---|---|
| Math word problems | Math | ✅ Major | Multi-step arithmetic benefits from explicit chains |
| Logic puzzles | Logic | ✅ Major | Premises → conclusion needs scaffolding |
| Code debugging | Debug | ✅ Major | Tracing execution path requires steps |
| Multi-criteria decisions | Decide | ✅ Yes | Weighing factors benefits from explicit listing |
| Symbolic reasoning | Symbol | ✅ Yes | Substitution chains need to be visible |
| Reading comprehension (multi-fact) | Read | ✅ Yes | Pulling multiple facts before answering |
| Single fact lookup | Fact | ❌ No | No steps to chain — output gets verbose for nothing |
| Creative writing | Write | ❌ No | Reasoning structure constrains creative range |
| Summarization | Sum | ⚠️ Sometimes | Helps for analytical summaries; not for compression |
| Translation | Trans | ❌ No | Direct mapping; CoT adds noise |
| Classification (binary) | Class | ⚠️ Marginal | Helps when criteria are nuanced; otherwise overhead |
How CoT works (the mechanism)
LLMs predict one token at a time, conditioned on everything generated so far. When you force the model to generate intermediate reasoning tokens, those tokens become part of the context for predicting the answer token. Effectively, you're using the model's own output as a working memory.
Two implications:
- Bigger models benefit more. Gains from CoT on small models (~1B params) are minimal. On frontier models (GPT-4o, Claude Opus 4, Gemini 2.5), gains are large.
- The reasoning doesn't need to be perfect. Even with errors in the chain, having some explicit reasoning beats no scaffolding. Self-consistency prompting takes advantage of this — generate 5 chains, take majority answer.
CoT variants worth knowing
Self-consistency CoT
Run zero-shot CoT 5+ times at temperature 0.7-1.0. Take the majority answer across runs. Eliminates one-off reasoning errors. Standard for high-stakes math/logic.
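The voting step is trivial to implement. A minimal sketch of the aggregation only — sampling the N chains (at temperature 0.7-1.0) happens in whatever LLM client you use and is omitted here:

```python
from collections import Counter


def majority_answer(answers: list[str]) -> str:
    """Self-consistency: normalize each chain's final answer and return
    the most common one across runs."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]
```

Normalizing before counting matters in practice: "19", "19 " and "19." from different chains should vote as one answer, so add whatever canonicalization (stripping punctuation, parsing numbers) your domain needs.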
Tree-of-Thought (ToT)
Explore multiple reasoning paths in parallel, evaluate each, pick best. Used in research; less practical for everyday use due to cost.
Least-to-Most prompting
Decompose a complex problem into sub-problems, solve each in order. Useful when CoT alone produces long messy chains.
Plan-and-Solve
Two-phase: first prompt asks for a plan, second prompt executes the plan. Good for code generation and structured analysis.
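The two Plan-and-Solve phases map to two prompt builders. A sketch under the assumptions of this post's templates — both function names and the exact wording are illustrative; the point is that the plan from call one is fed verbatim into call two:

```python
def plan_prompt(task: str) -> str:
    """Phase 1: ask only for a numbered plan, no solution yet."""
    return (
        f"Task: {task}\n\n"
        "Before solving, write a numbered plan of the sub-steps you will "
        "take. Do not solve yet."
    )


def solve_prompt(task: str, plan: str) -> str:
    """Phase 2: execute the plan produced in phase 1."""
    return (
        f"Task: {task}\n\nPlan:\n{plan}\n\n"
        "Execute the plan step by step, then give the final output."
    )
```

Splitting planning from execution gives you a natural checkpoint: you can inspect (or even edit) the plan before paying for the longer execution call.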
Templates you can copy
Zero-shot CoT (general)
[Your question]
Let's think step by step. After your reasoning, give the final answer
on a new line starting with "Answer:".
Few-shot CoT (math)
Q: [example problem 1]
A: [step-by-step solution 1]
Answer: [final 1]
Q: [example problem 2]
A: [step-by-step solution 2]
Answer: [final 2]
Q: [your real problem]
A:
CRAFT + CoT (complex task with style)
[CONTEXT] [your context]
[ROLE] [your role]
[ACTION] [your action]. Walk me through your reasoning step by step
before giving the final output.
[FORMAT] [format]. Reasoning under heading "Reasoning"; output under
heading "Final".
[TONE] [tone]
Code debugging CoT
The following code produces [bug]:
[paste code]
Walk me through the execution step by step:
1. What does each line do?
2. Where does the actual behavior diverge from expected?
3. What's the root cause?
4. What's the fix?
Then provide the corrected code.
Common mistakes
- Adding CoT to everything. Verbose output without quality gain on simple tasks. Save it for multi-step problems.
- Using CoT on creative writing. Constrains useful variation. Reasoning structure hurts free-form prose.
- Truncating mid-chain. If the model hits its context or output-token limit mid-reasoning, the result is worse than no CoT. Use models with large context windows (128K+) and generous output limits for long chains.
- Ignoring the answer when reasoning is wrong. If the chain shows flawed logic, the answer is unreliable — even if it sounds correct. Validate.
- Skipping few-shot when accuracy matters. Zero-shot works; few-shot works better. Spend 5 minutes on examples for high-stakes prompts.
CoT in production AI
In production systems, CoT is rarely user-facing. Instead:
- Hidden CoT: model reasons internally, returns only the final answer (GPT-5 reasoning mode, Claude Opus 4 thinking).
- Validated CoT: chain is generated, parsed, validated against a rubric before the answer is shown.
- Logged CoT: chain is stored for debugging/auditability but hidden from users.
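A validated-CoT gate can be as simple as checking that the chain has the structure you asked for before surfacing the answer. A minimal sketch, assuming the "Step N:" / "Answer:" conventions from the templates above; real rubrics are usually richer (arithmetic checks, required keywords, a second-model grader):

```python
import re


def validate_chain(output: str, min_steps: int = 2) -> bool:
    """Gate an answer on its chain: require an 'Answer:' line and at
    least min_steps explicit 'Step N:' lines before showing the result."""
    steps = re.findall(r"^Step \d+:", output, flags=re.MULTILINE)
    has_answer = re.search(r"^Answer:", output, flags=re.MULTILINE) is not None
    return len(steps) >= min_steps and has_answer
```

Outputs that fail the gate get retried or routed to a fallback; the chain itself is logged either way, which covers the auditability case without exposing reasoning to users.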
For chat-window prompting (manual use), explicit CoT is still the highest-leverage technique for math, code, and multi-step reasoning.
What's next
Chain-of-Thought is the gateway technique to:
- Few-shot prompting (broader than just CoT).
- Self-consistency (run CoT N times, vote).
- Tool use / function calling (model reasons, then calls tools).
- Agentic workflows (plan → execute → reflect → repeat).
If CoT clicks, you're 30% of the way to building production AI workflows. Pair it with structured output (JSON mode) and you've got the two highest-leverage techniques in modern prompt engineering.
Tools that ship CoT as one-click presets (Prompt Architects) save the boilerplate — but the skill is recognizing when to deploy CoT vs. when to skip it. That judgment doesn't come from a tool; it comes from running 50 prompts both ways and noting where CoT pays off.