title: "Veo 3 Prompt Structure: Complete Guide with 25 Examples (2026)" slug: "21-veo3-prompt-structure" description: "The 6-part Veo 3 prompt structure — subject, action, scene, camera, lighting, audio. 25 tested prompts. Audio sync tricks. JSON character mode." publishedAt: "2026-05-11" updatedAt: "2026-05-11" postNum: 21 pillar: 3 targetKeyword: "veo3 prompt structure" keywords:
- "veo3 prompt structure"
- "veo 3 prompt"
- "google veo 3"
- "ai video prompt"
- "veo3 prompt examples" ogImage: "https://prompt-architects.com/og/21-veo3-prompt-structure.png" author: name: "Nafiul Hasan" role: "Founder, Prompt Architects" url: "https://prompt-architects.com/about" ctaFeature: "video" related: [22, 23, 27] faq:
- q: "What's the best prompt structure for Veo 3?" a: "The 6-part structure: subject + action + scene + camera + lighting + audio. Veo 3 uniquely supports synchronized audio, so explicit audio cues (dialogue, ambience, score) outperform prompts that omit them. Order matters — front-load subject and action; camera and lighting modifiers go in the back half."
- q: "How long should a Veo 3 prompt be?" a: "120-300 words for most shots. Below 60 words, output gets generic. Above 400, the model starts losing track of constraints. For character-consistent multi-shot sequences, use JSON mode rather than a single long prompt."
- q: "Does Veo 3 understand camera terminology?" a: "Yes. Veo 3 trained on cinematic descriptors. Wide shot, medium close-up, dolly in, handheld, low angle, Dutch tilt, 35mm lens — all parsed correctly. Mixing 2-3 camera modifiers per shot is the sweet spot."
- q: "How do I get Veo 3 to generate matching audio?" a: "Specify audio explicitly. Three layers: dialogue ('she says: "hello"'), ambient ('city street sounds, distant traffic'), and score ('soft piano, melancholic'). Don't assume the model will infer audio from visual context — it often won't."
- q: "Can Veo 3 maintain character consistency across shots?" a: "Partially via prompt repetition; reliably via JSON character mode. Lock subject (name, age, wardrobe, distinguishing features) at the top of every shot's prompt. Better: use JSON to define a character object once and reference it across multiple shot prompts."
TL;DR: Veo 3 prompts work best with the 6-part structure: subject + action + scene + camera + lighting + audio. Audio cues are the biggest single quality lever. JSON mode unlocks character consistency.
The 6-part Veo 3 prompt structure
Order matters. The model reads top-down. Front-load what the shot is of; back-load how it's filmed.
| Part | What it does | Example |
|---|---|---|
| 1. Subject | Who or what is in the shot | "a 30-year-old woman with curly red hair, wearing a wool coat" |
| 2. Action | What they're doing | "walking briskly across a wet cobblestone street" |
| 3. Scene | Where, when, weather | "Paris at dusk in autumn, light rain, Notre Dame in background" |
| 4. Camera | Lens, framing, movement | "medium close-up, 35mm lens, slow tracking shot from her right" |
| 5. Lighting | Source, direction, mood | "golden hour warm light, soft glow on cobblestones" |
| 6. Audio | Dialogue, ambience, score | "footsteps on wet stone, distant traffic, soft piano score" |
A complete example
Subject: A 30-year-old woman with curly red hair, wearing a long wool coat,
holding a leather portfolio.
Action: Walking briskly across a wet cobblestone street, glancing back over
her shoulder once.
Scene: Paris at dusk in late autumn, light rain falling, Notre Dame just
visible in the background, soft fog, lamp posts lit.
Camera: Medium close-up tracking shot from her right side, 35mm lens,
slight handheld feel, 24fps. Camera moves at her walking speed.
Lighting: Golden hour warm light from the west, mixing with cool blue
from streetlamps. Reflections on wet cobblestones.
Audio: Sound of leather shoes on wet stone, distant traffic hum,
faint church bells, sparse melancholic piano score.
Result: 8 seconds of cinematic-quality video, audio synchronized.
The audio layer (Veo 3's biggest quality lever)
Veo 3 generates synchronized audio from text. Most users skip the audio block — that's why their videos look like silent stock footage.
Three audio layers worth specifying explicitly:
1. Dialogue
She says: "I think we should turn back."
Veo 3 generates voice synced to lip movement. Specify tone if needed: "whispered", "shouted", "trembling".
2. Ambience
Sound of waves crashing, seagulls calling, distant beach voices.
Without ambience, scenes feel acoustically dead. One sentence is enough.
3. Score
Soft melancholic piano score with sparse strings, slow tempo.
Skip when the scene is dialogue-heavy. Add when atmosphere is the point.
25 tested Veo 3 prompts (categories)
Cinematic narrative (5)
- Solo character moment with emotional close-up + soft score
- Two-character dialogue scene with shot-reverse-shot framing
- Third-person tracking shot through a city
- Slow push-in on a key object
- Wide establishing shot with camera reveal
Product / commercial (5)
- Hero product on rotating turntable with studio lighting
- Lifestyle product placement in domestic setting
- Liquid pour into glass, slow-motion, side lighting
- Hand reaching for product, top-down framing
- Product reveal with light gradient sweep
Abstract / mood (5)
- Slow-motion fabric movement in wind
- Particles drifting through colored light
- Macro of liquid surface tension
- Time-lapse of clouds with shifting light
- Geometric shapes morphing on neutral background
Documentary / interview (5)
- Talking head, eye-level, soft natural light, slight rack focus
- Working hands close-up with ambient workshop sounds
- Walking-and-talking handheld follow shot
- B-roll insert: details of an environment
- Environmental portrait with subject in setting
Action / kinetic (5)
- Skateboard trick, low angle, 60fps suggested
- Runner on track at sunrise, side tracking
- Cooking sequence: chopping, sizzling, plating cuts
- Crowd movement in a marketplace, top-down
- Vehicle drive-by with motion blur
(Full prompt text for each available in our Veo 3 prompt library.)
Camera modifier reference
Veo 3 understands cinematic terminology directly. Mix 2-3 per shot:
| Category | Modifiers that work |
|---|---|
| Framing | wide shot, medium shot, medium close-up, close-up, extreme close-up, two-shot, over-the-shoulder |
| Movement | static, pan left/right, tilt up/down, dolly in/out, tracking shot, handheld, gimbal smooth, crane up/down |
| Angle | eye-level, low angle, high angle, Dutch tilt, top-down, worm's eye |
| Lens | 24mm wide, 35mm standard, 50mm portrait, 85mm telephoto, macro, fisheye |
| Speed | 24fps cinematic, 60fps slow-motion, time-lapse, real-time |
Lighting modifier reference
| Category | Modifiers |
|---|---|
| Source | natural daylight, golden hour, blue hour, overcast diffused, studio softbox, neon, candlelight, firelight |
| Direction | front-lit, side-lit, backlit, top-lit, underlit |
| Mood | warm, cool, high-contrast, low-contrast, moody, ethereal, gritty, cinematic, dreamy |
JSON character mode (multi-shot consistency)
For sequences where the same character appears across shots, define them once in JSON:
{
"character": {
"name": "Sarah",
"age": 30,
"appearance": "curly red hair shoulder-length, green eyes, light freckles",
"wardrobe": "long charcoal wool coat, black leather boots, leather portfolio"
},
"world": {
"location": "Paris, autumn dusk, light rain",
"palette": "warm golden + cool blue contrast"
}
}
Reference {{character}} and {{world}} in subsequent shot prompts. Consistency across 5-10 shots becomes feasible.
Common mistakes that produce generic Veo 3 output
- No audio cues. Visual-only prompts produce silent or mismatched audio.
- Generic subjects. "A woman walks" → faceless, generic. "A 30-year-old woman with curly red hair walks" → specific, repeatable.
- Mixed framing instructions. "Wide shot close-up" confuses the model. Pick one.
- Skipped scene context. No location, no weather, no time of day → output picks defaults.
- Single long prompt for multi-shot sequences. Use JSON character mode + per-shot prompts instead.
Resolution and duration
Veo 3 currently outputs:
- Resolution: up to 1080p (4K on select tiers)
- Duration: 8 seconds default, extendable
- Aspect ratios: 16:9, 9:16, 1:1
Specify aspect ratio in the prompt for vertical / square output: 9:16 vertical for social, 1:1 square for feed.
What to do next
- Pick one of the 25 categories above.
- Adapt the 6-part template with your specifics.
- Add explicit audio (3 layers).
- Generate. Iterate by tightening one variable per attempt.
Track which modifiers reliably produce the look you want — they'll become your personal preset library.