TL;DR: Veo 3 prompts work best with a multi-part structure: subject, action, scene, camera, lighting, style, and audio. Google's own guidance names the same building blocks, and the audio block is the single biggest quality lever because Veo 3 generates synchronized native sound from text. JSON-style prompts and reference images unlock character consistency across multiple shots.
What is the best Veo 3 prompt structure?
The best Veo 3 prompt structure layers seven elements in order: subject, action, scene, camera, lighting, style, and audio. Front-load what the shot shows, then describe how it's filmed. Because Veo 3 generates synchronized native audio from text, an explicit audio block — dialogue in quotes plus ambient sound and score — is the single highest-impact addition you can make.
That structure isn't a community guess. It covers the same components Google DeepMind publishes in its official Veo prompt guide, which lists shot framing and motion, style, lighting, character descriptions, location, action, and dialogue as the building blocks of an effective prompt. Google's list runs in a different sequence, so treat the subject-first order as a working convention rather than an official rule. The difference between a flat, stock-looking clip and a genuinely cinematic eight seconds usually comes down to whether you filled in all of those layers or left the model to guess.
This guide breaks down each part of the structure, gives you 25 tested prompt patterns across five categories, and shows the JSON and reference-image techniques that keep a character looking the same across an entire sequence. Everything here applies to both Veo 3 and Veo 3.1, since the prompt grammar is identical between versions.
Why does prompt structure matter so much in Veo 3?
Veo 3 rewards specificity in a way earlier video models did not. Google's guidance is blunt about it: "the more detail you add, the more control you'll have over the final output." The model treats your prompt as a layered set of instructions, and when a layer is missing, it fills the gap with a default — usually a generic one.
Three things make structure especially important here:
- Native audio. Veo 3 is the first mainstream model to generate dialogue, sound effects, and ambient noise natively, synchronized to the picture. According to MindStudio's breakdown of the model, audio and video latents are denoised together at each step, which is why spoken lines land in sync with lip movement. If you don't write audio into the prompt, you're throwing away the model's headline feature.
- A hard token budget. Google's Gemini API video documentation caps the prompt at 1,024 tokens. You can't just keep adding adjectives — you have to spend words where they matter. Structure is how you budget.
- Cost per generation. Veo 3.1 with audio runs up to $0.40 per second of output, per Google's developer announcement and third-party pricing trackers. At eight seconds a clip, sloppy prompts that need five re-rolls get expensive fast. A disciplined structure cuts iterations.
Put simply: structure is the cheapest quality and cost lever you have. It costs nothing and it's the first thing to fix when output disappoints.
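To put rough numbers on the cost point, here is a minimal sketch of the arithmetic. It assumes the $0.40-per-second upper-bound rate quoted above and an eight-second clip; the function name `generation_cost` is made up for illustration, and real pricing depends on the tier you use.

```python
# Rough cost estimate for iterating on a Veo clip.
# Assumption: the "up to $0.40 per second" figure quoted in this guide.
PRICE_PER_SECOND = 0.40  # USD, upper bound; actual rates vary by tier


def generation_cost(seconds: int, attempts: int,
                    price_per_second: float = PRICE_PER_SECOND) -> float:
    """Total spend for `attempts` generations of a clip `seconds` long."""
    return round(seconds * attempts * price_per_second, 2)


one_take = generation_cost(seconds=8, attempts=1)
six_takes = generation_cost(seconds=8, attempts=6)  # first try plus five re-rolls
print(one_take, six_takes)
```

Under those assumptions, a single eight-second take is about $3.20, while a first try plus five re-rolls is about $19.20.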
What are the parts of a Veo 3 prompt?
Here is the working template. Read it top to bottom the way Veo 3 does — subject first, audio last.
| Part | What it controls | Example |
|---|---|---|
| 1. Subject | Who or what is in the shot | "a 30-year-old woman with curly red hair, wearing a long wool coat, holding a leather portfolio" |
| 2. Action | What they're doing | "walking briskly across a wet cobblestone street, glancing back once over her shoulder" |
| 3. Scene | Where, when, weather | "Paris at dusk in late autumn, light rain falling, Notre Dame faintly visible, lamp posts lit" |
| 4. Camera | Lens, framing, movement | "medium close-up tracking shot from her right, 35mm lens, slight handheld feel, 24fps" |
| 5. Lighting | Source, direction, mood | "golden-hour warm light from the west mixing with cool blue streetlamp light, reflections on wet stone" |
| 6. Style | Genre, film stock, era | "cinematic, shot on 35mm film, slight grain, muted teal-and-orange grade" |
| 7. Audio | Dialogue, ambience, score | "footsteps on wet stone, distant traffic hum, faint church bells, sparse melancholic piano" |
You don't always need every row. A product turntable shot doesn't need dialogue. An abstract mood piece may not need a named subject. But keep the order consistent: in practice, details placed early in the prompt tend to dominate the frame, so lead with what the shot shows.
How does each part change the output?
- Subject is the anchor. Vague subjects ("a man," "a car") produce averaged, faceless results. Concrete subjects with age, hair, wardrobe, and props produce repeatable characters.
- Action drives motion and physics. Veo 3 simulates realistic movement, so describe the verb precisely — "sprinting," "ambling," "stumbling" all read differently.
- Scene sets the world. Skip it and the model invents a location, often a bland studio-grey nowhere. Specify time of day and weather and the lighting falls into place naturally.
- Camera is where cinematic descriptors earn their keep. This is the layer most beginners under-use.
- Lighting carries mood. "Golden hour" versus "overcast diffused" versus "neon underlit" produces three completely different emotional registers from the same subject.
- Style tells the model what kind of image it is — claymation, film noir on 35mm, anime, photoreal documentary. Google explicitly calls this out as a core lever.
- Audio is the multiplier. We'll spend a full section on it because it's the part most people skip.
What does a complete Veo 3 prompt look like?
Here's the full template assembled into one copy-pasteable prompt. This is what a finished, production-grade Veo 3 prompt actually reads like:
Subject: A 30-year-old woman with curly red hair, light freckles, wearing a
long charcoal wool coat, holding a leather portfolio.
Action: Walking briskly across a wet cobblestone street, glancing back over
her shoulder once, breath faintly visible in the cold air.
Scene: Paris at dusk in late autumn, light rain falling, Notre Dame just
visible in the background, soft fog, lamp posts lit.
Camera: Medium close-up tracking shot from her right side, 35mm lens,
slight handheld feel, 24fps. Camera moves at her walking speed.
Lighting: Golden-hour warm light from the west mixing with cool blue from
the streetlamps. Reflections on the wet cobblestones.
Style: Cinematic, shot on 35mm film, fine grain, muted teal-and-orange grade.
Audio: Leather shoes on wet stone, distant traffic hum, faint church bells,
a sparse melancholic piano score. She murmurs to herself: "Not yet."
Generate that and you get roughly eight seconds of cinematic video with synchronized audio — footsteps timed to her stride, a murmured line lip-synced to her mouth, ambient city sound underneath. The labeled-block format isn't mandatory, but it helps you audit your own prompt: if a block is empty, you know exactly which lever you left on the table.
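If you assemble these blocks in code, a small helper can render the labeled format and report which blocks are still empty. This is a sketch only: the `ShotPrompt` class and its method names are hypothetical, not part of any Veo or Gemini tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class ShotPrompt:
    """Hypothetical container for the seven-part template, in prompt order."""
    subject: str = ""
    action: str = ""
    scene: str = ""
    camera: str = ""
    lighting: str = ""
    style: str = ""
    audio: str = ""

    def render(self) -> str:
        """Emit 'Label: text' lines, skipping blocks left empty."""
        lines = []
        for f in fields(self):
            value = getattr(self, f.name).strip()
            if value:
                lines.append(f"{f.name.capitalize()}: {value}")
        return "\n".join(lines)

    def empty_blocks(self) -> list[str]:
        """Names of the blocks that were left unfilled."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


shot = ShotPrompt(
    subject="A 30-year-old woman with curly red hair, wearing a long charcoal wool coat.",
    action="Walking briskly across a wet cobblestone street.",
    scene="Paris at dusk in late autumn, light rain falling.",
    camera="Medium close-up tracking shot, 35mm lens, 24fps.",
    audio='Footsteps on wet stone, distant traffic hum. She murmurs: "Not yet."',
)
print(shot.render())
print("Unfilled:", shot.empty_blocks())
```

Here the lighting and style blocks were skipped on purpose, so `empty_blocks()` names exactly the levers still left on the table.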
You don't have to write blocks by hand every time. A tool like Prompt Architects can take a one-line idea and expand it into this seven-part structure automatically, then save it to a reusable library so your best shot templates are one click away. That matters more than it sounds — most of the quality difference between amateur and professional Veo output is just consistency of structure, and a saved template enforces it.
How do you write audio prompts for Veo 3?
This is the section that separates good Veo 3 output from forgettable output. Veo 3 generates native audio synchronized to the picture, and the official prompt guide confirms you should "explicitly integrate audio cues matching the visuals." Most users skip the audio block entirely — which is exactly why their clips look like silent stock footage with a generic hum dropped on top.
Think in three layers. Specify each one separately.
Layer 1: Dialogue
Put spoken lines in quotation marks. This is the format Google recommends, and it's how the model knows to lip-sync.
She says: "I think we should turn back."
A grizzled fisherman mutters, "Storm's coming. Tie her down."
You can also specify delivery: "whispered," "shouted," "trembling," "deadpan." Veo 3 processes audio and video together, so when you write dialogue in quotes the model times the speech to the speaker's mouth movement. The veo3ai team's native audio prompt guide notes that keeping lines short (under roughly 12 words for an eight-second clip) gives the cleanest sync, since the model has to fit the whole line into the available runtime.
Layer 2: Ambience and sound effects
Describe environmental sound and discrete effects. Google's convention is to prefix effects with "SFX:".
SFX: waves crashing, seagulls calling, distant beach voices.
Ambient: low office hum, a printer in the next room, occasional keyboard clicks.
Without an ambience layer, scenes feel acoustically dead — the visual is moving but the soundstage is empty. One sentence is usually enough to fill it.
Layer 3: Score
Specify mood, instrumentation, and tempo for any musical bed.
Soft melancholic piano with sparse strings, slow tempo, building gently.
Skip the score in dialogue-heavy scenes where music would fight the voice. Add it when atmosphere is the point — abstract mood pieces, product hero shots, montages.
| Audio layer | When to use it | Prompt snippet |
|---|---|---|
| Dialogue | Character is speaking on camera | She says: "We have to leave now." |
| Ambience / SFX | Almost every shot — fills the soundstage | SFX: city traffic, distant sirens, light rain |
| Score | Mood-driven shots, montages, commercials | warm acoustic guitar, mid-tempo, hopeful |
A practical rule: never submit a Veo 3 prompt with zero audio cues. Even a single ambience line lifts the result dramatically. If you're building shot templates in a prompt library, bake a default ambience line into each one so you can't forget it.
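The three-layer approach can be sketched as a small helper that builds the audio block and refuses to return an empty one. The function name, the `speaker` parameter, and the 12-word limit (taken from the rough figure mentioned earlier for an eight-second clip) are illustrative assumptions, not rules enforced by Veo.

```python
def audio_block(dialogue=None, sfx=None, score=None, speaker="She", max_words=12):
    """Compose dialogue, SFX, and score into one audio line (hypothetical helper)."""
    parts = []
    if dialogue:
        # Short lines are easier to fit into the clip's runtime.
        if len(dialogue.split()) > max_words:
            raise ValueError(f"dialogue is longer than {max_words} words")
        parts.append(f'{speaker} says: "{dialogue}"')
    if sfx:
        parts.append("SFX: " + ", ".join(sfx) + ".")
    if score:
        parts.append(score.rstrip(".") + ".")
    if not parts:
        # Mirrors the rule of never submitting a prompt with zero audio cues.
        raise ValueError("add at least one audio cue")
    return " ".join(parts)


line = audio_block(
    dialogue="We have to leave now.",
    sfx=["city traffic", "distant sirens", "light rain"],
)
print(line)
```

Raising an error on an empty block is a deliberate design choice here: it makes the "no audio" mistake impossible to ship silently from a template.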
What are 25 tested Veo 3 prompt examples?
Below are 25 prompt patterns grouped into five categories. Each is a starting point — adapt the subject, scene, and audio to your project, but keep the seven-part skeleton intact. These cover the most common real-world use cases for AI video.
Cinematic narrative (5)
- Emotional close-up. A solo character in a quiet moment, eye-level medium close-up, shallow depth of field, soft window light, sparse piano score, a single murmured line of dialogue.
- Two-character dialogue. Shot-reverse-shot framing implied by "over-the-shoulder, 50mm," two named characters with distinct wardrobes, dialogue in quotes for both, room-tone ambience.
- City tracking shot. Third-person follow shot through a crowded street, gimbal-smooth movement at walking pace, golden hour, layered street ambience.
- Slow push-in on an object. Static-to-dolly-in on a key prop (a letter, a ring, a phone), macro detail, dramatic side light, a slow building string note.
- Wide establishing reveal. Crane-up wide shot revealing a landscape or skyline at blue hour, slow camera rise, wind and distant city ambience, a swelling orchestral cue.
Product and commercial (5)
- Hero turntable. Product centered on a slow-rotating turntable, studio softbox lighting, seamless backdrop, 85mm lens, no dialogue, clean ambient pad with a subtle whoosh on the reveal.
- Lifestyle placement. Product in a natural domestic setting (kitchen counter, desk), warm practical lighting, a hand entering frame to use it, cozy room ambience.
- Liquid pour. Slow-motion liquid pouring into a glass, side lighting to catch translucency, 60fps slow-motion feel, crisp pour and fizz SFX.
- Top-down reach. Top-down framing, a hand reaching for the product on a styled flat-lay surface, even soft light, tactile foley as fingers make contact.
- Gradient sweep reveal. Product reveal as a light gradient sweeps across it left to right, dark-to-bright, minimalist studio, a single rising synth tone.
Abstract and mood (5)
- Fabric in wind. Slow-motion silk fabric billowing in wind against a colored gradient, backlit, dreamy soft-focus, an airy ambient drone.
- Particles in light. Dust or particles drifting through colored volumetric light beams, dark background, slow real-time motion, soft pad score.
- Macro surface tension. Extreme macro of a water droplet hitting a surface, high-speed slow motion, ring light, crisp impact SFX with a low resonant tail.
- Cloud time-lapse. Time-lapse of clouds rolling over a horizon, shifting golden-to-blue light, no subject, wind and faint atmospheric score.
- Morphing geometry. Geometric shapes smoothly morphing on a neutral seamless background, clean studio light, minimalist, a rhythmic ambient pulse.
Documentary and interview (5)
- Talking head. Subject seated, eye-level, soft natural window light, slight rack focus to background, room ambience, dialogue in quotes ("When I started, nobody believed it would work.").
- Working hands. Close-up of skilled hands at a craft (pottery, woodworking), shallow focus, warm workshop light, rich tool and material foley.
- Walk-and-talk. Handheld follow shot of a subject walking and speaking to camera, natural daylight, dialogue in quotes, footsteps and street ambience.
- B-roll insert. Detail shots of an environment — textures, signage, objects — no people, gentle camera drift, location-specific ambience.
- Environmental portrait. Subject framed within their setting (a baker in a bakery), wide-to-medium, practical light, ambient workplace sound, optional single line.
Action and kinetic (5)
- Skateboard trick. Low-angle shot of a skateboarder landing a trick, 60fps slow-motion feel, harsh midday sun, wheels-on-concrete and board-clack SFX.
- Sunrise runner. Side-tracking shot of a runner on a track at sunrise, long lens compression, backlit rim light, breathing and footfall ambience, driving score.
- Cooking sequence. Quick cuts of chopping, sizzling, and plating, top-down and macro mix, warm kitchen light, layered cooking SFX (chop, sizzle, scrape).
- Marketplace movement. Top-down or high-angle of crowd movement in a busy market, smooth slow pan, vibrant natural light, dense market ambience.
- Vehicle drive-by. Tracking a car passing camera with motion blur, low angle, overcast diffused light, engine doppler and tire SFX.
Want the full, expanded prompt text for each of these? We keep a continually updated set in our Veo 3 prompt examples library, formatted for one-click copy.
Which camera modifiers work in Veo 3?
Veo 3 was trained on cinematic terminology and parses it directly — you don't need to translate "dolly in" into plain English. The DreamHost Veo 3.1 prompt guide and Google's own examples both lean heavily on standard film vocabulary. Mix two or three modifiers per shot; stacking five contradictory ones is the fastest way to confuse the framing.
| Category | Modifiers that work reliably |
|---|---|
| Framing | wide shot, medium shot, medium close-up, close-up, extreme close-up, two-shot, over-the-shoulder |
| Movement | static, pan left/right, tilt up/down, dolly in/out, tracking shot, handheld, gimbal-smooth, crane up/down, push-in |
| Angle | eye-level, low angle, high angle, Dutch tilt, top-down, worm's-eye |
| Lens | 24mm wide, 35mm standard, 50mm portrait, 85mm telephoto, macro, fisheye |
| Speed / framerate | 24fps cinematic, 60fps slow-motion feel, time-lapse, real-time |
A common pattern that works: one framing modifier + one movement modifier + one lens. For example, "medium close-up, slow dolly in, 50mm." That gives the model a clear, non-contradictory instruction it can execute cleanly.
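One way to enforce the one-framing, one-movement, one-lens pattern is to accept exactly one value per category. The sketch below uses the terms from the table above; the `camera_line` function is a hypothetical helper, and the vocabulary sets are simply this guide's lists, not an official Veo enumeration.

```python
# Vocabulary copied from the modifier table in this guide (not an official list).
FRAMING = {"wide shot", "medium shot", "medium close-up", "close-up",
           "extreme close-up", "two-shot", "over-the-shoulder"}
MOVEMENT = {"static", "pan left", "pan right", "tilt up", "tilt down",
            "dolly in", "dolly out", "tracking shot", "handheld",
            "gimbal-smooth", "crane up", "crane down", "push-in"}
LENS = {"24mm wide", "35mm standard", "50mm portrait", "85mm telephoto",
        "macro", "fisheye"}


def camera_line(framing: str, movement: str, lens: str) -> str:
    """Build a camera block from exactly one modifier per category."""
    for value, allowed, label in (
        (framing, FRAMING, "framing"),
        (movement, MOVEMENT, "movement"),
        (lens, LENS, "lens"),
    ):
        if value not in allowed:
            raise ValueError(f"unknown {label} modifier: {value!r}")
    return f"{framing}, {movement}, {lens}"


print(camera_line("medium close-up", "dolly in", "50mm portrait"))
```

Because each argument takes a single value, combinations such as "wide shot close-up" cannot be expressed at all.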
What about lighting modifiers?
Lighting carries the emotional weight of a shot. Specify source, direction, and mood — and let the scene's time of day do some of the work for you.
| Category | Modifiers |
|---|---|
| Source | natural daylight, golden hour, blue hour, overcast diffused, studio softbox, neon, candlelight, firelight, practicals |
| Direction | front-lit, side-lit, backlit, top-lit, underlit, rim light |
| Mood | warm, cool, high-contrast, low-contrast, moody, ethereal, gritty, cinematic, dreamy |
If you specify "Paris at dusk, light rain" in the scene block, you've already implied a cool, low-contrast, reflective lighting condition. The lighting block then refines it — "golden-hour warmth mixing with cool streetlamp blue" — rather than starting from scratch. Scene and lighting should reinforce each other, never contradict.
How do you keep a character consistent across multiple shots?
Single-shot consistency is easy; the hard problem is keeping the same character looking identical across a five- or ten-shot sequence. Veo 3.1 made real progress here. Per Google's developer announcement, the update delivers "improved character consistency across multiple scenes," plus a feature called Ingredients to Video that lets you guide generation with up to three reference images.
You have three approaches, from least to most reliable:
Approach 1: Prompt repetition (text-only)
Lock the subject description verbatim at the top of every shot prompt. Same name, age, hair, wardrobe, distinguishing features — word for word. This is the lowest-effort method and works acceptably for two or three shots, but drift creeps in over longer sequences.
Approach 2: JSON character object (text-only, more stable)
Define the character once in a structured object and reference it in every shot. This forces consistency by giving the model an identical token block each time:
{
  "character": {
    "name": "Sarah",
    "age": 30,
    "appearance": "curly red hair, shoulder-length, green eyes, light freckles",
    "wardrobe": "long charcoal wool coat, black leather boots, leather portfolio"
  },
  "world": {
    "location": "Paris, autumn dusk, light rain",
    "palette": "warm golden plus cool blue contrast",
    "style": "cinematic 35mm, fine grain"
  }
}
Paste the character and world blocks at the top of each shot prompt, then add only the per-shot action and camera below. Because the appearance tokens are byte-for-byte identical across shots, the model has far less room to drift. This JSON-style approach is also easy to template — see our Veo 3 JSON prompting walkthrough for a fuller breakdown.
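A minimal sketch of that templating step, assuming you keep the character and world definitions as Python dictionaries and serialize them the same way for every shot. The `shot_prompt` function and the `Action:`/`Camera:` layout below the JSON are illustrative choices, not a format Veo requires.

```python
import json

CHARACTER = {
    "name": "Sarah",
    "age": 30,
    "appearance": "curly red hair, shoulder-length, green eyes, light freckles",
    "wardrobe": "long charcoal wool coat, black leather boots, leather portfolio",
}
WORLD = {
    "location": "Paris, autumn dusk, light rain",
    "palette": "warm golden plus cool blue contrast",
    "style": "cinematic 35mm, fine grain",
}


def shot_prompt(action: str, camera: str) -> str:
    """Prefix every shot with the same serialized character and world block."""
    # sort_keys and a fixed indent keep the header identical on every call.
    header = json.dumps({"character": CHARACTER, "world": WORLD},
                        indent=2, sort_keys=True)
    return f"{header}\n\nAction: {action}\nCamera: {camera}"


shot_1 = shot_prompt("walks briskly across a wet cobblestone street",
                     "medium close-up tracking shot, 35mm")
shot_2 = shot_prompt("pauses under a lamp post and looks back",
                     "wide shot, static, 24mm")
print(shot_1)
```

Only the lines after the blank line differ between `shot_1` and `shot_2`; the header text is the same string each time.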
Approach 3: Reference images (most reliable)
For Veo 3.1, the strongest method is Ingredients to Video: supply up to three reference images of your character (and optionally a product or location), and the model anchors generation to them. The Gemini API exposes this through the referenceImages parameter, per the video documentation. Image anchoring beats any amount of text description for faces, because a face is genuinely hard to pin down in words.
| Method | Effort | Consistency | Best for |
|---|---|---|---|
| Prompt repetition | Low | Fair | 2-3 shots, quick tests |
| JSON character object | Medium | Good | 5-10 shot text-only sequences |
| Reference images (Veo 3.1) | Medium | Strong | Faces, products, anything that must match exactly |
For anything where a real face or branded product has to stay identical, reach for reference images. Text-only methods are for speed and prototyping.
What are the most common Veo 3 prompt mistakes?
After enough generations, the failure patterns become predictable. Here are the ones that quietly wreck output, and the fix for each.
- No audio cues. This is the number-one mistake. Visual-only prompts produce silent clips or a generic mismatched hum. Fix: always add at least one ambience line; add dialogue and score when relevant.
- Generic subjects. "A woman walks" gives you faceless, averaged stock footage. Fix: add age, hair, wardrobe, and a prop. Specificity is what makes a subject repeatable.
- Contradictory framing. "Wide shot close-up" or "static handheld tracking shot" confuses the model and produces mush. Fix: pick one framing, one movement, one lens.
- Skipped scene context. No location, no weather, no time of day means the model picks bland defaults — usually a grey studio nowhere. Fix: always specify where, when, and what the weather is.
- One mega-prompt for a whole sequence. Trying to describe five shots in a single block overruns the token budget and muddles every shot. Fix: use one prompt per shot, with a shared JSON character or reference image for consistency.
- Over-stuffing the prompt. Past roughly 300 words (and certainly near the 1,024-token cap), the model starts dropping constraints. Fix: cut redundant adjectives; spend words on the layers that matter.
- Ignoring aspect ratio. Leaving it default gives you 16:9 even when you needed vertical. Fix: specify `9:16 vertical` for social or `1:1 square` for feed explicitly.
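Several of the mistakes above can be caught before spending a generation. The sketch below is a rough lint pass over a labeled-block prompt; the `lint_prompt` function, the contradiction pairs, and the 300-word threshold are assumptions drawn from this list, and word count is only a crude stand-in for the real token count.

```python
import re

REQUIRED_BLOCKS = ("subject", "action", "scene", "camera", "lighting", "style", "audio")
# Pairs that should not appear together in the camera block (illustrative, not exhaustive).
CONTRADICTIONS = (("wide shot", "close-up"), ("static", "handheld"), ("static", "tracking"))
MAX_WORDS = 300  # rough threshold from this guide, not a hard model limit


def lint_prompt(prompt: str) -> list[str]:
    """Return human-readable warnings for a 'Label: text' style prompt."""
    text = prompt.lower()
    warnings = []
    for block in REQUIRED_BLOCKS:
        # A block counts as present only if text follows its label on the same line.
        if not re.search(rf"^{block}:[ \t]*\S", text, flags=re.MULTILINE):
            warnings.append(f"missing {block} block")
    # Only the first line of the camera block is inspected.
    match = re.search(r"^camera:[ \t]*(.*)$", text, flags=re.MULTILINE)
    camera = match.group(1) if match else ""
    for first, second in CONTRADICTIONS:
        if first in camera and second in camera:
            warnings.append(f"contradictory camera terms: {first} + {second}")
    if len(prompt.split()) > MAX_WORDS:
        warnings.append(f"prompt exceeds {MAX_WORDS} words")
    return warnings


draft = "Subject: a woman\nAction: walks\nCamera: wide shot close-up, static handheld tracking shot"
for warning in lint_prompt(draft):
    print(warning)
```

A check like this will not judge whether a subject is specific enough, but it does flag empty blocks and the obvious framing conflicts.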
If you're generating across ChatGPT, Claude, Gemini, Midjourney, and Veo 3 regularly, these mistakes compound. A consistent enhancement workflow — like the one-click structure expansion in Prompt Architects — catches most of them before you spend a generation credit.
What are the technical specs and limits of Veo 3?
Knowing the hard constraints stops you from writing prompts the model can't honor. Here's the current state of Veo 3 and Veo 3.1, drawn from Google's Gemini API documentation.
| Spec | Veo 3 / Veo 3.1 |
|---|---|
| Resolution | 720p (default), 1080p, 4K |
| Clip duration | 4, 6, or 8 seconds (8s required for 1080p/4K and reference-image generations) |
| Frame rate | 24fps |
| Aspect ratios | 16:9 (landscape) and 9:16 (portrait) |
| Prompt length | Up to 1,024 tokens |
| Audio | Native, synchronized (dialogue, SFX, ambience, score) |
| Reference images | Up to 3 (Ingredients to Video, Veo 3.1) |
| Scene extension | Chains clips into videos a minute or longer (Veo 3.1) |
A few practical notes:
- Eight seconds is the working ceiling per clip. To go longer, use scene extension in Veo 3.1, which generates new clips that connect to the final second of the previous one, per Google's announcement.
- Specify aspect ratio in the prompt for anything not 16:9. Write `9:16 vertical for social` or `1:1 square for feed` directly.
- Cost scales with audio and resolution. Audio-on, higher-resolution generations cost more, up to $0.40/second on the top Veo 3.1 tier, so prototype at lower settings and finalize at high.
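The limits in the spec table can be checked before a request is sent. The sketch below encodes only what the table states; `check_settings` and its parameter names are hypothetical and are not the Gemini API's own field names. Since the table lists only 16:9 and 9:16, the sketch flags any other ratio, including 1:1.

```python
ALLOWED_DURATIONS = {4, 6, 8}
ALLOWED_RESOLUTIONS = {"720p", "1080p", "4k"}
ALLOWED_ASPECTS = {"16:9", "9:16"}
MAX_REFERENCE_IMAGES = 3


def check_settings(duration: int, resolution: str = "720p",
                   aspect_ratio: str = "16:9", reference_images: int = 0) -> list[str]:
    """Return a list of conflicts with the spec table (empty list means OK)."""
    problems = []
    res = resolution.lower()
    if duration not in ALLOWED_DURATIONS:
        problems.append(f"duration {duration}s is not one of 4, 6, 8")
    if res not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution {resolution} is not 720p, 1080p, or 4K")
    if aspect_ratio not in ALLOWED_ASPECTS:
        problems.append(f"aspect ratio {aspect_ratio} is not 16:9 or 9:16")
    if reference_images > MAX_REFERENCE_IMAGES:
        problems.append(f"at most {MAX_REFERENCE_IMAGES} reference images are allowed")
    # The table requires 8 seconds for 1080p, 4K, and reference-image generations.
    needs_eight = res in {"1080p", "4k"} or reference_images > 0
    if needs_eight and duration != 8:
        problems.append("1080p, 4K, and reference-image generations require 8 seconds")
    return problems


print(check_settings(duration=6, resolution="4K"))
print(check_settings(duration=8, resolution="1080p", aspect_ratio="9:16", reference_images=3))
```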
What changed between Veo 3 and Veo 3.1?
Veo 3.1 launched on October 15, 2025, according to Google's developer blog. The headline upgrades:
- Richer native audio — more natural conversation and better-synchronized sound effects.
- Better character consistency across multiple scenes.
- Scene extension to build longer videos from connected clips.
- First-and-last-frame control — supply a start and end image and the model generates the transition between them, with audio.
- Ingredients to Video — up to three reference images for character, object, or scene consistency.
Crucially, pricing stayed the same as Veo 3, and the prompt grammar didn't change. Every technique in this guide works on both versions — 3.1 just executes them more reliably.
How do you build a repeatable Veo 3 workflow?
One great clip is luck. A repeatable look is a workflow. Here's the loop that consistently produces professional output:
- Pick a category from the 25 patterns above that matches your shot.
- Fill the seven-part template — subject, action, scene, camera, lighting, style, audio. Don't leave a block empty unless it genuinely doesn't apply.
- Add explicit audio in every layer that fits the shot: dialogue, ambience, score.
- Generate at low settings first (720p, audio optional) to check composition cheaply.
- Iterate one variable at a time. Change only the camera, or only the lighting, between attempts. Changing three things at once means you can't tell what helped.
- Save what works. When a modifier combination reliably produces the look you want, store it as a preset. Your personal library of proven shot templates is the real asset.
- Finalize at high settings (1080p or 4K, 8 seconds, audio on) once the structure is locked.
That "save what works" step is where most people leave value on the table. The difference between a hobbyist and someone shipping consistent video isn't talent — it's a library of battle-tested prompt structures they can reuse and remix. Tools with a save-and-reuse prompt library and global variables (swap the subject across ten saved templates at once) turn that library into a genuine production system. If you also work across image and text models, the same discipline applies — see our AI video prompting fundamentals for the cross-model view.
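The "iterate one variable at a time" step of the loop can be sketched as a helper that clones a saved template and changes a single block per variant. The `single_variable_variants` function and the dictionary layout are illustrative assumptions about how you might store templates.

```python
def single_variable_variants(base: dict[str, str], field: str,
                             options: list[str]) -> list[dict[str, str]]:
    """Copy `base` once per option, changing only `field` in each copy."""
    if field not in base:
        raise KeyError(field)
    return [{**base, field: option} for option in options]


template = {
    "subject": "a runner on a track",
    "camera": "side-tracking shot, long lens",
    "lighting": "backlit rim light at sunrise",
    "audio": "breathing and footfall ambience",
}
variants = single_variable_variants(
    template, "lighting",
    ["golden hour, warm", "overcast diffused", "neon underlit"],
)
for variant in variants:
    print(variant["lighting"])
```

Every variant shares the same subject, camera, and audio text, so any difference in the output can be traced to the lighting change alone.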
Frequently asked questions
What's the best prompt structure for Veo 3? Google recommends a multi-part structure: subject + action + scene/environment + camera + lighting + style + audio. Order matters — front-load the subject and action, then describe how the shot is filmed. Because Veo 3 generates synchronized native audio, an explicit audio block is the single biggest quality lever.
How long should a Veo 3 prompt be? Aim for 120-300 words for most shots, staying under the 1,024-token prompt limit Google enforces. Below 60 words, output drifts toward generic stock footage. For character-consistent multi-shot sequences, switch to JSON-style prompts or reference images rather than one long block.
Does Veo 3 understand camera terminology? Yes. Veo 3 was trained on cinematic descriptors. Wide shot, medium close-up, dolly in, handheld, low angle, Dutch tilt, 35mm lens, and 24fps are all parsed correctly. Mixing two or three camera modifiers per shot is the sweet spot.
How do I get Veo 3 to generate matching audio? Specify audio explicitly in three layers: dialogue in quotation marks, ambient sound or SFX, and score. Veo 3 processes audio and video together during generation, so spoken lines written in quotes get lip-synced to the speaker.
Can Veo 3 maintain character consistency across shots? Yes, more reliably in Veo 3.1. Lock the subject description at the top of every shot prompt, use a JSON character object, or — most reliably — use Ingredients to Video with up to three reference images.
What's the difference between Veo 3 and Veo 3.1? Veo 3.1 launched in October 2025 with richer native audio, better character consistency, scene extension, first-and-last-frame transitions, and reference-image guidance. Prompt structure is identical, so this guide applies to both.
What resolution and length does Veo 3 output? Veo 3 and 3.1 output 4, 6, or 8-second clips at 720p, 1080p, or 4K, at 24fps in 16:9 or 9:16. Eight seconds is required for 1080p, 4K, or reference-image generations. Scene extension chains clips into longer videos.
Why does my Veo 3 video look generic? Usually missing audio cues and vague subjects. Add an explicit audio block, make the subject specific (age, hair, wardrobe, props), and always include a scene with location, weather, and time of day.
By Nafiul Hasan — Founder of Prompt Architects, building AI prompt-enhancement tools used across ChatGPT, Claude, Gemini, Midjourney, and Veo 3. Last updated: June 10, 2026.