JSON Video Prompt Templates for Veo 3 (Production-Ready, 2026)

TL;DR: JSON video prompts for Veo 3 are not magic syntax the model parses literally — they are a discipline that forces you to specify every dimension a great shot needs. This guide gives you six production-tested JSON templates (hero shots, character-consistent multi-shot, dialogue, a four-clip ad, B-roll, and stylized), a field-by-field reference grounded in Google's official prompt guidance, variable-injection patterns for scaling, and the mistakes that quietly wreck output. Copy, paste, swap placeholders, render.

What is a JSON video prompt for Veo 3, and why use one?

A JSON video prompt for Veo 3 is a structured prompt that organizes your shot into labeled fields — subject, action, scene, camera, lighting, and audio — instead of a single run-on sentence. Veo 3 does not require JSON, but the format forces you to specify every dimension the model needs, makes prompts reusable as templates, and lets you swap variables to generate consistent variants at scale. For multi-shot work, it is the most reliable approach.

That direct answer matters because there is a persistent myth that Veo 3 "speaks JSON" the way an API endpoint does. It does not. Google's own Veo prompt guide is written entirely around descriptive, cinematic sentences, and the official Veo 3.1 prompting guide on Google Cloud recommends a natural-language five-part formula. There is no documented JSON schema the model validates against.

So why does half the creator community swear by JSON video prompts? Because the value is on your side of the keyboard, not the model's. When you write a shot as JSON, you cannot quietly forget to specify the lighting direction, the lens, or the ambient sound — the empty key stares back at you. Veo 3 reads the words inside your JSON and maps them to the same concepts it would extract from prose, but the structure guarantees you actually wrote those words. That is the whole trick, and once you internalize it, the templates below become obvious.

This is the same principle behind structured prompting for any model: the schema is a checklist that happens to be machine-readable.

How does Veo 3 actually read a prompt?

Before the templates, you need an accurate mental model of what Veo 3 is doing, because it changes how you write every field.

Veo 3.1 is a text-to-video and image-to-video model from Google DeepMind that generates short clips with native synchronized audio — dialogue, sound effects, ambience, and score, all produced in one pass. According to the Google Cloud prompting guide, it outputs clips of 4, 6, or 8 seconds at 720p or 1080p in 16:9 or 9:16, with select tiers offering 4K upscaling on Vertex AI and the Gemini API.

The model was trained primarily on natural language paired with video. When you feed it JSON, it does not run a JSON parser — it reads the tokens, including the keys, and treats them as semantic context. The string "lighting": "warm golden hour rim light from camera-right" lands in the model roughly the same way the sentence "the scene is lit by warm golden hour rim light from camera-right" would. The key lighting reinforces intent; the value carries the substance.

Three consequences follow directly:

Vague values produce vague video. "camera": "cinematic" gives Veo 3 almost nothing. "camera": "medium close-up, 50mm, slow dolly in" gives it a shot. The JSON structure does not rescue weak descriptions.
Context does not persist between prompts. Each generation is independent. If shot 3 references "MAYA" but you defined MAYA only in shot 1, the model has no idea who MAYA is in shot 3. You must re-describe.
Audio is first-class, not an afterthought. Because Veo 3 generates audio natively, an empty or missing audio field means the model invents its own soundscape — usually generic. Specify it.

Google's guide frames the ideal prompt as a five-part formula: Cinematography, Subject, Action, Context, and Style & Ambiance. Every JSON template in this article is just that formula expressed as keys, with audio broken out because it deserves its own block.

The field-to-formula map

Here is how the JSON keys you will see below map to Google's official five-part structure, so you know nothing is missing:

Google's five-part element	JSON key(s) used in this guide	What goes here
Cinematography	`camera` (framing, lens, movement)	Shot size, lens, and how the camera moves
Subject	`subject` / `character_lock`	Who or what, with explicit physical descriptors
Action	`action`	What the subject does, in order, with timing cues
Context	`scene` (location, time, weather)	Where and when, plus atmosphere
Style & Ambiance	`lighting`, `style_anchor`, `audio`	Light direction and temperature, visual style, and the full soundscape

If a JSON key has no analog in that table, Veo 3 will still read the words — but the table is your guarantee of completeness.
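As a rough illustration of using the table as a checklist, the sketch below reports which of the five elements have no filled key in a prompt. The mapping mirrors the table above; the function names are hypothetical, and nothing here is a schema that Veo 3 itself validates.

```python
# Hypothetical mapping from Google's five-part formula to the JSON keys
# used in this guide. This is the article's convention, not a Veo 3 schema.
FORMULA_KEYS = {
    "Cinematography": ["camera"],
    "Subject": ["subject", "character_lock"],
    "Action": ["action"],
    "Context": ["scene"],
    "Style & Ambiance": ["lighting", "style_anchor", "audio"],
}


def _is_filled(value):
    """Treat None, blank strings, and empty containers as unfilled."""
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, dict):
        return any(_is_filled(v) for v in value.values())
    if isinstance(value, list):
        return any(_is_filled(v) for v in value)
    return value is not None


def missing_elements(prompt):
    """Return the formula elements that have no filled key in the prompt."""
    return [
        element
        for element, keys in FORMULA_KEYS.items()
        if not any(_is_filled(prompt.get(key)) for key in keys)
    ]


prompt = {
    "subject": "a 40-year-old man, salt-and-pepper beard, charcoal wool coat",
    "action": "lifts the cup, sips, sets it down",
    "camera": {"framing": "low-angle medium", "lens": "35mm", "movement": ""},
    "lighting": "",
}
print(missing_elements(prompt))  # ['Context', 'Style & Ambiance']
```

The check is deliberately lenient: one filled key is enough to count an element as covered, so it catches omissions rather than weak wording.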

What are the core fields every Veo 3 JSON prompt needs?

Every reliable template shares a spine of fields. Get these right and the model has enough to work with; leave them vague and no amount of structure helps.

Subject — Explicit physical descriptors and wardrobe. Age, build, hair, distinguishing features, exact clothing. "A woman" is useless. "A 32-year-old woman with curly auburn hair, freckles, wearing a cream linen blazer over a white t-shirt" is a character.

Action — What happens, in order. Veo 3 handles sequenced action well if you spell it out: "walks toward camera, pauses, looks into lens, half-smiles." Vague verbs like "moves around" produce drift.

Scene — Location, time of day, weather. This is Google's "Context." Sensory specifics ("exposed brick, hanging plants, soft window light") beat generic labels ("a cafe").

Camera — Framing, lens, and movement, as three separate ideas. The official guide lists a working vocabulary: dolly shot, tracking shot, crane shot, aerial view, slow pan, POV shot for movement; wide shot, close-up, extreme close-up, low angle, two-shot for composition; shallow depth of field, wide-angle lens, macro lens, deep focus for the lens.

Lighting — Source, direction, and temperature. "Soft lighting" is meaningless. "Warm golden-hour rim light from camera-right, soft fill from camera-left" is a setup the model can execute.

Audio — Dialogue, ambience, and score, each explicit. Wrap spoken lines in quotation marks. The Google Cloud guide recommends notation like SFX: thunder cracks in the distance and Ambient noise: the quiet hum of a starship bridge. In JSON, those become the audio sub-keys below.

Here is a quick reference you can keep open while writing:

Field	Weak (avoid)	Strong (use)
subject	"a man"	"a 40-year-old man, salt-and-pepper beard, charcoal wool coat"
action	"he does something"	"lifts the cup, sips, sets it down, glances off-frame left"
scene	"outside"	"rain-slicked Tokyo alley at night, neon signage, wet asphalt reflections"
camera	"nice angle"	"low-angle medium, 35mm, slow tracking push-in"
lighting	"good lighting"	"cyan neon key from camera-left, warm practical bounce, hard rim from behind"
audio	(omitted)	"ambience: rain, distant traffic; score: slow synthwave; dialogue: none"
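The weak column can be caught mechanically, at least in its crudest forms. The sketch below is a rough lint under stated assumptions: the phrase list and the four-word minimum are arbitrary starting points of ours, not rules from Google's guide, and a value that passes can still be a poor description.

```python
# Hypothetical filler phrases and threshold; tune both to your own habits.
VAGUE_PHRASES = {"cinematic", "nice angle", "good lighting", "soft lighting", "outside"}
MIN_WORDS = 4  # assumed minimum, not a Veo 3 rule
DESCRIPTIVE_FIELDS = ("subject", "action", "scene", "camera", "lighting")


def _as_text(value):
    """Join nested sub-keys (framing, lens, movement, ...) into one string."""
    if isinstance(value, dict):
        return ", ".join(_as_text(v) for v in value.values())
    return str(value)


def vague_fields(prompt):
    """Return descriptive fields that are missing, stock phrases, or very short."""
    flagged = []
    for field in DESCRIPTIVE_FIELDS:
        text = _as_text(prompt.get(field, "")).strip().lower()
        if text in VAGUE_PHRASES or len(text.split()) < MIN_WORDS:
            flagged.append(field)
    return flagged


weak = {
    "subject": "a man",
    "action": "he does something",
    "scene": "outside",
    "camera": "nice angle",
    "lighting": "good lighting",
}
strong = {
    "subject": "a 40-year-old man, salt-and-pepper beard, charcoal wool coat",
    "action": "lifts the cup, sips, sets it down, glances off-frame left",
    "scene": "rain-slicked Tokyo alley at night, neon signage, wet asphalt reflections",
    "camera": "low-angle medium, 35mm, slow tracking push-in",
    "lighting": "cyan neon key from camera-left, warm practical bounce, hard rim from behind",
}
print(vague_fields(weak))    # ['subject', 'action', 'scene', 'camera', 'lighting']
print(vague_fields(strong))  # []
```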

Now the templates. Each is copy-pasteable. Swap the values, keep the structure.

Template 1: How do I write a hero shot in JSON? (single 8-second clip)

A hero shot is your single best frame — a founder spotlight, a product reveal, a brand opener. It is one subject, one beat, fully controlled.

{
  "subject": {
    "description": "A 32-year-old woman with curly auburn hair, freckles, wearing a cream linen blazer over a white t-shirt and dark jeans",
    "distinguishing_features": "small silver pendant necklace, slight nose ring"
  },
  "action": "walks slowly toward camera, pauses, looks directly into lens with a confident half-smile",
  "scene": {
    "location": "modernist concrete-and-glass office lobby",
    "time": "late afternoon, golden hour",
    "weather": "clear, soft warm light streaming through floor-to-ceiling windows"
  },
  "camera": {
    "framing": "medium close-up, eye-level",
    "lens": "50mm prime",
    "movement": "slow dolly in, ending tight on face"
  },
  "lighting": "warm golden hour rim light from camera-right, soft fill from camera-left, subtle catchlight in eyes",
  "audio": {
    "dialogue": "none",
    "ambience": "soft city ambience, distant footsteps echoing on marble floor",
    "score": "subtle uplifting orchestral swell, building to a held note"
  },
  "duration_seconds": 8,
  "aspect_ratio": "16:9"
}

Use case: Brand hero shot, founder spotlight, product launch lead-in.

Why it works: Every field is specified, so Veo 3 has no gaps to fill with its house aesthetic. The movement value ("slow dolly in, ending tight on face") gives the clip a beginning and end — critical in an 8-second window. The catchlight note in lighting is the kind of detail that separates a flat render from a polished one.

Before committing budget, render a 4-second version of this first. Because Veo 3 bills per second, a short reference render lets you check framing and lighting cheaply before the full 8-second take.

Template 2: How do I keep a character consistent across many shots?

This is the question that breaks most beginners. The answer has two layers.

Layer one: re-describe verbatim, every shot. Veo 3 generates each prompt independently and carries no memory between them. Define your character once as a lock object:

{
  "character_lock": {
    "name": "MAYA",
    "physical": "32, curly auburn hair shoulder-length, freckles, green eyes, 5'7\"",
    "wardrobe": "cream linen blazer, white t-shirt, dark jeans, white sneakers",
    "distinguishing_features": "small silver pendant necklace, slight nose ring, gestures with hands when speaking",
    "voice": "warm mid-range, slight rasp, speaks at measured pace"
  }
}

Then in every shot, paste the full description into subject — do not reference "MAYA" and assume the model remembers:

{
  "shot_id": "shot_03",
  "subject": "MAYA — 32, curly auburn shoulder-length hair, freckles, green eyes, 5'7\", cream linen blazer over white t-shirt, dark jeans, white sneakers, small silver pendant, slight nose ring",
  "action": "sits at a wooden desk with laptop open, types a few words then leans back thinking, taps pen against chin",
  "scene": {
    "location": "minimalist home office, warm wood desk, single Eames chair",
    "time": "mid-morning",
    "weather": "soft overcast light through window"
  },
  "camera": {
    "framing": "wide shot, eye-level",
    "lens": "35mm",
    "movement": "static"
  },
  "lighting": "soft daylight from camera-left window, warm practical lamp on desk",
  "audio": {
    "dialogue": "none",
    "ambience": "soft keyboard typing, distant birds, clock ticking",
    "score": "minimal piano, contemplative mood"
  },
  "duration_seconds": 6,
  "aspect_ratio": "16:9"
}
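Pasting the lock by hand invites drift between shots, so it can help to generate the subject string from the lock object instead. A minimal sketch, assuming the lock uses the keys shown above; the helper names subject_from_lock and apply_lock are hypothetical.

```python
character_lock = {
    "name": "MAYA",
    "physical": "32, curly auburn hair shoulder-length, freckles, green eyes, 5'7\"",
    "wardrobe": "cream linen blazer, white t-shirt, dark jeans, white sneakers",
    "distinguishing_features": "small silver pendant necklace, slight nose ring",
}


def subject_from_lock(lock):
    """Expand a character lock into the full subject text for one shot."""
    parts = [lock["physical"], lock["wardrobe"], lock["distinguishing_features"]]
    return f'{lock["name"]}: ' + ", ".join(parts)


def apply_lock(shot, lock):
    """Return a copy of the shot with the full description in subject."""
    return {**shot, "subject": subject_from_lock(lock)}


shots = [
    {"shot_id": "shot_01", "action": "walks toward camera"},
    {"shot_id": "shot_03", "action": "sits at a wooden desk with laptop open"},
]
locked = [apply_lock(shot, character_lock) for shot in shots]

# Every shot now carries the identical, complete description.
print(locked[0]["subject"] == locked[1]["subject"])  # True
```

Because every shot draws from the same object, a wardrobe change is a one-line edit that propagates to the whole sequence.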

Layer two: use Ingredients to Video. Text re-description gets you close, but the most reliable consistency tool in Veo 3.1 is the Ingredients to Video feature, which the Google Cloud guide describes as supplying reference images of a character, scene, object, or style for consistency across shots. Generate one strong reference image of MAYA — Google's workflow uses Gemini 2.5 Flash Image for this — and attach it to each generation. The text lock plus the image reference together hold a face across a sequence far better than either alone.

This is the single biggest reliability upgrade over the older "just describe carefully" approach. If your project lives or dies on a recurring face, invest in the reference image. For a deeper treatment of consistency across an entire series, see our guide on character-consistent AI video workflows.

Template 3: How do I write a dialogue scene in JSON?

Dialogue is where JSON earns its keep, because speaker attribution and voice direction become unambiguous.

{
  "subject": {
    "description": "Two friends at a coffee shop. PERSON_A: 28, short dark hair, denim jacket. PERSON_B: 30, shaved head, navy hoodie",
    "distinguishing_features": "PERSON_A holds a latte, PERSON_B has a notebook open"
  },
  "action": "PERSON_A leans in, then both react and laugh, raising coffee cups in a small toast",
  "scene": {
    "location": "warm independent coffee shop, exposed brick, hanging plants",
    "time": "weekday morning",
    "weather": "soft natural light from large window"
  },
  "camera": {
    "framing": "two-shot medium, eye-level, slight over-the-shoulder bias toward PERSON_A",
    "lens": "35mm",
    "movement": "subtle handheld, organic slight sway"
  },
  "lighting": "natural window light from camera-right, warm amber bounce from interior",
  "audio": {
    "dialogue": "PERSON_A (excited): \"Wait — that actually worked?\" PERSON_B (laughing): \"Yeah, on the third try.\"",
    "ambience": "espresso machine hiss, distant chatter, soft jazz playing",
    "score": "none, naturalistic"
  },
  "duration_seconds": 8,
  "aspect_ratio": "16:9"
}

Why JSON for dialogue: The dialogue value carries explicit speaker labels, parenthetical delivery notes (excited, laughing), and quotation marks around the spoken words — exactly the notation Google's guide recommends for putting specific speech in a prompt. Veo 3 generates the audio natively and syncs lip movement, so the cleaner your attribution, the cleaner the sync.

A few dialogue rules that hold up in production:

Keep total spoken words short. Eight seconds is roughly two short exchanges. Cramming a monologue in produces rushed, garbled delivery.
Always quote the exact line. Topic-only prompts ("they discuss the project") let the model invent words you cannot control.
Add delivery in parentheses. "(whispering)", "(frustrated)", "(deadpan)" noticeably shift the read.
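The first rule is easy to check before you spend a render. The sketch below counts the words inside quotation marks and compares them with a budget for the clip length. The 2.5 words-per-second rate is our assumption for comfortable speech, not a figure from Google, so treat the result as a warning rather than a verdict.

```python
import re

WORDS_PER_SECOND = 2.5  # assumed speaking rate, not a documented Veo 3 limit


def spoken_words(dialogue):
    """Count words inside double-quoted spans, ignoring speaker labels and notes."""
    quoted = re.findall(r'"([^"]*)"', dialogue)
    return sum(len(re.findall(r"[A-Za-z0-9']+", line)) for line in quoted)


def fits_clip(dialogue, duration_seconds):
    """True if the quoted lines fit the assumed word budget for the clip."""
    return spoken_words(dialogue) <= WORDS_PER_SECOND * duration_seconds


dialogue = 'MAYA (V.O., relieved): "Three ingredients. Nothing else."'
print(spoken_words(dialogue))   # 4
print(fits_clip(dialogue, 8))   # True
```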

Template 4: How do I structure a 30-second ad as stitched clips?

You cannot generate a coherent 30-second narrative in one Veo 3 prompt — the clip ceiling is 8 seconds. You build longer pieces by generating shorter clips and stitching them. JSON shines here because the whole campaign lives in one auditable document.

{
  "campaign": "Spring Skincare Launch",
  "shots": [
    {
      "shot_id": "01_hook",
      "duration_seconds": 6,
      "subject": "MAYA — 32, curly auburn hair, freckles, green eyes — holding a glass skincare bottle to morning light",
      "action": "turns bottle slowly, light catches the liquid, soft smile breaks",
      "scene": "minimalist bathroom, white tile, soft morning light from window",
      "camera": "medium close-up, slow push in, 50mm",
      "lighting": "soft window light from camera-left, golden warm",
      "audio": {
        "dialogue": "MAYA (V.O., warm): \"It started with a question.\"",
        "ambience": "morning birds, soft water trickling",
        "score": "gentle piano building"
      }
    },
    {
      "shot_id": "02_problem",
      "duration_seconds": 6,
      "subject": "MAYA, 32, curly auburn hair, freckles, green eyes, at vanity mirror, examining her face with a disappointed micro-expression",
      "action": "leans in close to mirror, sighs, drops shoulders",
      "scene": "minimalist bathroom, white tile, vanity mirror dominant in frame",
      "camera": "medium, mirror reflection, 35mm, static",
      "lighting": "honest natural light, no flattering tricks",
      "audio": {
        "dialogue": "MAYA (V.O.): \"Why does my skin react to everything?\"",
        "ambience": "muted, tense quiet",
        "score": "piano pauses, single held note"
      }
    },
    {
      "shot_id": "03_solution",
      "duration_seconds": 8,
      "subject": "Bottle with brand label rotating on white surface, ingredient overlay text appearing",
      "action": "bottle rotates 180 degrees, label fully readable, ingredient names fade in over white",
      "scene": "studio white seamless backdrop",
      "camera": "macro lens, rotating subject, continuous",
      "lighting": "even soft studio light, no shadows",
      "audio": {
        "dialogue": "MAYA (V.O., relieved): \"Three ingredients. Nothing else.\"",
        "ambience": "studio quiet",
        "score": "uplifting orchestral swell, building"
      }
    },
    {
      "shot_id": "04_resolution",
      "duration_seconds": 6,
      "subject": "MAYA, 32, curly auburn hair, freckles, green eyes, smiling genuinely now, applying product",
      "action": "applies product to cheek, smiles into mirror, satisfied",
      "scene": "minimalist bathroom, white tile, golden hour, warm and bright",
      "camera": "medium close-up, slow pull back, 50mm",
      "lighting": "warm golden hour light, hopeful",
      "audio": {
        "dialogue": "MAYA (V.O.): \"Finally. Skincare that listens back.\"",
        "ambience": "morning warmth, soft ambient",
        "score": "score resolves to held warm chord, brand sting"
      }
    }
  ],
  "post_production": {
    "stitch": "edit shots in order, dissolve 0.5s between each",
    "color_grade": "warm golden, slightly lifted shadows, brand-aligned palette",
    "end_card": "logo + brand URL, 2 seconds"
  }
}

The narrative arc here is deliberate: hook, problem, solution, resolution. Each shot is independently generatable, the post_production block tells your editor (or you) how to assemble them, and the voiceover threads continuity even when visuals cut. Note the per-shot durations (6, 6, 8, 6): they add up to 26 seconds of footage, or about 28 seconds with the 2-second end card, which fits a 30-second slot with a little room to spare.
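Runtime arithmetic is worth automating once a campaign has more than a handful of shots. A minimal sketch follows. The numeric end_card_seconds and dissolve_seconds fields are hypothetical (the template above stores them as prose), and whether a dissolve shortens the cut depends on how your editor overlaps clips.

```python
campaign = {
    "shots": [
        {"shot_id": "01_hook", "duration_seconds": 6},
        {"shot_id": "02_problem", "duration_seconds": 6},
        {"shot_id": "03_solution", "duration_seconds": 8},
        {"shot_id": "04_resolution", "duration_seconds": 6},
    ],
    "end_card_seconds": 2,    # hypothetical numeric field
    "dissolve_seconds": 0.5,  # hypothetical numeric field
}


def runtime(campaign, overlap_dissolves=True):
    """Return (footage_seconds, final_seconds) for a stitched campaign."""
    shots = campaign["shots"]
    footage = sum(shot["duration_seconds"] for shot in shots)
    cuts = max(len(shots) - 1, 0)
    # A cross-dissolve that overlaps two clips shortens the timeline.
    overlap = cuts * campaign["dissolve_seconds"] if overlap_dissolves else 0
    return footage, footage - overlap + campaign["end_card_seconds"]


print(runtime(campaign))                           # (26, 26.5)
print(runtime(campaign, overlap_dissolves=False))  # (26, 28)
```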

If you want continuity within a single generation rather than across stitched clips, Veo 3.1 also supports timestamp prompting, where you label beats by time range inside one prompt. The Google Cloud guide shows the pattern:

[00:00-00:02] Medium shot from behind explorer pushing jungle vines aside.
[00:02-00:04] Reverse shot of explorer's face showing awe. SFX: rustle of leaves, distant bird calls.
[00:04-00:06] Tracking shot following explorer's hand over stone carvings.
[00:06-00:08] Wide crane shot revealing temple complex. SFX: swelling orchestral score begins.

That is natural language, not JSON, and it is the right tool for a single multi-beat clip. Use stitched JSON shots for anything longer than 8 seconds; use timestamp prompting inside one clip when you want cuts without an editor.
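If you keep beats as data, the timestamp lines can be generated rather than typed. The sketch below formats a list of beats in the bracketed pattern shown above and rejects anything over the 8-second ceiling. The function name and the whole-second beat lengths are assumptions made for this example.

```python
MAX_CLIP_SECONDS = 8  # clip ceiling cited in this article


def timestamp_prompt(beats):
    """Format (seconds, description) beats as [MM:SS-MM:SS] lines."""
    lines, start = [], 0
    for seconds, description in beats:
        end = start + seconds
        if end > MAX_CLIP_SECONDS:
            raise ValueError(f"beats run to {end}s, past the {MAX_CLIP_SECONDS}s ceiling")
        stamp = f"[{start // 60:02d}:{start % 60:02d}-{end // 60:02d}:{end % 60:02d}]"
        lines.append(f"{stamp} {description}")
        start = end
    return "\n".join(lines)


beats = [
    (2, "Medium shot from behind explorer pushing jungle vines aside."),
    (2, "Reverse shot of explorer's face showing awe. SFX: rustle of leaves, distant bird calls."),
    (2, "Tracking shot following explorer's hand over stone carvings."),
    (2, "Wide crane shot revealing temple complex. SFX: swelling orchestral score begins."),
]
print(timestamp_prompt(beats))
```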

Template 5: How do I build a reusable B-roll texture pack?

B-roll is the connective tissue of any edit — abstract pours, textures, motion fills. These are where variable templates pay off most, because you want ten variations of the same idea.

{
  "shot_id": "broll_01",
  "subject": "abstract liquid pour macro",
  "action": "thick honey-colored liquid pours slowly into a clear glass vessel, ripples expanding",
  "scene": "studio, white seamless backdrop",
  "camera": {
    "framing": "extreme macro",
    "lens": "100mm macro",
    "movement": "static, locked"
  },
  "lighting": "soft top light, slight side rake to reveal viscosity",
  "audio": {
    "dialogue": "none",
    "ambience": "subtle pour gurgle",
    "score": "none"
  },
  "duration_seconds": 6,
  "aspect_ratio": "16:9"
}

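Now parameterize it. Replace the liquid color and vessel in action, and the lighting angle in lighting, with three placeholders, then generate ten or more variants by swapping their values: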

{{liquid_color}} — honey-amber, milk-white, deep crimson, matte black ink
{{vessel_type}} — clear glass, frosted beaker, shallow dish, tall flute
{{lighting_angle}} — top rake, hard side, backlit, soft wraparound

Run the matrix and you have a small B-roll library from one template. Because each clip is dialogue-free and short, this is also the cheapest content to produce per second — ideal for testing your pipeline before you spend on hero shots.
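Running the matrix is a Cartesian product over the placeholder values. The sketch below shows one way to do it. Where the placeholders sit inside the template strings is our assumption, and the template is trimmed to three fields for brevity.

```python
from itertools import product

OPTIONS = {
    "liquid_color": ["honey-amber", "milk-white", "deep crimson", "matte black ink"],
    "vessel_type": ["clear glass", "frosted beaker", "shallow dish", "tall flute"],
    "lighting_angle": ["top rake", "hard side", "backlit", "soft wraparound"],
}

# Assumed placement of the placeholders inside the B-roll template.
TEMPLATE = {
    "subject": "abstract liquid pour macro",
    "action": "thick {{liquid_color}} liquid pours slowly into a {{vessel_type}}, ripples expanding",
    "lighting": "{{lighting_angle}} light to reveal viscosity",
}


def fill(text, values):
    """Replace each {{name}} placeholder with its value."""
    for name, value in values.items():
        text = text.replace("{{" + name + "}}", value)
    return text


def variants(template, options):
    """Yield one filled template per combination of option values."""
    names = list(options)
    for combo in product(*(options[name] for name in names)):
        values = dict(zip(names, combo))
        yield {key: fill(text, values) for key, text in template.items()}


all_variants = list(variants(TEMPLATE, OPTIONS))
print(len(all_variants))          # 64
print(all_variants[0]["action"])
```

Four values for each of three placeholders gives 64 combinations, so a ten-clip pack means picking a subset instead of rendering the full grid.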

Template 6: How do I get a stylized, non-photoreal look?

Veo 3 has a strong photoreal bias. You can push it toward illustration, anime, or graphic styles, but you have to anchor the style explicitly.

{
  "subject": "anime-style young woman with long pink hair, large green eyes, wearing white school uniform with red bow",
  "action": "stands on cliff edge, wind blowing her hair, looks toward distant city below, single tear rolls down cheek",
  "scene": {
    "location": "cliff overlooking neon-lit cyberpunk city at night",
    "time": "midnight",
    "weather": "light rain, atmospheric mist"
  },
  "camera": {
    "framing": "medium wide, eye-level, slight tilt up",
    "lens": "anime aesthetic, soft focus background",
    "movement": "slow camera pull back"
  },
  "lighting": "neon city glow from below in pink and cyan, moonlight rim from above",
  "audio": {
    "dialogue": "none",
    "ambience": "rain on cliff, distant city hum, wind",
    "score": "melancholic synthwave, slow tempo, emotional"
  },
  "style_anchor": "anime, illustrated 2D, hand-drawn cel look, painterly backgrounds",
  "duration_seconds": 8,
  "aspect_ratio": "16:9"
}

The style_anchor field is doing heavy lifting. Google's guide explicitly calls out style as one of its seven core prompt elements, listing looks like cartoon, claymation, film noir, and VHS. Naming a recognizable style anchor — and reinforcing it in lens — keeps Veo 3 from defaulting to photoreal mid-clip.

Honest caveat: For anime-heavy or highly stylized projects, dedicated stylization in Kling often holds the look more tightly than Veo 3, which fights its photoreal training. Test both for any project where the art style is the whole point.

How do I turn templates into a production pipeline with variables?

The reason to write JSON at all is leverage: build the template once, generate many. Use placeholder syntax in a skeleton and inject values at runtime.

{
  "subject": "{{character_name}}, {{age}}, {{hair_description}}, wearing {{wardrobe}}",
  "action": "{{primary_action}}",
  "scene": {
    "location": "{{location}}",
    "time": "{{time_of_day}}"
  },
  "camera": "{{camera_directive}}",
  "lighting": "{{lighting_setup}}",
  "audio": {
    "dialogue": "{{character_name}} (V.O., {{tone}}): \"{{line}}\"",
    "ambience": "{{ambience}}",
    "score": "{{score_mood}}"
  },
  "duration_seconds": "{{duration}}",
  "aspect_ratio": "{{aspect}}"
}

Pair the skeleton with a values table and you have a generation matrix. This is where a reusable prompt library with global variables stops being a nice-to-have and starts saving real hours — you store the character lock as a variable once and reference it everywhere.

A simple variant matrix for a social campaign might look like this:

Variant	location	time_of_day	tone	line
A	rooftop garden	golden hour	warm	"Mornings feel different now."
B	sunlit kitchen	early morning	curious	"What if it was actually simple?"
C	quiet studio	blue hour	reflective	"Three ingredients. That's it."

One skeleton, three on-brand clips, zero structure re-typing. Scale that to fifty rows and you have a content week from a single template.
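Injection at runtime can be as small as a recursive string substitution. The sketch below is one possible approach, not any particular tool's implementation: it walks a nested skeleton, fills every {{name}} placeholder from a row of values, and fails loudly when a value is missing. The skeleton is abbreviated and the render function is hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(node, values):
    """Recursively fill {{name}} placeholders in nested dicts, lists, and strings."""
    if isinstance(node, dict):
        return {key: render(value, values) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, values) for item in node]
    if isinstance(node, str):
        # values[...] raises KeyError when a placeholder has no value.
        return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), node)
    return node


skeleton = {
    "scene": {"location": "{{location}}", "time": "{{time_of_day}}"},
    "audio": {"dialogue": '{{character_name}} (V.O., {{tone}}): "{{line}}"'},
}
rows = [
    {"location": "rooftop garden", "time_of_day": "golden hour", "tone": "warm",
     "line": "Mornings feel different now."},
    {"location": "sunlit kitchen", "time_of_day": "early morning", "tone": "curious",
     "line": "What if it was actually simple?"},
]

for row in rows:
    shot = render(skeleton, {"character_name": "MAYA", **row})
    print(shot["audio"]["dialogue"])
```

Filled values come back as strings, so a field such as duration_seconds needs converting to a number afterwards if your pipeline expects one.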

What are the most common Veo 3 JSON prompt mistakes?

These are the errors that quietly degrade output even when the JSON is valid.

Missing or empty audio. Native synchronized audio is Veo 3's biggest differentiator. Leaving the audio block out makes the model invent a generic soundscape. Always specify at least ambience and score mood.
Skipping character re-description in multi-shot. Veo 3 carries no context between prompts. Reference-only ("MAYA does X") with no description means the model invents a new MAYA every time. Paste the full lock into every shot, and add an Ingredients reference image for faces that must match.
Vague camera directives. "Cinematic camera" or "nice angle" produces random results. Always give three things: framing, lens, movement.
No lighting direction. "Soft lighting" is generic. State source, direction, and temperature — "warm key from camera-left, cool rim from behind."
Attempting 30+ second narratives in one prompt. The clip ceiling is 8 seconds. Stitch shorter clips or use Scene Extension; do not fight the limit.
Treating JSON keys as magic. The keys help you stay complete, but Veo 3 reads the values as language. Weak values inside a perfect schema still produce weak video. Write each value as if it were a sentence.
Over-stuffing one shot. Three actions, two speakers, and a complex camera move in 6 seconds will smear. One clear beat per clip.
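Two of these mistakes, the clip ceiling and the empty audio block, are mechanical enough to check before submitting. A minimal sketch, assuming the clip lengths and aspect ratios cited earlier in this article; the shot_problems helper is hypothetical and says nothing about how good the values are.

```python
ALLOWED_DURATIONS = (4, 6, 8)        # clip lengths cited earlier in this article
ALLOWED_ASPECTS = ("16:9", "9:16")   # aspect ratios cited earlier in this article


def shot_problems(shot):
    """Return a list of mechanical problems found in one shot."""
    problems = []
    if shot.get("duration_seconds") not in ALLOWED_DURATIONS:
        problems.append("duration_seconds should be 4, 6, or 8")
    if shot.get("aspect_ratio") not in ALLOWED_ASPECTS:
        problems.append("aspect_ratio should be 16:9 or 9:16")
    audio = shot.get("audio")
    if not isinstance(audio, dict):
        audio = {}
    for key in ("ambience", "score"):
        if not str(audio.get(key, "")).strip():
            problems.append(f"audio.{key} is missing or empty")
    return problems


good = {
    "duration_seconds": 8,
    "aspect_ratio": "16:9",
    "audio": {"dialogue": "none", "ambience": "soft city ambience", "score": "orchestral swell"},
}
bad = {"duration_seconds": 30, "aspect_ratio": "4:3"}

print(shot_problems(good))      # []
print(len(shot_problems(bad)))  # 4
```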

What does Veo 3 cost, and how does that change how you prompt?

Cost discipline is part of prompt craft, because Veo 3 bills by the second. Per the Vertex AI generative pricing page and tier reporting from the Gemini API developer forum, Veo 3.1 is offered in Lite, Fast, and Quality tiers, with the Fast tier in the rough range of $0.10–$0.15 per second and Quality around $0.20–$0.40 per second with audio at 1080p as of mid-2026. Exact rates vary by region, resolution, and whether audio is enabled, so always confirm against the live pricing page before budgeting.

What that means for prompting:

Render reference frames first. Generate a cheap 4-second or Lite-tier test of any hero shot before committing to a full 8-second Quality render. Catch framing and lighting problems when they cost cents, not dollars.
Use Lite for B-roll and iteration. Dialogue-free texture clips do not need the Quality tier. Reserve top-tier spend for hero shots and final deliverables.
Template to avoid re-rolls. Every regenerated clip is paid again. A well-tested JSON template that nails the shot on the first or second try is a direct cost saving, not just a convenience.
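To make the trade-off concrete, the sketch below multiplies seconds, takes, and a per-second rate. The rates are the rough ranges quoted above, hard-coded as placeholders; they are not live prices, and no Lite rate is included because this article does not quote one.

```python
# Assumed USD per-second ranges copied from the rough figures in this section.
RATE_RANGES = {
    "fast": (0.10, 0.15),
    "quality": (0.20, 0.40),
}


def cost_range(seconds, tier, takes=1):
    """Return the (low, high) estimated cost in USD for a number of takes."""
    low, high = RATE_RANGES[tier]
    total_seconds = seconds * takes
    return round(total_seconds * low, 2), round(total_seconds * high, 2)


print(cost_range(4, "fast"))              # (0.4, 0.6)
print(cost_range(8, "quality", takes=3))  # (4.8, 9.6)
```

Under these assumed rates, a 4-second Fast test costs well under a dollar, while three full Quality takes of an 8-second hero shot land somewhere between about five and ten dollars.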

This is also why the discipline of testing prompts before scaling matters more for video than for text — the cost per output is orders of magnitude higher.

How should I organize Veo 3 templates over time?

The creators who ship consistently do not improvise every shoot. They build a library.

Version your templates. Name them hero_v3.json, dialogue_v2.json. When output regresses, diff the JSON like code to find what changed.
Lock references for character series. Pair a text lock with an Ingredients reference image and keep both in the template so any teammate can reproduce the character.
Render reference frames before full renders. A 4-second test render, the shortest clip length available, can save a full-length re-roll.
Build twenty reliable templates. Twenty battle-tested templates covering your real recurring needs beat improvising from scratch every time. Hero, multi-shot, dialogue, ad, B-roll, stylized — start with the six in this article and grow from there.
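Diffing templates like code needs nothing beyond the standard library. The sketch below serializes two versions with sorted keys so the diff shows only real changes. The file names and the two template dicts are made-up examples.

```python
import difflib
import json


def template_diff(old, new, old_name="hero_v2.json", new_name="hero_v3.json"):
    """Return a unified diff of two templates, with keys sorted for stability."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    diff = difflib.unified_diff(old_lines, new_lines, old_name, new_name, lineterm="")
    return "\n".join(diff)


hero_v2 = {"camera": "medium close-up, 50mm, slow dolly in", "lighting": "soft lighting"}
hero_v3 = {"camera": "medium close-up, 50mm, slow dolly in",
           "lighting": "warm key from camera-left, cool rim from behind"}

print(template_diff(hero_v2, hero_v3))
```

Sorting keys before serializing means a reordered template does not show up as a change, which keeps the diff focused on the values that actually affect the render.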

Tools that ship Veo 3 JSON templates as one-click presets, like Prompt Architects, remove the structure-typing for repeated work and let you store character locks as global variables. The six templates above transfer directly — paste them into your tool of choice, swap the placeholders, and start rendering.

What should I do next?

Pick one template that matches your most common need — hero, dialogue, or ad.
Customize it with your character, scene, and brand details.
Render a short reference first to check framing and lighting cheaply.
Render three variants at full length and pick the best.
Save it as your starter template and version it.
Build toward a ten-template library over the next month, adding an Ingredients reference image for any recurring character.

JSON does not make Veo 3 smarter. It makes you complete — and completeness, render after render, is what separates a polished sequence from a lucky one-off.

Frequently asked questions

Why use JSON prompts instead of natural language for Veo 3? JSON prompts give you reuse, reproducibility, and templating. Define a character object once and reference it across ten shots, swap variables for variant generation, and diff prompts like code when output regresses. For one-off cinematic shots, natural language is fine. For series, ads, or multi-shot narratives, structured JSON wins because it forces you to specify every dimension Veo 3 cares about.

Does Veo 3 actually parse JSON syntax literally? Not literally — Veo 3 is trained on natural-language prompts, and Google's own guide recommends descriptive cinematic sentences. But the model reliably extracts the semantic content from well-formed JSON, treating keys as anchors. JSON's real value is for you: it forces completeness and makes templating possible. Paste plain JSON into the prompt field; the keys map to the same concepts as prose.

What fields matter most in a Veo 3 JSON prompt? Subject (explicit physical descriptors and wardrobe), camera (framing, lens, movement), lighting (source, direction, mood), audio (dialogue, ambience, score), and scene (location, time, weather). Action is also critical. Google's five-part formula is Cinematography + Subject + Action + Context + Style/Ambiance, and your JSON keys should cover all five.

How do I keep characters consistent across 5+ shots in Veo 3? Two layers. First, define a character object once with full physical and wardrobe descriptors and re-describe it verbatim in every shot, because Veo 3 does not carry context between prompts. Second, use Veo 3.1's Ingredients to Video feature to supply a reference image of the character, which locks face and styling far more reliably than text alone.

What clip lengths and resolutions does Veo 3 support? Veo 3.1 generates 4, 6, or 8-second clips at 720p or 1080p in 16:9 or 9:16, with native synchronized audio. Longer narratives are built by stitching clips or using Scene Extension. Some tiers add 4K upscaling on Vertex AI and the Gemini API. Do not attempt 30-second narratives in a single prompt.

Should I include audio in every Veo 3 shot? Yes. Native synchronized audio is Veo 3's biggest differentiator over earlier models. Even on dialogue-free shots, specify ambience and score mood. Skipping audio cues forfeits the model's edge. Bare minimum on any clip: ambience plus a one-line score direction, using quotation marks for any spoken dialogue.

How much does Veo 3.1 cost to generate video? Pricing is per second of output and varies by tier. As of mid-2026, Veo 3.1 Fast runs roughly $0.10–$0.15 per second and the Quality tier runs around $0.20–$0.40 per second with audio at 1080p on the Gemini API and Vertex AI. A Lite tier exists for cheap drafts. Because cost scales with seconds, test with short reference renders before committing to full 8-second hero shots.

Can I reuse one JSON template to generate many video variants? Yes, and this is the main reason to template. Build a JSON skeleton with placeholders like {{location}}, {{wardrobe}}, and {{line}}, then inject values at runtime. One well-tested template can generate fifty on-brand variants without you re-typing structure. Tools like Prompt Architects ship Veo 3 JSON templates as one-click presets so you skip the boilerplate entirely.

By Nafiul Hasan — Founder of Prompt Architects, builder of prompt-enhancement tooling for ChatGPT, Claude, Gemini, Veo 3, and Kling, writing from hands-on production testing of structured video prompts. Last updated: June 10, 2026.
