TL;DR: Cinematography is a directable skill, and modern AI video models respect technical terms. Master six fundamentals — framing, lens, light source and direction, camera movement, subject motion, and palette — and your output looks intentional rather than generated. Google's own Veo guide confirms that professional camera, lens, and lighting vocabulary translates directly into footage.
How do you prompt AI video like a filmmaker?
To prompt AI video like a filmmaker, you direct six cinematography fundamentals explicitly in every shot: framing, lens, lighting source and direction, camera movement, subject motion, and color palette. Modern models — Veo 3.1, Kling 3.0, and Sora 2 — parse professional film terms like "35mm lens," "low-angle dolly in," and "golden-hour side light" directly into output, so specifying them turns generic clips into directed footage.
That direct answer is the whole thesis of this guide. The rest is execution. Below, you will learn exactly which terms to use, why they work, how the three major 2026 models differ, and how to assemble a full filmmaker-grade prompt you can copy, paste, and adapt. Throughout, the goal is the same: stop typing scene descriptions and start writing shot lists.
The difference is stark. "A woman walks through Paris" produces a stock-footage feel. The same idea with framing, lens, lighting, motion, and palette specified produces output that reads like a cut from a film. You do not need a film degree to get there: six fundamentals plus a reference vocabulary cover most decisions a working director makes on set.
Why does most AI video look generic?
Most AI video looks generic for one reason: the prompt describes what is in the shot but never how it is shot. The model is left to guess the camera, the lens, the light, and the grade — and its default guesses are flat, evenly lit, mid-distance, and statically framed. That is the visual language of stock footage, not film.
A camera is not a neutral recording device. Every real production makes dozens of deliberate choices before a single frame rolls: where the key light sits, which lens compresses the background, whether the camera observes or moves with the subject. When you skip those choices in a prompt, you are not getting "no style." You are getting the model's average of everything it has ever seen, and an average of everything looks like nothing in particular.
This matters more in 2026 than it did even a year ago, because the models got dramatically better at honoring direction. Google's Veo 3.1 prompting guide explicitly lists dolly shots, crane shots, wide-angle lenses, shallow depth of field, and macro lenses as terms that "translate directly into the generated footage." The capability is there. The limiting factor is now you — your willingness to specify.
The fix is structural, not creative. You do not need a better imagination. You need a checklist. Run every prompt through six fundamentals and the generic problem largely disappears.
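The checklist can be as literal as a few lines of Python. This is a minimal sketch, assuming you keep each shot as a dictionary before turning it into prompt text; the key names and the `missing_fundamentals` helper are illustrative, not a schema any model requires.

```python
# The six fundamentals this guide is built on. The key names are
# illustrative choices for this sketch, not a schema any model requires.
FUNDAMENTALS = (
    "framing",
    "lens",
    "lighting",
    "camera_movement",
    "subject_motion",
    "palette",
)


def missing_fundamentals(shot: dict[str, str]) -> list[str]:
    """Return the fundamentals that are absent or blank in a shot spec."""
    return [name for name in FUNDAMENTALS if not shot.get(name, "").strip()]


shot = {
    "framing": "medium close-up",
    "lens": "85mm",
    "camera_movement": "slow dolly in",
    "palette": "warm gold + cool blue",
}
print(missing_fundamentals(shot))  # ['lighting', 'subject_motion']
```

An empty list means the shot is ready to write up; anything else is a decision you have left to the model.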
What are the six cinematography fundamentals for AI video?
The six fundamentals are framing, lens, lighting (source plus direction), camera movement, subject motion within the frame, and color palette or grade. As a rough rule of thumb, they account for about 80% of how cinematic a shot feels. Specify all six and your output looks directed. Skip any one, especially lighting, and the shot drifts back toward generic.
Here is the map before we go deep on each one:
| # | Fundamental | The decision it makes | Cost of skipping it |
|---|---|---|---|
| 1 | Framing | How much subject and environment are in frame | Mid-distance default; no emphasis |
| 2 | Lens | Spatial relationship between subject and background | Flat perspective; no depth language |
| 3 | Lighting | The single biggest driver of mood and realism | Flat, evenly lit, video-not-film look |
| 4 | Camera movement | Whether the camera observes or participates | Static or random drift |
| 5 | Subject motion | What the people and objects actually do | Stiff, frozen, or unmotivated action |
| 6 | Palette / grade | Emotional temperature of the image | Muddy, neutral color; no tone |
Memorize this table and you have the spine of every prompt. Now let's take each fundamental in order.
How does framing change an AI video shot?
Framing decides how much of the subject and environment occupies the frame, and it carries meaning before anything else happens. A wide shot says "look at this world." A close-up says "look at this person's eyes." Choose framing first because it sets the emotional distance between viewer and subject.
AI video models recognize the standard shot vocabulary, so use it precisely:
| Framing | What it shows | Use when |
|---|---|---|
| Extreme wide shot (EWS) | Vast environment; subject tiny | Establish scale and geography |
| Wide shot (WS) | Full subject in environment | Establish setting plus character |
| Medium shot (MS) | Subject from waist up | Conversation, action |
| Medium close-up (MCU) | Subject from chest up | Default narrative; intimacy without claustrophobia |
| Close-up (CU) | Subject's face fills the frame | Emotion, key moments |
| Extreme close-up (ECU) | Eyes or detail only | Heightened emotion, key object |
| Two-shot | Two subjects in frame | Dialogue scenes |
| Over-the-shoulder (OTS) | One subject's shoulder plus another's face | Conversation reverse-angle |
Two rules keep framing clean. First, pick exactly one framing per shot — "wide shot close-up" is a contradiction that confuses the model. Second, let framing follow intent: if the moment is about a decision on someone's face, you are in CU or ECU; if it is about where they are, you are in WS or EWS. The medium close-up is your safest default for narrative because it reads intimate without feeling trapped.
Which lens should you specify, and why?
The lens decides the spatial relationship between your subject and the background — how compressed, how deep, how much the viewer feels they are standing in the scene versus watching it from a distance. Specifying a focal length like "35mm" produces meaningfully different output than no lens at all, because the model maps focal length to perspective and depth of field.
Two physical facts drive every lens choice. Longer focal lengths compress the background and isolate the subject; shorter ones exaggerate depth and pull the environment in. And depth of field shrinks fast as focal length grows: at the same subject distance and aperture, doubling the focal length cuts depth of field to roughly a quarter of what it was, not half, which is why long lenses give that creamy, isolated portrait look.
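The quarter figure falls out of the standard thin-lens depth-of-field formulas. Here is a rough sketch in Python, assuming a 0.03 mm circle of confusion (a common full-frame convention) and holding subject distance and f-number fixed; the function name and default are ours for illustration.

```python
def depth_of_field_mm(focal_mm: float, f_number: float, subject_mm: float,
                      coc_mm: float = 0.03) -> float:
    """Total depth of field from the standard thin-lens formulas.

    coc_mm is the circle of confusion; 0.03 mm is an assumed full-frame value.
    Valid only while the subject is closer than the hyperfocal distance.
    """
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    if subject_mm >= hyperfocal:
        raise ValueError("subject at or beyond hyperfocal distance: far limit is infinite")
    near = subject_mm * (hyperfocal - focal_mm) / (hyperfocal + subject_mm - 2 * focal_mm)
    far = subject_mm * (hyperfocal - focal_mm) / (hyperfocal - subject_mm)
    return far - near


# Same subject distance (3 m) and aperture (f/2.8), focal length doubled.
dof_50 = depth_of_field_mm(50, 2.8, 3000)
dof_100 = depth_of_field_mm(100, 2.8, 3000)
print(f"50mm: {dof_50:.0f} mm, 100mm: {dof_100:.0f} mm, ratio {dof_50 / dof_100:.1f}x")
```

The ratio lands close to four. Note that the rule assumes you hold your position: if you step back to keep the subject the same size in frame, total depth of field stays roughly constant at a given aperture.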
| Lens | Effect | Use for |
|---|---|---|
| 24mm wide | Strong perspective; subject large relative to background | Establishing, vast scenes, hero shots |
| 35mm standard | Natural perspective, mild depth; "how the eye sees" | Default for most scenes |
| 50mm portrait | Slight compression, near human-eye view | Conversations, mid-range narrative |
| 85mm telephoto | Compressed background, shallow depth | Intimate portraits, isolation |
| 135mm long | Heavy compression, very shallow depth | Editorial portraits, voyeur feel |
| Macro | Extreme close detail | Product shots, texture work |
| Anamorphic | 2.35:1 widescreen, oval bokeh, horizontal lens flare | Cinematic blockbuster feel |
A practical anchor: the 35mm "invites you to show the whole scene with both subject and environment," while the 85mm range (roughly 75–85mm on full frame) is the most perceptually natural focal length for a medium close-up, where faces look undistorted and the background falls away gently. The American Society of Cinematographers notes that large-format and lens choice fundamentally change how depth and field of view read on screen, which is exactly the relationship you are exploiting in a prompt.
If you remember nothing else: 24–35mm for "show me the world," 50mm for "natural conversation," 85mm and up for "isolate this person." Add the word "anamorphic" whenever you want that widescreen, oval-bokeh, horizontal-flare blockbuster signature.
How do you light AI video like a cinematographer?
Lighting is the half of the look most amateur prompts skip entirely, and it is the single biggest lever for making AI video read as film rather than video. Always specify two things together: a source (where the light comes from and what kind) and a direction (where it strikes the subject from). "Soft light" is generic; "soft north-window key from the left, low fill" is directable.
Professional lighting is built on a foundation worth understanding because it gives you the vocabulary to direct it. The classic three-point setup uses a key light (the dominant source that shapes the subject), a fill light (a softer source on the opposite side that controls shadow depth), and a back or rim light (placed behind the subject to separate them from the background and add a three-dimensional edge). A common starting ratio assigns roughly 50% of the light to the key, 30% to the fill, and 20% to the back. You can prompt this directly: "three-point lighting, strong key from camera-left, soft fill from camera-right, rim light separating subject from a dark background."
Sources to specify
- Natural daylight — soft, overcast, or harsh
- Golden hour — warm, low-angle sun
- Blue hour — twilight cool tones
- Streetlamps and practicals — warm pools in darkness
- Studio softbox — even and controlled
- Ring light — flat fashion lighting
- Single window — directional natural light
- Candlelight or firelight — warm, intimate, flickering
- Neon — saturated, mixed colors
- Mixed warm and cool — golden hour plus streetlamp blue (a cinematic favorite)
Directions to specify
- Front-lit — flat and even, often dull
- Side-lit — dimensional and dramatic
- Backlit — silhouette or rim glow
- Top-lit — theatrical, sometimes ominous
- Underlit — uncanny and otherworldly
- 3/4 key plus fill — the classic portrait setup
Combinations that reliably look cinematic
| Source + direction | Result |
|---|---|
| Golden hour + side-lit | Warm, dimensional, modern cinematic |
| Single window + side-lit | Vermeer-style interior portrait |
| Mixed neon + top-lit | Cyberpunk street mood |
| Candlelight + soft 3/4 | Caravaggio-style chiaroscuro drama |
| Studio softbox + front-lit | Clean, neutral commercial look |
| Hard backlight + atmospheric haze | Silhouette and god-rays |
Notice how Google's own example prompt leans on this exact discipline: a worker "lit by the harsh fluorescent overhead lights and the green glow of the monochrome monitor" is a precise, two-source, directional lighting description, not the word "office." Match that level of specificity and your shots stop looking flat. If you take one habit from this entire article, make it this: never submit a video prompt without a lighting block.
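If you script your prompts, you can make the source-plus-direction habit impossible to skip. The sketch below is one hypothetical way to do it: `lighting_block` is our own helper name, and the output phrasing is just the pattern used in this section, not a format any model mandates.

```python
def lighting_block(source: str, direction: str, fill: str = "", rim: str = "") -> str:
    """Compose a lighting line; refuse to build one without source and direction."""
    if not source.strip() or not direction.strip():
        raise ValueError("a lighting block needs both a source and a direction")
    parts = [f"{source} key from {direction}"]
    if fill:
        parts.append(f"{fill} fill")
    if rim:
        parts.append(f"{rim} rim light")
    return "Lighting: " + ", ".join(parts) + "."


print(lighting_block("soft north-window", "the left", fill="low"))
# Lighting: soft north-window key from the left, low fill.
```

Calling it with a source but an empty direction raises an error, which is the point: "soft light" on its own never reaches the model.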
What camera movement should you choose?
Camera movement decides whether the camera is an observer (static) or a participant (moving with the action), and that choice changes the entire emotional register of a shot. For most narrative work, default to static or a slow push-in. For kinetic or action work, reach for tracking or handheld. Pick one movement per shot and let it serve the moment.
| Movement | Effect | Use for |
|---|---|---|
| Static / locked-off | Observational, formal | Establishing, dramatic stillness |
| Slow push-in (dolly in) | Increasing intimacy and tension | Reveals, emotional builds |
| Pull-out (dolly out) | Releasing, contextualizing | Resolution, scope reveals |
| Tracking / following | Moving with the subject | Walks, runs, journeys |
| Pan | Horizontal sweep | Reveals, location coverage |
| Tilt | Vertical sweep | Architecture, scale |
| Crane up / down | Vertical lift | Establishing, transitions |
| Orbit / arc | Circular around the subject | Emphasizing a character |
| Whip pan | Fast horizontal | Energetic transitions |
| Handheld | Subjective, intimate, slightly unstable | Documentary, raw moments |
| Steadicam / gimbal | Polished, smooth motion | Long takes, narrative flow |
| Crash zoom | Sudden focal-length change | Dramatic emphasis |
All of these parse in modern models — Google explicitly lists dolly, tracking, crane, aerial, slow pan, and POV shots as supported camera moves. The single most useful pairing for emotional scenes is "slow dolly in," because it manufactures tension purely through camera language. Use whip pans and crash zooms sparingly; they are seasoning, not the meal.
One discipline to hold onto here: keep camera movement separate from subject movement in your prompt. They are different fundamentals (numbers 4 and 5), and conflating them — "the camera walks" — produces muddy results. The camera tracks; the subject walks.
How do you direct subject motion within the frame?
Subject motion is the action and behavior of the people and objects in the shot, distinct from how the camera moves. AI video often renders subjects stiff or frozen because the prompt never told them what to do. Specify a clear action beat — a glance, a smile that fades, a hand reaching — and the shot gains life.
| Motion type | What to specify |
|---|---|
| Walk | Pace (brisk, slow, ambling) and gait |
| Stand | Posture (relaxed, tense, alert) and micro-movements |
| Sit | Setting and lean (back, forward, slouch) |
| Action beat | A single discrete action (look, reach, smile, turn) |
| Continuous motion | Sustained activity (running, dancing, working) |
| Environmental motion | Wind, water, smoke, fabric — independent of the subject |
The pro move is the single, motivated action beat. Instead of "a woman stands in a doorway," write "a woman stands in a doorway, then turns her head toward an off-screen sound and her expression shifts from calm to alert." That one beat gives the model a clear arc to animate across the clip's few seconds, and it is exactly the kind of micro-direction that separates a living shot from a frozen one.
Environmental motion is the secret weapon for realism. Wind in hair, fabric flutter, drifting smoke, rain hitting a surface, steam rising — these cues read as physical reality, and 2026 models are notably better at them. Kling 3.0's physics simulation handles fabric drape, fluid motion, and particle behavior with the most realistic results of any current video model, so it rewards you for asking for environmental motion explicitly.
How do you set the color palette and mood?
The palette, or color grade, decides the emotional temperature of the image. "Cinematic" alone is too generic to be useful; instead, name a specific palette or anchor it to a film or cinematographer the model recognizes. Color is the fastest way to make two technically identical shots feel like different films.
| Palette | Mood | Reference point |
|---|---|---|
| Warm gold + cool blue | Cinematic contrast | Most modern blockbusters |
| Desaturated muted | Bleak, serious | Prestige drama, thrillers |
| High saturation | Energetic, playful | Vintage, storybook |
| Monochromatic blue | Cold, clinical | Sci-fi |
| Sepia / warm vintage | Nostalgic | Period pieces |
| Pastel | Soft, dreamlike | Romantic, ethereal |
| High-contrast black & white | Stark, dramatic | Noir, art film |
| Neon noir | Saturated city night | Cyberpunk |
| Earth tones | Grounded, naturalistic | Documentary |
When you cannot articulate a look, borrow one. Naming a director or cinematographer anchors palette and lighting in a single phrase, because the model associates the name with a coherent visual signature. Useful names that the models recognize include David Fincher (cool, desaturated, precise), Wes Anderson (symmetrical, pastel, storybook), Denis Villeneuve (vast, atmospheric, monolithic), Roger Deakins (naturalistic light), Emmanuel Lubezki (fluid natural-light long takes), Bradford Young (warm, low-key, soft), and Hoyte van Hoytema (textured, large-format). Use the reference as one ingredient, not the whole recipe — pair "Fincher palette" with your own framing, lens, and lighting blocks for control.
How do you assemble a full filmmaker-grade prompt?
You assemble a complete prompt by writing all six fundamentals as labeled blocks, then ordering them roughly the way Google's official Veo formula recommends: [Cinematography] + [Subject] + [Action] + [Context] + [Style & Ambiance], with an explicit audio line for models that support native sound. Labeling each block keeps the model from blending or dropping your direction.
Here is a complete, copy-pasteable example — a cinematic, intimate portrait shot:
Framing + Lens: Medium close-up, 35mm anamorphic lens, slight low angle
to emphasize her stride.
Subject: A 30-year-old woman with curly red hair, light freckles, wearing
a charcoal wool coat, holding a leather portfolio.
Action: Walking briskly across wet cobblestone, glancing back over her
shoulder once mid-walk; a slight smile fades to neutral.
Context: Paris, autumn dusk in late October, light rain falling, Notre
Dame in soft-focus background, lamp posts glowing, atmospheric haze.
Lighting: Golden-hour warm key from the west, mixing with cool blue fill
from streetlamps. Soft side rim light from her right. Atmospheric haze
diffuses the background.
Camera Motion: Smooth gimbal tracking shot from her right side, moving at
her walking pace and staying close for intimacy.
Subject Motion: Brisk walk, hair moves with motion, coat flutters
slightly; the glance over the shoulder is a discrete beat — pause, then
return forward.
Style / Palette: 35mm film grain, warm gold + cool blue contrast,
cinematic palette inspired by Fincher's atmospheric work. Anamorphic lens
flare from a streetlamp.
Audio (Veo 3.1): Footsteps on wet cobblestone, distant city traffic,
faint church bells, sparse melancholic piano. No dialogue.
That is filmmaker-level direction. Every one of the six fundamentals is present and specified, the structure follows Google's recommended order, and the audio line is broken out for native-audio models. The output reads as a cut from a film, not as generic AI video. If you want a leaner version, the same prompt compresses into a single dense paragraph — but the discipline of writing the blocks first is what guarantees nothing gets dropped.
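If you generate prompts programmatically, the labeled-block discipline is easy to encode. This is a minimal sketch, assuming the block labels from the example above; `BLOCK_ORDER` and `assemble_prompt` are illustrative names, and no model requires these exact label strings.

```python
# Block labels in the order used by the example above. The labels are this
# article's convention; no model requires these exact strings.
BLOCK_ORDER = (
    "Framing + Lens",
    "Subject",
    "Action",
    "Context",
    "Lighting",
    "Camera Motion",
    "Subject Motion",
    "Style / Palette",
    "Audio",
)


def assemble_prompt(blocks: dict[str, str]) -> str:
    """Join labeled blocks in a fixed order, skipping any that are empty."""
    unknown = set(blocks) - set(BLOCK_ORDER)
    if unknown:
        raise KeyError(f"unrecognized block labels: {sorted(unknown)}")
    lines = [f"{label}: {blocks[label].strip()}"
             for label in BLOCK_ORDER if blocks.get(label, "").strip()]
    return "\n".join(lines)


print(assemble_prompt({
    "Lighting": "Golden-hour warm key from the west, cool blue streetlamp fill.",
    "Framing + Lens": "Medium close-up, 35mm anamorphic lens.",
    "Subject": "A 30-year-old woman with curly red hair in a charcoal wool coat.",
}))
```

Whatever order you write the blocks in, they come out in the same sequence every time, and a mistyped label fails loudly instead of silently vanishing from the prompt.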
For repeatable production, save this structure as a reusable scaffold. A tool like Prompt Architects lets you store this labeled-block skeleton with Global Variables for subject, lighting, and palette, so you fill in the blanks instead of retyping the whole structure for every shot. The cinematography skill is what makes the output look directed; the tool just removes the typing.
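To show the idea without any particular tool, here is a rough sketch using Python's standard-library `string.Template`. It is not Prompt Architects' API; the scaffold text, the variable names, and the fisherman subject are all made up for illustration.

```python
from string import Template

# A saved scaffold with three fill-in variables. The fixed blocks and the
# variable names ($subject, $lighting, $palette) are illustrative only.
SCAFFOLD = Template(
    "Framing + Lens: Medium close-up, 85mm lens.\n"
    "Subject: $subject\n"
    "Lighting: $lighting\n"
    "Style / Palette: $palette"
)

prompt = SCAFFOLD.substitute(
    subject="A fisherman in a yellow oilskin coat.",
    lighting="Overcast daylight, side-lit from camera-left.",
    palette="Desaturated muted palette.",
)
print(prompt)
```

`substitute` raises a `KeyError` if you forget a variable, so a half-filled scaffold cannot slip through to generation.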
How do Veo 3.1, Kling 3.0, and Sora 2 differ for cinematic prompting?
The three leading 2026 models share the same cinematography vocabulary but differ in clip length, audio, resolution, and physics. Veo 3.1 leads on native synchronized audio and prompt-following, Kling 3.0 leads on physics realism and native 4K, and Sora 2 leads on clip duration. Knowing the differences lets you route each shot to the right model.
| Capability | Veo 3.1 | Kling 3.0 | Sora 2 |
|---|---|---|---|
| Clip length | 4–8 seconds | Up to ~15 seconds | 10–25 seconds |
| Max resolution | Up to 4K | Native 4K, up to 60fps | 1080p (Full HD) |
| Native audio | Yes — dialogue, SFX, ambient | Multi-language audio + lip-sync | Yes — synchronized audio |
| Standout strength | Prompt-following + audio sync | Physics realism, Motion Brush | Clip duration, character cameos |
| Best for | Dialogue and sound-driven scenes | Image-to-video, VFX-grade motion | Longer single-shot narratives |
A few details worth knowing. Veo 3.1 generates joint audio and video so that footsteps match movement and dialogue syncs to lips — put direct dialogue in quotation marks and label SFX and ambient lines clearly. Kling 3.0, released in February 2026, is the first model to produce native 4K at 60fps with single clips up to 15 seconds, and its Motion Brush lets you draw a literal motion path on a still frame for directorial control. Sora 2 extends generation to 10–25 seconds with synchronized audio at up to 1080p, which makes it the pick when you need a longer single take.
Practical routing: storyboard and sound-design a dialogue scene in Veo 3.1; animate a Midjourney-generated cinematic still in Kling 3.0; produce a longer continuous establishing shot in Sora 2. The six fundamentals transfer across all three — only the strengths change.
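The routing logic is simple enough to write down. The sketch below copies clip-length figures and strengths from the comparison table above, so treat them as this article's claims rather than verified vendor specs. The table gives no minimum clip length for Kling 3.0, so the zero used here is an assumption, and the `priority` labels are our own shorthand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    min_s: float
    max_s: float
    strength: str


# Figures taken from the comparison table; Kling's minimum of 0 is assumed.
MODELS = (
    Model("Veo 3.1", 4, 8, "dialogue"),
    Model("Kling 3.0", 0, 15, "image-to-video"),
    Model("Sora 2", 10, 25, "long take"),
)


def route_shot(duration_s: float, priority: str) -> str:
    """Return a model that can hold the clip, preferring a strength match."""
    fits = [m for m in MODELS if m.min_s <= duration_s <= m.max_s]
    if not fits:
        raise ValueError("no single clip is long enough; split the shot")
    for model in fits:
        if model.strength == priority:
            return model.name
    return fits[0].name  # no strength match: fall back to the first model that fits


print(route_shot(6, "dialogue"))         # Veo 3.1
print(route_shot(12, "image-to-video"))  # Kling 3.0
print(route_shot(20, "dialogue"))        # Sora 2
```

The last call shows the trade-off: a 20-second dialogue shot goes to Sora 2 because duration is a hard limit while a model's strength is only a preference.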
What's the best workflow for multi-shot AI video?
The best multi-shot workflow locks subject and style on a single hero shot, then reuses those exact modifiers across every other shot to maintain character and world consistency. Storyboard before you prompt, generate one reference shot to perfection, then vary only action, framing, and camera movement per cut.
Four patterns cover almost every project:
- Storyboard before prompting. Sketch or describe each shot before you write a single prompt. Fifteen minutes of storyboarding (five shots at three minutes each) can save thirty minutes of regeneration. You catch continuity problems on paper, where fixing them is free.
- Lock subject and style first. Generate one hero shot, tuning subject, wardrobe, lighting, and palette until it is right. Then copy those blocks verbatim into every subsequent shot so the character and world stay consistent across cuts.
- Use consistency features. Veo's JSON character mode locks subject details across shots, and Kling 3.0's multi-shot scene logic keeps characters consistent across cuts with correct occlusion — if a character walks behind a tree, they emerge with the same face and clothing intact, per Kling's 2026 physics improvements.
- Work reference-driven. Find three stills from films you love that match your target look, reverse-engineer the six fundamentals from each, and reuse the common modifiers. This is the fastest way to develop a coherent house style.
For longer pieces, think in beats. A 60-second video is not one prompt; it is eight to twelve shots, each a separate generation, stitched in an editor. Plan the cut, then prompt the shots. Our deeper walkthrough on building consistent sequences lives in the AI video workflow guide, and the reverse-engineering method is covered in how to reverse-engineer prompts from images.
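Lock-then-vary can be sketched in a few lines. This is an illustrative outline, not a feature of any model: `LOCKED`, `CUTS`, `build_shots`, and `shots_needed` are hypothetical names, and the clip lengths in the last line are just the Veo 3.1 range quoted earlier.

```python
import math

# Blocks locked on the hero shot and copied verbatim into every cut.
LOCKED = {
    "Subject": "A 30-year-old woman with curly red hair, charcoal wool coat.",
    "Lighting": "Golden-hour warm key from the west, cool blue streetlamp fill.",
    "Style / Palette": "35mm film grain, warm gold + cool blue contrast.",
}

# Only framing, action, and camera movement vary per cut.
CUTS = [
    {"Framing": "Wide shot", "Action": "She crosses the square.", "Camera": "Static"},
    {"Framing": "Close-up", "Action": "She glances back.", "Camera": "Slow dolly in"},
]


def build_shots(locked: dict[str, str], cuts: list[dict[str, str]]) -> list[str]:
    """Merge each cut with the locked blocks; locked values win on a key clash."""
    shots = []
    for cut in cuts:
        blocks = {**cut, **locked}
        shots.append("\n".join(f"{label}: {text}" for label, text in blocks.items()))
    return shots


def shots_needed(total_s: float, clip_s: float) -> int:
    """How many clips of clip_s seconds cover total_s seconds."""
    return math.ceil(total_s / clip_s)


for shot in build_shots(LOCKED, CUTS):
    print(shot, end="\n\n")
print(shots_needed(60, 8), shots_needed(60, 5))  # 8 12
```

The last line is where the eight-to-twelve figure comes from: a 60-second piece cut from clips of five to eight seconds each.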
What are the most common filmmaker mistakes in AI video?
The most common mistakes are skipping the lighting block, mixing framings, omitting the lens, conflating camera and subject motion, vague palettes, frozen subjects, and trying to script dialogue. Each one pulls output back toward generic. Run this list as a pre-flight check before every generation.
- No lighting block. Half the look, skipped. Always specify a source and a direction.
- Mixed framing. "Wide close-up" is meaningless. Pick exactly one framing per shot.
- No lens specified. "35mm" produces different depth and perspective than no lens.
- Subject motion equals camera motion. Specify each separately; the camera tracks, the subject walks.
- Vague palette. "Cinematic" is filler. Name a specific palette or a recognizable reference.
- Frozen subject. If you do not give a motion-within-frame beat, the subject just stands there. Add a glance, a smile, a slight head turn.
- Scripting dialogue as narrative. AI video does not render extended scripted dialogue reliably. Specify actions, and route spoken lines through the audio block in quotation marks (Veo 3.1, Sora 2), not as a screenplay.
If you fix only the first and the sixth — lighting and subject motion — you will close most of the quality gap between "obviously AI" and "looks directed." Those two are where amateur prompts bleed the most.
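Part of this pre-flight check can be automated. The sketch below catches only the three mistakes that plain string matching can see (framing count, missing lens, missing lighting); the word lists are illustrative and deliberately small, and a focal length mentioned only as film stock would fool the lens check. Frozen subjects, vague palettes, and conflated motion still need a human read.

```python
import re

# Longer framing names come first so "medium close-up" is not counted as "close-up".
FRAMING_RE = re.compile(
    r"extreme wide shot|extreme close-up|medium close-up|medium shot|wide shot|close-up"
)
LENS_RE = re.compile(r"\b\d{2,3}mm\b|macro|anamorphic")
LIGHT_TERMS = ("light", "-lit", "backlit", "underlit")


def preflight(prompt: str) -> list[str]:
    """Return warnings for three mistakes that plain string matching can catch."""
    text = prompt.lower()
    warnings = []
    framings = FRAMING_RE.findall(text)
    if len(framings) != 1:
        warnings.append(f"expected exactly one framing, found {len(framings)}")
    if not LENS_RE.search(text):
        warnings.append("no lens specified")
    if not any(term in text for term in LIGHT_TERMS):
        warnings.append("no lighting block")
    return warnings


print(preflight("A woman walks through Paris"))
print(preflight("Medium close-up, 85mm lens, golden-hour side light from the left"))
```

The first prompt trips all three warnings; the second comes back clean.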
How do you build a reusable reference vocabulary?
You build a reference vocabulary by collecting a short cheat-sheet of director and film names that anchor a specific look, then using one per shot as a single style ingredient. When you cannot describe a look from scratch, naming a recognizable visual signature gives the model a coherent target for palette and lighting at once.
| When you want… | Reference to use |
|---|---|
| Cinematic blockbuster contrast | "warm gold + cool blue, mixed temps" or "blockbuster palette" |
| Symmetrical pastel storybook | "Wes Anderson style" |
| Vast atmospheric sci-fi | "Denis Villeneuve atmospheric, vast scale" |
| Handheld intimate realism | "handheld, naturalistic" |
| Natural-light realism | "Roger Deakins natural light" |
| Cool desaturated thriller | "Fincher palette, desaturated" |
| Long-take fluid camera | "Lubezki style, long take" |
| Warm intimate naturalism | "Bradford Young warm low-key" |
| Luminous anime skies | "Makoto Shinkai luminous skies" |
| Pastoral hand-drawn warmth | "Studio Ghibli aesthetic" |
These references work across Veo 3.1, Kling 3.0, and Sora 2 with varying strength. Treat them as one component of a prompt, never the entire style anchor — "Fincher palette" plus your own framing, lens, and lighting blocks gives you both the recognizable signature and full control. Over time, keep your personal list of ten to fifteen scaffolds for the shot types you make most. That library is what turns occasional good results into a repeatable house style. If you want to go deeper on lighting language specifically, our AI lighting prompt guide breaks down every source and direction with examples.
What changed for cinematic AI video in 2025–2026?
Four shifts redefined cinematic AI video between 2025 and 2026: native synchronized audio arrived, resolution and clip length jumped, physics realism improved enough to handle occlusion and fabric, and prompt-following tightened. Together they closed gaps that previously forced creators to fix sound, length, and motion in post.
- Native audio became standard. Veo 3 introduced joint audio-visual generation in May 2025, with synchronized dialogue, SFX, and ambient sound — closing the sound-design gap inside the prompt itself. Sora 2 and Kling 3.0 followed with their own synchronized-audio pipelines.
- Resolution and length climbed. Kling 3.0 shipped native 4K at 60fps in February 2026, and Sora 2 extended clips to 25 seconds with audio, giving directors longer single takes to work with.
- Physics got real. The 2026 generation of Kling maintains structural integrity through occlusion and renders fabric drape, fluid motion, and particle behavior convincingly — the kinds of details that previously screamed "AI."
- Prompt-following tightened. All three models now parse cinematography vocabulary far more reliably than their 2024 predecessors, which is precisely why the six-fundamentals approach pays off more than ever.
The takeaway: the tools have caught up to film vocabulary. The constraint is no longer the model's ability to render a dolly-in under golden-hour side light — it is whether your prompt asks for it.
What should you do next?
Do four things to convert this guide into a skill. Each is small, and together they build the habit that makes every future prompt better.
- Pick a film scene you love. Pause it and reverse-engineer its six fundamentals: framing, lens, lighting, camera movement, subject motion, and palette. Write them down.
- Write a prompt with all six specified. Use the labeled-block structure from the assembly section. Generate it in whichever model fits the shot.
- Compare it to your previous output. Note the lift. The gap between this and "a woman walks through Paris" is the entire point of the method.
- Build your reference library. Save ten to fifteen prompt scaffolds for the shot types you use most, with reusable variables for subject, lighting, and palette.
Master the six fundamentals and you can direct any model — Veo, Kling, or Sora — like a filmmaker rather than a slot machine. Tools that ship cinematic prompt presets and reusable variables, like Prompt Architects, accelerate the execution. But the cinematography skill is what makes the output look directed, not generated. Learn the fundamentals; let the tool handle the typing.
Frequently asked questions
Do AI video models actually understand cinematography terms? Yes. Veo 3.1, Kling 3.0, and Sora 2 are all trained to parse professional cinematographic vocabulary. Google's official guide confirms that terms like dolly shot, close-up, low angle, shallow depth of field, and wide-angle lens translate directly into the generated footage. Use technical terms; do not dumb them down.
What are the most important cinematography decisions for AI video? Six: framing (wide / medium / close-up), lens (24mm / 35mm / 50mm / 85mm), lighting source plus direction, camera movement (static / dolly / tracking), subject motion within the frame, and color palette or mood. Specifying these six separates film-quality output from generic stock-footage feel.
What is the best prompt structure for Veo 3? Google recommends a five-part formula: [Cinematography] + [Subject] + [Action] + [Context] + [Style & Ambiance]. Lead with the shot and camera work, name the subject, describe the action, set the environment, then close with aesthetic, mood, and lighting. For Veo 3.1, add an explicit audio line with dialogue in quotation marks.
Can I direct AI video without film school knowledge? Yes. The six fundamentals cover roughly 80% of cinematic decisions. Reference real films you love, describe what is happening in those scenes, and the model replicates the patterns. You do not need formal training to direct AI video well.
What's the biggest filmmaker mistake in AI video? Skipping lighting. Most amateur prompts say what is in the shot but not how it is lit. Lighting is half the look. "Golden-hour warm key from the west mixing with cool blue streetlamp fill" produces an entirely different image than "sunset." Always specify a source and a direction.
Should I describe shots like a script or like a shot list? Shot list. Scripts contain dialogue, character intent, and narrative that AI video does not render reliably yet. Shot lists describe what is visible: subject, action, framing, lens, lighting, and motion. Production-crew language transfers far better than screenwriter language.
How long can AI video clips be in 2026? It varies by model. Sora 2 generates 10 to 25 seconds with synchronized audio, Kling 3.0 extends single clips to roughly 15 seconds at up to native 4K/60fps, and Veo 3.1 produces 4- to 8-second clips at up to 4K with native audio. For longer sequences, you stitch multiple shots and lock subject and style across them.
How do I keep the same character across multiple AI video shots? Lock the subject and style first. Generate one hero shot until the character, wardrobe, lighting, and palette are right, then reuse those exact modifiers across every other shot. Veo's JSON character mode and Kling 3.0's multi-shot consistency help maintain the same face and clothing across cuts.
By Nafiul Hasan — Founder of Prompt Architects, who has built and tested cinematic prompt systems across Veo, Kling, Midjourney, and Sora. Last updated: June 10, 2026.