Generating consistent visuals across a multi-shot video is the single hardest problem in AI video production today. Subjects change faces between shots. Lighting flips from sunset to midday without warning. The camera suddenly switches from first-person to drone. You get six beautiful clips that look like they belong to six different films.
After rendering over 10,000 multi-shot videos on Shortly, we found a four-part prompt formula that reliably keeps the look coherent. It works because diffusion models, left alone, optimize for beauty per shot, not consistency across shots. You have to force their hand.
The formula: subject, action, mood, camera
Every scene description should hit these four dimensions in the same order, with the same language, every single shot. Like this:
SUBJECT — identical noun phrase across all shots
ACTION — what changes between shots (the story beat)
MOOD — identical emotional descriptors
CAMERA — identical POV and lens choice

When the subject, mood and camera stay identical word-for-word across shots, the model learns you mean the same character and tone. Only the action field is allowed to vary.
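The formula is easy to mechanize so the fixed fields can never drift between shots. Here is a minimal sketch — the function and field names are illustrative, not a Shortly API:

```python
# Sketch of the subject/action/mood/camera formula.
# Hold subject, mood, and camera constant; vary only the action beat.

def build_shots(subject, mood, camera, actions):
    """Return one prompt per action beat, with every other field fixed."""
    base = f"{subject}, {mood}, {camera}."
    return [f"{base} {action}" for action in actions]

prompts = build_shots(
    subject="Tiny baby fox curled up sleeping in my hand",
    mood="soft morning light",
    camera="first-person POV, 35mm lens",
    actions=[
        "The fox slowly lifts its head.",
        "The fox yawns and stretches.",
        "The fox settles back down and closes its eyes.",
    ],
)
for p in prompts:
    print(p)
```

Because the base string is built once, typo-level drift between shots is impossible by construction.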
Example: a sleeping baby fox in a palm
Shot 1: Tiny baby fox curled up sleeping in my hand, soft morning light, first-person POV, 35mm lens. The fox slowly lifts its head.
Shot 2: Tiny baby fox curled up sleeping in my hand, soft morning light, first-person POV, 35mm lens. The fox yawns and stretches.
Shot 3: Tiny baby fox curled up sleeping in my hand, soft morning light, first-person POV, 35mm lens. The fox settles back down and closes its eyes.

Notice how 80% of each prompt is identical. Only the last sentence changes — that is the action beat. The model reads this as "same scene, new moment" rather than "three different videos."
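You can check that overlap numerically with Python's standard difflib — a quick sanity check on the first two fox prompts, nothing more:

```python
from difflib import SequenceMatcher

shot1 = ("Tiny baby fox curled up sleeping in my hand, soft morning light, "
         "first-person POV, 35mm lens. The fox slowly lifts its head.")
shot2 = ("Tiny baby fox curled up sleeping in my hand, soft morning light, "
         "first-person POV, 35mm lens. The fox yawns and stretches.")

# ratio() returns the fraction of matching characters (0.0 to 1.0).
# These prompts score high because only the action beat differs.
overlap = SequenceMatcher(None, shot1, shot2).ratio()
print(f"{overlap:.0%} of the text is shared")
```

If a pair of shot prompts scores noticeably lower than this, something besides the action beat has drifted.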
Common mistakes to avoid
- Changing the subject phrase between shots. "Baby fox" in shot 1 and "cute fox" in shot 2 breaks identity.
- Letting the mood drift. "Peaceful" in one shot, "mysterious" in the next — the palette changes.
- Switching camera vocabulary. "First-person POV" and "POV shot" are not the same to the model.
- Adding too many adjectives to the mood. Two is the max. More becomes noise.
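These mistakes are all mechanical, so they can be linted before you render. A hedged sketch (not a Shortly feature) that checks a shot list for the problems above, assuming each shot is a dict with the four fields:

```python
def lint_shots(shots):
    """Flag drift in the fields that must stay identical across shots.

    `shots` is a list of dicts with keys: subject, action, mood, camera.
    Returns a list of warning strings; an empty list means consistent.
    """
    warnings = []
    first = shots[0]
    for i, shot in enumerate(shots[1:], start=2):
        for field in ("subject", "mood", "camera"):
            if shot[field] != first[field]:
                warnings.append(
                    f"shot {i}: '{field}' changed from "
                    f"{first[field]!r} to {shot[field]!r}"
                )
    # Mood: two descriptors max, or it becomes noise.
    if len(first["mood"].split(",")) > 2:
        warnings.append("mood has more than two descriptors")
    return warnings

# Deliberately broken example: subject and camera vocabulary drift.
shots = [
    {"subject": "Tiny baby fox", "mood": "soft morning light",
     "camera": "first-person POV, 35mm lens", "action": "lifts its head"},
    {"subject": "Cute fox", "mood": "soft morning light",
     "camera": "POV shot, 35mm lens", "action": "yawns and stretches"},
]
for warning in lint_shots(shots):
    print(warning)
```

The example catches both the "baby fox" vs. "cute fox" identity break and the "first-person POV" vs. "POV shot" vocabulary switch.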
Why this works
Diffusion models are conditioned on the prompt text. When the text is mostly identical, the model's output distribution is mostly identical — same subject appearance, same lighting, same framing. Only the part you changed moves in latent space. You're essentially writing a storyboard in natural language, which is exactly what storyboards are for.
Until we have explicit character and style tokens (we're working on it), the prompt itself is your only lever. Use it.
Your turn
Pick a 3-shot idea and try the formula. Write your subject, mood, and camera line once — then write three action beats. Paste into Shortly and compare the result to free-form prompts. The jump in consistency is immediate.