Prosumers can use Google Veo 3’s "High-Quality Chaining" for fast social media content. Indie filmmakers can achieve narrative consistency by combining Midjourney V7 for style, Kling for lip-synced dialogue, and Runway Gen-4 for camera control, while professional studios gain full control with a layered ComfyUI pipeline to output multi-layer EXR files for standard VFX compositing.
📺 Heads up: this episode is from 2025, and the field moves fast. For current, weekly coverage of the full AI image and video pipeline, from a single shot to a one-person studio, listen to my new show, AI Video Generation.
Leaderboards: Video, TTS, Image
Goal: Rapidly produce branded, short-form video for social media. This method bypasses Veo 3's weaker native "Extend" feature.
- Clip 1: Generate an 8s clip from a character sheet image.
- Extract Final Frame: Save the last frame of Clip 1.
- Clip 2: Use the extracted frame as the image input for the next clip, using a "this then that" prompt to continue the action. Repeat as needed.
- [Genre: ...], [Mood: ...]) to generate and extend a music track.

Goal: Create cinematic short films with consistent characters and storytelling focus, using a hybrid of specialized tools.

- --cref and --sref parameters.
- --cref --cw 100 to create consistent character poses and with --sref to replicate the visual style in other shots. Assemble a reference set.

Goal: Achieve absolute pixel-level control, actor likeness, and integration into standard VFX pipelines using an open-source, modular approach.

- Loaders: Load base model, custom character LoRA, and text prompts (with LoRA trigger word).
- ControlNet Stack: Chain multiple ControlNets to define structure (e.g., OpenPose for skeleton, Depth map for 3D layout).
- IPAdapter-FaceID: Use the Plus v2 model as a final reinforcement layer to lock facial identity before animation.
- AnimateDiff: Apply deterministic camera motion using Motion LoRAs (e.g., v2_lora_PanLeft.ckpt).
- KSampler -> VAE Decode: Generate the image sequence.
- mrv2SaveEXRImage to save the output as an EXR sequence (.exr). Configure for a professional pipeline: 32-bit float, linear color space, and PIZ/ZIP lossless compression. This preserves render passes (diffuse, specular, mattes) in a single file.

For music, choose based on your goal. For a complete, ready-to-use song for content like a YouTube video, use Suno. For high-quality audio components and inspiration to edit in a professional audio program, use Udio.
For sound effects, podcasters who want to add SFX and narration in the same tool should use ElevenLabs' integrated SFX generator. Game developers and filmmakers who need a large library of unique, licensed assets for a component-based workflow should use a specialized tool like SFX Engine.
For voice generation, ElevenLabs is the best for pure realism. However, different workflows may require other tools. Murf.ai offers an all-in-one studio for marketing teams, and Play.ht has a low-latency API for enterprise developers.
For open-source, local text-to-speech, several models are available. StyleTTS 2 achieves human-level quality and is best for generating natural speech without a reference voice. For local voice cloning with minimal audio input, Coqui's XTTS-v2 is the best model. For applications that need high speed on a CPU, Piper TTS is a lightweight and popular choice. The field is also improving with new models like Kokoro TTS.
In 2025, generative AI is a standard part of professional media production. The best workflow depends on the user's goals and technical skill. This document outlines the most effective production method for three distinct user types: prosumers, indie filmmakers, and professional studios.
This workflow is for social media managers and marketers who need to produce high-volume, visually engaging short-form content for platforms like TikTok, Instagram Reels, and YouTube Shorts. The main goals are speed and brand consistency.
The problem is that Google's Veo 3, while generating high-quality 8-second clips with integrated ambient audio, has a weak native "Extend" feature. This feature often uses a lower-quality model, causing a noticeable drop in visual consistency. The "High-Quality Chaining" method avoids this by manually connecting high-quality clips.
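Chaining relies on saving the last frame of each clip and reusing it as the image input for the next one. The sketch below shows one way that bookkeeping could be scripted around the manual generation step. It assumes `ffmpeg` is installed and on the PATH; the file names and the `chain_plan` helper are hypothetical, and clip generation itself still happens in Veo 3, not in this code.

```python
import subprocess
from pathlib import Path


def last_frame_command(clip: Path, frame: Path) -> list[str]:
    """Build an ffmpeg command that writes the final frame of `clip` to `frame`."""
    return [
        "ffmpeg", "-y",
        "-sseof", "-1",   # input option: start decoding 1 second before the end
        "-i", str(clip),
        "-update", "1",   # keep overwriting one image, so the last frame remains
        str(frame),       # use .png to avoid extra compression between clips
    ]


def extract_last_frame(clip: Path, frame: Path) -> None:
    """Run the command above; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(last_frame_command(clip, frame), check=True)


def chain_plan(start_image: Path, prompts: list[str], workdir: Path) -> list[dict]:
    """Plan a chain: each clip's image input is the previous clip's last frame."""
    steps = []
    image_input = start_image
    for index, prompt in enumerate(prompts, start=1):
        clip = workdir / f"clip_{index:02d}.mp4"
        frame = workdir / f"clip_{index:02d}_last.png"
        steps.append({
            "clip": clip,                # where the downloaded clip is expected
            "prompt": prompt,            # "this then that" continuation prompt
            "image_input": image_input,  # image to upload for this generation
            "last_frame": frame,         # frame to extract once the clip exists
        })
        image_input = frame
    return steps
```

After downloading each generated clip to the planned path, calling `extract_last_frame(step["clip"], step["last_frame"])` would produce the image to upload for the following step.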
Toolchain:
| Production Stage | Selected Tool | Strengths for Prosumer | Rationale |
|---|---|---|---|
| Character/Scene Concept | GPT-4o (GPT-Image-1) | Strong prompt adherence, text rendering, conversational refinement. | Delivers predictable visuals with minimal iteration, good for brand work. |
| Core Video Generation | Google Veo 3 | High single-shot quality, integrated audio generation. | Simplifies production by combining video and ambient sound. |
| Music & Soundtrack | Udio | Fast generation, genre flexibility, manual mode for hooks. | Creates custom, "viral-style" audio that stands out. |
| Final Edit & Assembly | CapCut | Intuitive UI, cross-platform, built-in social media text/effects. | Industry standard for Reels/TikTok, allowing for fast deployment. |
To ensure visual consistency, start by creating a set of reference images that define your character.
Process:
- photorealistic headshot of "Aya," a 25-year-old Japanese woman with a sharp jawline, intelligent dark brown eyes, a small beauty mark under her left eye, and short, asymmetrical black hair with a single streak of electric blue. She is wearing a minimalist grey turtleneck sweater. The background is a solid, neutral grey. Professional studio lighting, subject has a neutral expression, 8k, hyperdetailed, sharp focus.
- Perfect. Now, using the exact same character, show her from a 3/4 angle, looking slightly off-camera with a small, confident smile.
- Excellent. For the next shot, keep the character identical but have her look surprised, with her hand partially covering her mouth. Keep the studio lighting and grey background.
- One more. Show the same character from a side profile, looking thoughtful.

This technique creates a video longer than 8 seconds while maintaining Veo 3's highest quality. It uses the final frame of one clip as the starting image for the next.
Process:
- A photorealistic video of Aya. She is sitting in a modern, minimalist cafe. She looks directly at the camera, then turns her head to look out a large window at a rainy city street. The camera performs a slow, subtle push-in on her face. Include the ambient sounds of rain against the window and the soft murmur of a cafe.
- Continuing from this exact moment, Aya turns her head back from the window to face forward. A look of sudden realization crosses her face. She picks up a white ceramic coffee mug from the table in front of her and brings it to her lips. The camera remains steady.

Create a custom, high-quality music track for your video.
Process:
- [Genre: Lofi Hip-Hop, Chillwave], [Mood: melancholy, pensive, hopeful, rainy day]
- [Intro] or [Outro], or to generate more sections to reach your desired length. For songs with lyrics, write them directly in the interface and use tags like [Verse] and [Chorus] to structure the song.

Combine all assets into a final video.
Process:
This workflow is for creators making narrative short films (1-5 minutes). The focus is on storytelling, character consistency, and cinematic look. This requires managing more technical complexity.
The challenge is that no single video model excels at all types of shots (e.g., dialogue, action, establishing shots). The solution is a hybrid pipeline that uses specialized tools for each task.
Toolchain:
--cref (character reference) and --sref (style reference) parameters are used to establish a consistent character and cinematic look that can be used as a master reference across all other tools.

| Narrative Function | Selected Tool | Strengths for Narrative | Rationale |
|---|---|---|---|
| Character/World Design | Midjourney V7 | Character (--cref) & Style (--sref) Reference, cinematic quality. | Creates a master visual blueprint for consistency across all tools. |
| Dialogue Scenes | Kling | Superior lip-sync, high-fidelity character realism, physics simulation. | Essential for believable dialogue scenes. |
| Cinematic B-Roll/Action | Runway Gen-4 | Advanced Camera Controls, Multi-Motion Brush, Director Mode. | Provides creative control over non-dialogue shots. |
| Voice Generation | ElevenLabs | High emotional fidelity, voice cloning, natural cadence. | Delivers performances that can be convincingly synced. |
| Edit, Color & Finish | DaVinci Resolve | All-in-one suite (Edit, Color, Fusion), performance, cost. | A professional, non-subscription editor with industry-best color tools. |
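Because the same reference flags are repeated on every Midjourney prompt in this workflow, it can help to assemble them programmatically. The sketch below is a plain string builder, not a Midjourney API client; the `midjourney_prompt` function is a hypothetical helper, and the weight ranges (0-100 for `--cw`, 0-1000 for `--sw`) are assumptions to verify against the current Midjourney documentation, since reference parameters have changed between model versions.

```python
from typing import Optional


def midjourney_prompt(
    description: str,
    *,
    aspect: Optional[str] = None,
    style: Optional[str] = None,
    cref: Optional[str] = None,
    cw: Optional[int] = None,
    sref: Optional[str] = None,
    sw: Optional[int] = None,
    version: str = "7",
) -> str:
    """Assemble a prompt string with optional character and style references."""
    if cw is not None and not 0 <= cw <= 100:
        raise ValueError("--cw is expected to be between 0 and 100")
    if sw is not None and not 0 <= sw <= 1000:
        raise ValueError("--sw is expected to be between 0 and 1000")

    parts = [description.strip()]
    if aspect:
        parts += ["--ar", aspect]
    if style:
        parts += ["--style", style]
    if cref:
        parts += ["--cref", cref]      # character reference image URL
        if cw is not None:
            parts += ["--cw", str(cw)]  # character weight
    if sref:
        parts += ["--sref", sref]      # style reference image URL
        if sw is not None:
            parts += ["--sw", str(sw)]  # style weight
    parts += ["--v", version]
    return " ".join(parts)


HERO = "<URL of hero image>"  # placeholder, as in the prompts in this section
character_shot = midjourney_prompt(
    "a cinematic film still of the same man sitting in a dimly lit ramen bar",
    cref=HERO, cw=100,
)
style_shot = midjourney_prompt(
    "cinematic film still, an empty, rain-slicked alleyway at night in Tokyo",
    sref=HERO, sw=800,
)
```

Keeping the hero image URL in one constant means every shot in the reference set points at the same master image.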
Establish the film's visual language by creating a master set of reference images for the main character and aesthetic.
Process:
- cinematic film still, a weary male detective, "Jack," mid-40s, unshaven with a five o'clock shadow, tired eyes, wearing a rumpled brown trench coat, standing on a rain-slicked neon-lit street in Tokyo at night. photorealistic, anamorphic lens flare, 35mm film grain, moody, noir aesthetic --ar 16:9 --style raw --v 7
- --cref: Get the URL of the hero image. Use it with the --cref parameter to generate new images of the same character in different poses while locking in their features.
- a cinematic film still of the same man sitting in a dimly lit ramen bar, looking at a case file on the counter. --cref <URL of hero image> --cw 100 --v 7
- --cw (character weight) parameter (0-100) controls how closely the new image matches the reference. Use 100 for maximum consistency.
- --sref: Use the same hero image URL with the --sref parameter to generate images that share the original's visual style (color, lighting, mood) without including the character.
- cinematic film still, an empty, rain-slicked alleyway at night in Tokyo, neon signs reflecting in puddles on the ground. --sref <URL of hero image> --sw 800 --v 7
- --sw (style weight) parameter controls the strength of the style transfer.

To get the best lip-sync, generate the audio and video separately and then combine them. Generating video from a prompt that includes dialogue often results in poor sync.
Process:
- A cinematic close-up of the detective, Jack. He listens intently to someone off-screen, his expression is serious and thoughtful. His mouth is closed. The scene is dimly lit, subtle ambient motion.

Use Runway's advanced tools for scenes without dialogue.
Process:
Combine all assets and apply a professional finish.
Process:
This workflow is for professional VFX artists and studios working on high-budget projects. The goals are complete control over every pixel, photorealistic quality, perfect actor likeness, and integration into a standard, non-destructive VFX pipeline.
The challenge with mainstream tools (Veo, Kling, Runway) is that they are "black boxes" that output compressed formats (like MP4) unsuitable for professional compositing. This workflow avoids them, instead using a modular, open-source pipeline in ComfyUI. This approach treats AI as a controllable render engine, outputting multi-layer OpenEXR files that are composited in a professional VFX application like DaVinci Resolve's Fusion page.
Toolchain:
| Studio Requirement | Core Technology | Functionality | Rationale |
|---|---|---|---|
| Perfect Actor Likeness | Custom LoRA Training | Fine-tunes a model on a specific actor's face/costume. | Achieves true, contract-level likeness for commercial work. |
| Bulletproof Identity Lock | IPAdapter-FaceID Plus v2 | Reinforces facial identity on every frame, post-LoRA. | Final quality check to prevent identity drift in animation. |
| Precise Pose & Scene Layout | Multiple ControlNets (Pose, Depth) | Conditions generation on skeletal pose and 3D depth. | Allows artists to direct the scene with precision. |
| Specific Camera Movement | AnimateDiff + Motion LoRAs | Applies pre-defined motion vectors for pans, dollies, zooms. | Replaces vague motion prompts with controllable camera work. |
| Professional Post-Production | Multi-Layer EXR Export | Renders separate passes (diffuse, specular, matte) into one file. | Enables non-destructive editing of the final image. |
| Final Shot Assembly | DaVinci Resolve (Fusion) | Node-based compositing of EXR layers. | Integrates AI assets into a standard VFX pipeline. |
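To make the "Multiple ControlNets" row more concrete, the sketch below describes a small graph in ComfyUI's API-style workflow format, where each node is a dict entry and links are `[node_id, output_index]` pairs. It covers only core nodes (checkpoint, LoRA, prompts, two chained ControlNets, sampler, decode) and leaves out IPAdapter, AnimateDiff, and EXR export, which come from custom node packs. The checkpoint name, the depth ControlNet file name, the image file names, the negative prompt, and all sampler settings are assumptions for illustration, and the control images are assumed to be preprocessed already.

```python
import json


def build_workflow() -> dict:
    """Return a minimal graph with two ControlNets chained on the positive prompt."""
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": "sd15_base.safetensors"}},  # hypothetical file
        "2": {"class_type": "LoraLoader",
              "inputs": {"model": ["1", 0], "clip": ["1", 1],
                         "lora_name": "MyCharacter_v1.safetensors",
                         "strength_model": 1.0, "strength_clip": 1.0}},
        "3": {"class_type": "CLIPTextEncode",  # positive prompt with trigger word
              "inputs": {"clip": ["2", 1], "text": "photograph of character_ohwx"}},
        "4": {"class_type": "CLIPTextEncode",  # negative prompt (assumed text)
              "inputs": {"clip": ["2", 1], "text": "blurry, low quality"}},
        "5": {"class_type": "ControlNetLoader",
              "inputs": {"control_net_name": "control_v11p_sd15_openpose.pth"}},
        "6": {"class_type": "LoadImage", "inputs": {"image": "pose.png"}},
        "7": {"class_type": "ControlNetApply",  # first layer: skeletal pose
              "inputs": {"conditioning": ["3", 0], "control_net": ["5", 0],
                         "image": ["6", 0], "strength": 1.0}},
        "8": {"class_type": "ControlNetLoader",
              "inputs": {"control_net_name": "control_v11f1p_sd15_depth.pth"}},
        "9": {"class_type": "LoadImage", "inputs": {"image": "depth.png"}},
        "10": {"class_type": "ControlNetApply",  # second layer: depth, fed by the first
               "inputs": {"conditioning": ["7", 0], "control_net": ["8", 0],
                          "image": ["9", 0], "strength": 0.6}},
        "11": {"class_type": "EmptyLatentImage",
               "inputs": {"width": 512, "height": 512, "batch_size": 16}},
        "12": {"class_type": "KSampler",
               "inputs": {"model": ["2", 0], "positive": ["10", 0],
                          "negative": ["4", 0], "latent_image": ["11", 0],
                          "seed": 0, "steps": 20, "cfg": 7.0,
                          "sampler_name": "euler", "scheduler": "normal",
                          "denoise": 1.0}},
        "13": {"class_type": "VAEDecode",
               "inputs": {"samples": ["12", 0], "vae": ["1", 2]}},
        "14": {"class_type": "SaveImage",  # stand-in for an EXR save node
               "inputs": {"images": ["13", 0], "filename_prefix": "Shot_01"}},
    }


def dangling_links(workflow: dict) -> list[tuple[str, str, str]]:
    """List (node_id, input_name, target_id) for links that point at missing nodes."""
    missing = []
    for node_id, node in workflow.items():
        for name, value in node["inputs"].items():
            if isinstance(value, list) and len(value) == 2 and value[0] not in workflow:
                missing.append((node_id, name, value[0]))
    return missing


workflow = build_workflow()
workflow_json = json.dumps(workflow, indent=2)  # serializable for later submission
```

The key design point is that node 10 takes its conditioning from node 7 instead of from the prompt directly, so the sampler sees the prompt shaped by both the pose and the depth guides.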
To achieve an identical likeness, a custom LoRA must be trained on the subject.
Process:
- character_ohwx).
- MyCharacter_v1).
- learning_rate and network_dim/alpha as needed.
- .safetensors LoRA file will be saved to your ComfyUI/models/loras folder, ready for use.

This is the core generation process, structured as a logical node graph where each control system feeds into the next.
Node Chain Walkthrough:
- flux1-dev-fp8.safetensors).
- MyCharacter_v1.safetensors).
- photograph of character_ohwx) and one for the negative prompt.
- Apply ControlNet nodes together to define the scene's physical structure.
- Apply ControlNet node. This node is conditioned on a reference pose image (using control_v11p_sd15_openpose.pth) and a preprocessor like DWPreprocessor.
- Apply ControlNet node, which could be conditioned on a Depth map. This layering creates a highly specific structural guide.
- IPAdapterFaceID node.
- ip-adapter-faceid-plusv2_sd15.bin model to lock the facial identity before animation.
- AnimateDiff Loader node.
- mm_sd_v15_v2.ckpt) and a specific Motion LoRA Loader (e.g., v2_lora_PanLeft.ckpt) to apply a controllable camera motion.
- Empty Latent Image node into the KSampler to generate the image sequence.
- KSampler to a VAE Decode node to convert it to pixel space.

Render the output as a sequence of OpenEXR files, which contain multiple layers of data for post-production.
Process:
- SaveEXR (from the ComfyUI-HQ-Image-Save node pack) or mrv2SaveEXRImage.
- filepath: Set an image sequence path (e.g., D:/Renders/Shot_01/Shot_01_####.exr).
- Compression: Choose lossless compression (PIZ or ZIP).
- sRGB_to_linear: Set to True. Professional VFX pipelines use a linear color space for correct lighting math.
- bit_depth: Set to 32-bit float to preserve maximum color and luminance data (HDR).

Assemble the AI-generated EXR sequences into a final shot.
Process:
- Loader or MediaIn node and import the EXR sequence.
- MediaIn node can contain all render passes. Connect a tool (e.g., a Color Corrector) to the MediaIn node, then use the inspector's dropdown menu to select which layer (diffuse, specular, matte) the tool should affect.
- Channel Booleans node to extract the character matte.
- Color Corrector node to adjust colors.
- Color Corrector to tweak highlights.
- Merge node set to "Plus" or "Add" mode.
- MediaOut node to send the finished shot to the Resolve timeline. This provides full creative control and integrates generative assets into a standard professional VFX pipeline.
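The reasoning behind the linear color space setting and the additive merge can be shown with a little arithmetic. The sketch below uses the standard sRGB decoding curve and a per-channel sum; it is a numeric illustration only, not how Fusion or any EXR node is implemented, and the sample pixel values are made up.

```python
def srgb_to_linear(value: float) -> float:
    """Decode one sRGB-encoded channel value (0.0-1.0) to linear light."""
    if value <= 0.04045:
        return value / 12.92
    return ((value + 0.055) / 1.055) ** 2.4


def additive_merge(diffuse: tuple, specular: tuple) -> tuple:
    """Combine two render passes by adding them channel by channel."""
    return tuple(d + s for d, s in zip(diffuse, specular))


# Mid-grey in sRGB encoding is much darker in linear light, which is why
# lighting math on encoded values gives the wrong result.
mid_grey_linear = srgb_to_linear(0.5)

# Adding a bright highlight to a diffuse base can exceed 1.0. A float EXR
# keeps that value so it can still be graded down later without clipping.
diffuse_pixel = (0.18, 0.18, 0.18)   # made-up linear values
specular_pixel = (0.90, 0.90, 0.90)  # made-up linear values
combined = additive_merge(diffuse_pixel, specular_pixel)
```

Here `combined` lands above 1.0 in every channel, which an 8-bit format would clip to white but a 32-bit float file preserves.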