MLA 027 AI Video End-to-End Workflow

Jul 14, 2025


Prosumers can use Google Veo 3’s "High-Quality Chaining" for fast social media content. Indie filmmakers can achieve narrative consistency by combining Midjourney V7 for style, Kling for lip-synced dialogue, and Runway Gen-4 for camera control, while professional studios gain full control with a layered ComfyUI pipeline to output multi-layer EXR files for standard VFX compositing.

Multimedia Generative AI Mini Series

Resources


Stanford CS236 Deep Generative Models

Show Notes


📺 Heads up: this episode is from 2025, and the field moves fast. For current, weekly coverage of the full AI image and video pipeline, from a single shot to a one-person studio, listen to my new show, AI Video Generation.

Deep-Dive Reports

Open Source TTS

Leaderboards: Video, TTS, Image

AI Audio Tool Selection

Music: Use Suno for complete songs or Udio for high-quality components for professional editing.
Sound Effects: Use ElevenLabs' SFX for integrated podcast production or SFX Engine for large, licensed asset libraries for games and film.
Voice: ElevenLabs gives the most realistic voice output. Murf.ai offers an all-in-one studio for marketing, and Play.ht has a low-latency API for developers.
Open-Source TTS: For local use, StyleTTS 2 generates human-level speech, Coqui's XTTS-v2 is best for voice cloning from minimal input, and Piper TTS is a fast, CPU-friendly option.

I. Prosumer Workflow: Viral Video

Goal: Rapidly produce branded, short-form video for social media. This method bypasses Veo 3's weaker native "Extend" feature.

Toolchain
- Image Concept: GPT-4o (API: GPT-Image-1) for its strong prompt adherence, text rendering, and conversational refinement.
- Video Generation: Google Veo 3 for high single-shot quality and integrated ambient audio.
- Soundtrack: Udio for creating unique, "viral-style" music.
- Assembly: CapCut for its standard short-form editing features.

Workflow
1. Create Character Sheet (GPT-4o): Generate a primary character image with a detailed "locking" prompt, then use conversational follow-ups to create variations (poses, expressions) for visual consistency.
2. Generate Video (Veo 3): Use "High-Quality Chaining."
  - Clip 1: Generate an 8s clip from a character sheet image.
  - Extract Final Frame: Save the last frame of Clip 1.
  - Clip 2: Use the extracted frame as the image input for the next clip, using a "this then that" prompt to continue the action. Repeat as needed.
3. Create Music (Udio): Use Manual Mode with structured prompts ([Genre: ...], [Mood: ...]) to generate and extend a music track.
4. Final Edit (CapCut): Assemble clips, layer the Udio track over Veo's ambient audio, add text, and use "Auto Captions." Export in 9:16.

II. Indie Filmmaker Workflow: Narrative Shorts

Goal: Create cinematic short films with consistent characters and storytelling focus, using a hybrid of specialized tools.

Toolchain
- Visual Foundation: Midjourney V7 to establish character and style with --cref and --sref parameters.
- Dialogue Scenes: Kling for its superior lip-sync and character realism.
- B-Roll/Action: Runway Gen-4 for its Director Mode camera controls and Multi-Motion Brush.
- Voice Generation: ElevenLabs for emotive, high-fidelity voices.
- Edit & Color: DaVinci Resolve for its integrated edit, color, and VFX suite and favorable cost model.

Workflow
1. Create Visual Foundation (Midjourney V7): Generate a "hero" character image. Use its URL with --cref --cw 100 to create consistent character poses and with --sref to replicate the visual style in other shots. Assemble a reference set.
2. Create Dialogue Scenes (ElevenLabs -> Kling):
  - Generate the dialogue track in ElevenLabs and download the audio.
  - In Kling, generate a video of the character from a reference image with their mouth closed.
  - Use Kling's "Lip Sync" feature to apply the ElevenLabs audio to the neutral video for a perfect match.
3. Create B-Roll (Runway Gen-4): Use reference images from Midjourney. Apply precise camera moves with Director Mode or add localized, layered motion to static scenes with the Multi-Motion Brush.
4. Assemble & Grade (DaVinci Resolve): Edit clips and audio on the Edit page. On the Color page, use node-based tools to match shots from Kling and Runway, then apply a final creative look.

III. Professional Studio Workflow: Full Control

Goal: Achieve absolute pixel-level control, actor likeness, and integration into standard VFX pipelines using an open-source, modular approach.

Toolchain
- Core Engine: ComfyUI with Stable Diffusion models (e.g., SD3, FLUX).
- VFX Compositing: DaVinci Resolve (Fusion page) for node-based, multi-layer EXR compositing.

Control Stack & Workflow
1. Train Character LoRA: In ComfyUI, train a custom LoRA on a dataset of 15-30 images of the actor to ensure true likeness.
2. Build ComfyUI Node Graph: Construct a generation pipeline in this order:
  - Loaders: Load base model, custom character LoRA, and text prompts (with LoRA trigger word).
  - ControlNet Stack: Chain multiple ControlNets to define structure (e.g., OpenPose for skeleton, Depth map for 3D layout).
  - IPAdapter-FaceID: Use the Plus v2 model as a final reinforcement layer to lock facial identity before animation.
  - AnimateDiff: Apply deterministic camera motion using Motion LoRAs (e.g., v2_lora_PanLeft.ckpt).
  - KSampler -> VAE Decode: Generate the image sequence.
3. Export Multi-Layer EXR: Use a node like mrv2SaveEXRImage to save the output as an EXR sequence (.exr). Configure for a professional pipeline: 32-bit float, linear color space, and PIZ/ZIP lossless compression. This preserves render passes (diffuse, specular, mattes) in a single file.
4. Composite in Fusion: In DaVinci Resolve, import the EXR sequence. Use Fusion's node graph to access individual layers, allowing separate adjustments to elements like color, highlights, and masks before integrating the AI asset into a final shot with a background plate.


Transcript


Guide to AI Audio Tools: Music, SFX, and Voice

For music, choose based on your goal. For a complete, ready-to-use song for content like a YouTube video, use Suno. For high-quality audio components and inspiration to edit in a professional audio program, use Udio.

For sound effects, podcasters who want to add SFX and narration in the same tool should use ElevenLabs' integrated SFX generator. Game developers and filmmakers who need a large library of unique, licensed assets for a component-based workflow should use a specialized tool like SFX Engine.

For voice generation, ElevenLabs is the best for pure realism. However, different workflows may require other tools. Murf.ai offers an all-in-one studio for marketing teams, and Play.ht has a low-latency API for enterprise developers.

For open-source, local text-to-speech, several models are available. StyleTTS 2 achieves human-level quality and is best for generating natural speech without a reference voice. For local voice cloning with minimal audio input, Coqui's XTTS-v2 is the best model. For applications that need high speed on a CPU, Piper TTS is a lightweight and popular choice. The field is also improving with new models like Kokoro TTS.

Introduction: Three Tiers of Production

In 2025, generative AI is a standard part of professional media production. The best workflow depends on the user's goals and technical skill. This document outlines the most effective production method for three distinct user types:

The Prosumer: A social media manager or content creator who needs to produce high-impact, short-form content (e.g., for TikTok) quickly and easily.
The Independent Filmmaker: A storyteller focused on creating short films with consistent characters and a cinematic feel, who is willing to handle moderate technical complexity.
The Professional Studio: A VFX artist or production team that needs absolute control over the final image, actor likeness, and integration with standard VFX pipelines.

I. Prosumer Workflow: Creating Viral Video with Veo 3 "High-Quality Chaining"

1.1. Overview and Tools

This workflow is for social media managers and marketers who need to produce high-volume, visually engaging short-form content for platforms like TikTok, Instagram Reels, and YouTube Shorts. The main goals are speed and brand consistency.

The problem is that Google's Veo 3, while generating high-quality 8-second clips with integrated ambient audio, has a weak native "Extend" feature. This feature often uses a lower-quality model, causing a noticeable drop in visual consistency. The "High-Quality Chaining" method avoids this by manually connecting high-quality clips.

Toolchain:

Image Generation: GPT-4o (API: GPT-Image-1): Chosen for its excellent prompt adherence and text-rendering ability. It delivers the specific visual you ask for with minimal prompt engineering, and its conversational interface allows for quick visual refinement.
Video Generation: Google Veo 3: Used for its high single-shot quality and native audio generation, which simplifies production by creating video and ambient sound in one step.
Soundtrack Generation: Udio: Used to quickly create custom, catchy soundtracks. It balances ease of use with a manual mode for more control, allowing for the creation of "viral-style" audio that is more unique than stock music.
Final Assembly: CapCut: The standard video editor for short-form content. It is free, cross-platform, and has an intuitive interface with effects and text styles designed for TikTok and Reels.

Production Stage	Selected Tool	Strengths for Prosumer	Rationale
Character/Scene Concept	GPT-4o (GPT-Image-1)	Strong prompt adherence, text rendering, conversational refinement.	Delivers predictable visuals with minimal iteration, good for brand work.
Core Video Generation	Google Veo 3	High single-shot quality, integrated audio generation.	Simplifies production by combining video and ambient sound.
Music & Soundtrack	Udio	Fast generation, genre flexibility, manual mode for hooks.	Creates custom, "viral-style" audio that stands out.
Final Edit & Assembly	CapCut	Intuitive UI, cross-platform, built-in social media text/effects.	Industry standard for Reels/TikTok, allowing for fast deployment.

1.2. Step 1: Create a Character Sheet with GPT-4o

To ensure visual consistency, start by creating a set of reference images that define your character.

Process:

Write a "Locking" Prompt: Create a detailed prompt that defines the character's unchangeable features (e.g., facial structure, eye color) and variable features (e.g., clothing, expression).
- Example Locking Prompt: photorealistic headshot of "Aya," a 25-year-old Japanese woman with a sharp jawline, intelligent dark brown eyes, a small beauty mark under her left eye, and short, asymmetrical black hair with a single streak of electric blue. She is wearing a minimalist grey turtleneck sweater. The background is a solid, neutral grey. Professional studio lighting, subject has a neutral expression, 8k, hyperdetailed, sharp focus.
Generate the Main Image: Use the locking prompt in GPT-4o to generate the primary, front-facing image of the character.
Refine with Conversational Prompts: Use GPT-4o's conversational memory to create variations.
- Example Follow-up Prompts:
  - Perfect. Now, using the exact same character, show her from a 3/4 angle, looking slightly off-camera with a small, confident smile.
  - Excellent. For the next shot, keep the character identical but have her look surprised, with her hand partially covering her mouth. Keep the studio lighting and grey background.
  - One more. Show the same character from a side profile, looking thoughtful.
Assemble the Character Sheet: Collect 3-5 of these images. This set will be used as the source material for all video generation to ensure consistency.
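
The split between unchangeable and variable features can be kept explicit by templating the prompt. The following is a minimal sketch, not part of GPT-4o itself; the `CharacterLock` class and `build_prompt` function are hypothetical helpers that only assemble text to paste into the chat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterLock:
    """Features that must stay identical in every image of the character."""
    name: str
    fixed_traits: tuple


def build_prompt(lock: CharacterLock, framing: str, outfit: str, expression: str) -> str:
    """Combine the locked traits with the per-shot variable features."""
    traits = ", ".join(lock.fixed_traits)
    return (
        f'{framing} of "{lock.name}," {traits}. '
        f"Wearing {outfit}. Expression: {expression}. "
        "The background is a solid, neutral grey. Professional studio lighting."
    )


aya = CharacterLock(
    name="Aya",
    fixed_traits=(
        "a 25-year-old Japanese woman with a sharp jawline",
        "intelligent dark brown eyes",
        "a small beauty mark under her left eye",
        "short, asymmetrical black hair with a single streak of electric blue",
    ),
)

front = build_prompt(aya, "photorealistic headshot", "a minimalist grey turtleneck sweater", "neutral")
profile = build_prompt(aya, "photorealistic side profile", "a minimalist grey turtleneck sweater", "thoughtful")
```

Because every prompt is built from the same `fixed_traits`, only the framing, outfit, and expression change between shots.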

1.3. Step 2: Use the High-Quality Chaining Method in Veo 3

This technique creates a video longer than 8 seconds while maintaining Veo 3's highest quality. It uses the final frame of one clip as the starting image for the next.

Process:

Generate Clip 1: In Veo 3, select the image-to-video option. Upload the main image of "Aya" from your character sheet.
Write the Prompt for Clip 1: Describe the initial action.
- Example Prompt: A photorealistic video of Aya. She is sitting in a modern, minimalist cafe. She looks directly at the camera, then turns her head to look out a large window at a rainy city street. The camera performs a slow, subtle push-in on her face. Include the ambient sounds of rain against the window and the soft murmur of a cafe.
Extract the Final Frame: After the clip generates, pause on the very last frame and take a high-resolution screenshot or use a frame-saving feature.
Generate Clip 2: Start a new generation. Upload the final frame of Clip 1 as the source image. This creates perfect visual continuity.
Write the Prompt for Clip 2 (The "This Then That" Method): Write a prompt that continues the action as a sequence of steps ("she does this, then that"). This sequential phrasing is the most effective prompting method for Veo 3.
- Example Prompt: Continuing from this exact moment, Aya turns her head back from the window to face forward. A look of sudden realization crosses her face. She picks up a white ceramic coffee mug from the table in front of her and brings it to her lips. The camera remains steady.
Repeat the Cycle: Continue this process - generate clip, extract final frame, use frame as input for the next clip - until you reach your desired video length. This method bypasses Veo 3's lower-quality internal sequencing tool.
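
Instead of taking a screenshot, the final frame can be saved at full resolution with FFmpeg. This is a sketch under assumptions: it assumes the clip has been downloaded as a local file and that `ffmpeg` is installed and on the PATH. The file names are placeholders.

```python
import subprocess
from pathlib import Path


def last_frame_command(clip: Path, frame_out: Path, tail_seconds: float = 1.0) -> list:
    """Build an ffmpeg command that keeps only the final frame of a clip."""
    return [
        "ffmpeg", "-y",
        "-sseof", f"-{tail_seconds}",  # start decoding this many seconds before the end
        "-i", str(clip),
        "-update", "1",                # keep overwriting a single image; the last write wins
        str(frame_out),
    ]


def extract_last_frame(clip: Path, frame_out: Path) -> Path:
    """Run ffmpeg and return the path of the saved frame."""
    subprocess.run(last_frame_command(clip, frame_out), check=True)
    return frame_out


# Placeholder file names for illustration.
command = last_frame_command(Path("clip_01.mp4"), Path("clip_01_last.png"))
```

Calling `extract_last_frame` on each finished clip yields the source image for the next generation in the chain.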

1.4. Step 3: Create a Soundtrack with Udio

Create a custom, high-quality music track for your video.

Process:

Activate Manual Mode: Use Udio's "Manual Mode" to prevent the AI from rewriting your prompt.
Use Structured Prompts: Use tags to define musical elements.
- Example Prompt: [Genre: Lofi Hip-Hop, Chillwave], [Mood: melancholy, pensive, hopeful, rainy day]
Generate and Extend: Create a 33-second clip. Choose the best variation and use the "Extend" feature to add an [Intro] or [Outro], or to generate more sections to reach your desired length. For songs with lyrics, write them directly in the interface and use tags like [Verse] and [Chorus] to structure the song.
Download Track: Download the final audio file.
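
The bracketed tag format is easy to generate from structured data, which helps when producing many prompt variations. A minimal sketch follows; `udio_tag_prompt` is a hypothetical helper that only builds the text to paste into Manual Mode.

```python
def udio_tag_prompt(tags: dict) -> str:
    """Render {"Genre": [...], "Mood": [...]} as "[Genre: a, b], [Mood: c, d]"."""
    return ", ".join(
        f"[{label}: {', '.join(values)}]" for label, values in tags.items()
    )


prompt = udio_tag_prompt({
    "Genre": ["Lofi Hip-Hop", "Chillwave"],
    "Mood": ["melancholy", "pensive", "hopeful", "rainy day"],
})
```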

1.5. Step 4: Assemble and Polish in CapCut

Combine all assets into a final video.

Process:

Import Assets: Open CapCut and import your Veo 3 video clips and the Udio soundtrack.
Assemble the Timeline: Place the video clips on the timeline in order. The cuts should be seamless. If needed, use a short 2-4 frame cross-dissolve to smooth any minor jumps.
Mix the Audio: Place the Udio music track on an audio layer. Lower the volume of the Veo 3 clips' native ambient audio so it sits "under" the music, providing texture without overpowering the soundtrack.
Add Text and Captions: Use the text tool for branding or calls to action. Use the "Auto Captions" feature to generate subtitles, which is important for viewer retention on platforms where users watch with sound off.
Export: Export the final video in a 9:16 aspect ratio at 1080x1920 resolution (24 or 30 fps).
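
If the generated clips are landscape, the 9:16 export needs a crop. The arithmetic is sketched below, assuming a centered crop and even pixel dimensions (which most encoders expect). `center_crop_box` and `crop_filter` are illustrative helpers, and the 1920x1080 source size is an assumption about the clips, not a documented Veo 3 output size.

```python
def center_crop_box(src_w: int, src_h: int, ratio_w: int = 9, ratio_h: int = 16) -> tuple:
    """Return (x, y, width, height) of the largest centered crop with the target ratio."""
    if src_w * ratio_h > src_h * ratio_w:
        # Source is wider than the target: keep full height, trim the sides.
        width = (src_h * ratio_w // ratio_h) // 2 * 2  # round down to an even number
        return ((src_w - width) // 2, 0, width, src_h)
    # Source is as narrow as or narrower than the target: keep full width, trim top and bottom.
    height = (src_w * ratio_h // ratio_w) // 2 * 2
    return (0, (src_h - height) // 2, src_w, height)


def crop_filter(box: tuple) -> str:
    """Format the box as an ffmpeg crop filter string (crop=w:h:x:y)."""
    x, y, width, height = box
    return f"crop={width}:{height}:{x}:{y}"


box = center_crop_box(1920, 1080)
```

The cropped region is then scaled up to 1080x1920 on export, so expect some softening when the source is only 1080 pixels tall.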

II. Indie Filmmaker Workflow: A Hybrid Approach for Narrative Shorts

2.1. Overview and Tools

This workflow is for creators making narrative short films (1-5 minutes). The focus is on storytelling, character consistency, and cinematic look. This requires managing more technical complexity.

The challenge is that no single video model excels at all types of shots (e.g., dialogue, action, establishing shots). The solution is a hybrid pipeline that uses specialized tools for each task.

Toolchain:

Image Generation: Midjourney V7: Chosen for creating the film's visual foundation. Its --cref (character reference) and --sref (style reference) parameters are used to establish a consistent character and cinematic look that can be used as a master reference across all other tools.
Video Generation (Hybrid):
- Kling: Used for all dialogue scenes due to its high-quality lip-sync and ability to render realistic characters with plausible physics and micro-expressions.
- Runway Gen-4: Used for all other shots (establishing, action, B-roll). Its Director Mode offers precise camera controls (pan, tilt, zoom) and a Multi-Motion Brush for adding localized motion.
Voice/Dialogue: ElevenLabs: Used to generate high-quality, emotive synthetic voices that can be convincingly synced to video.
Final Assembly, Color & Finishing: DaVinci Resolve: Chosen over Adobe Premiere Pro because its powerful free version and one-time purchase for the Studio version are more economical. It offers better performance with demanding codecs and integrates a top-tier color grading suite and node-based VFX environment (Fusion) in one application.

Narrative Function	Selected Tool	Strengths for Narrative	Rationale
Character/World Design	Midjourney V7	Character (--cref) & Style (--sref) Reference, cinematic quality.	Creates a master visual blueprint for consistency across all tools.
Dialogue Scenes	Kling	Superior lip-sync, high-fidelity character realism, physics simulation.	Essential for believable dialogue scenes.
Cinematic B-Roll/Action	Runway Gen-4	Advanced Camera Controls, Multi-Motion Brush, Director Mode.	Provides creative control over non-dialogue shots.
Voice Generation	ElevenLabs	High emotional fidelity, voice cloning, natural cadence.	Delivers performances that can be convincingly synced.
Edit, Color & Finish	DaVinci Resolve	All-in-one suite (Edit, Color, Fusion), performance, cost.	A professional, non-subscription editor with industry-best color tools.

2.2. Step 1: Create a Visual Foundation in Midjourney

Establish the film's visual language by creating a master set of reference images for the main character and aesthetic.

Process:

Generate the "Hero" Character Image: Create a single, definitive image of the main character using a cinematic prompt.
- Example Prompt: cinematic film still, a weary male detective, "Jack," mid-40s, unshaven with a five o'clock shadow, tired eyes, wearing a rumpled brown trench coat, standing on a rain-slicked neon-lit street in Tokyo at night. photorealistic, anamorphic lens flare, 35mm film grain, moody, noir aesthetic --ar 16:9 --style raw --v 7
Establish Character Reference with --cref: Get the URL of the hero image. Use it with the --cref parameter to generate new images of the same character in different poses while locking in their features.
- Example Prompt: a cinematic film still of the same man sitting in a dimly lit ramen bar, looking at a case file on the counter. --cref <URL of hero image> --cw 100 --v 7
- The --cw (character weight) parameter (0-100) controls how closely the new image matches the reference. Use 100 for maximum consistency.
Establish Style Reference with --sref: Use the same hero image URL with the --sref parameter to generate images that share the original's visual style (color, lighting, mood) without including the character.
- Example Prompt: cinematic film still, an empty, rain-slicked alleyway at night in Tokyo, neon signs reflecting in puddles on the ground. --sref <URL of hero image> --sw 800 --v 7
- The --sw (style weight) parameter (0-1000) controls the strength of the style transfer. Higher values follow the reference style more closely.
Create a Reference Set: Generate a "shot list" of 5-10 key still images (close-ups, medium shots, establishing shots). This visual guide will be used in Kling and Runway to maintain consistency.
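
When building a shot list of 5-10 prompts, a small helper keeps the reference parameters consistent. This is a sketch that only assembles prompt text using the parameters as written above; `mj_prompt` is a hypothetical function and the URL is a placeholder.

```python
def mj_prompt(description, *, cref=None, cw=None, sref=None, sw=None, version="7"):
    """Append character/style reference parameters to a Midjourney prompt."""
    parts = [description]
    if cref:
        parts += ["--cref", cref]
        if cw is not None:
            if not 0 <= cw <= 100:
                raise ValueError("--cw must be between 0 and 100")
            parts += ["--cw", str(cw)]
    if sref:
        parts += ["--sref", sref]
        if sw is not None:
            parts += ["--sw", str(sw)]
    parts += ["--v", version]
    return " ".join(parts)


HERO_URL = "https://example.com/hero.png"  # placeholder, not a real image

character_shot = mj_prompt(
    "a cinematic film still of the same man sitting in a dimly lit ramen bar, "
    "looking at a case file on the counter.",
    cref=HERO_URL, cw=100,
)
style_shot = mj_prompt(
    "cinematic film still, an empty, rain-slicked alleyway at night in Tokyo.",
    sref=HERO_URL, sw=800,
)
```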

2.3. Step 2: Create Dialogue Scenes with the ElevenLabs-to-Kling Pipeline

To get the best lip-sync, generate the audio and video separately and then combine them. Generating video from a prompt that includes dialogue often results in poor sync.

Process:

Generate Voice Track in ElevenLabs:
- Write or upload your dialogue script to ElevenLabs.
- Choose a pre-made voice or clone a custom one. Adjust stability and clarity sliders for a natural, emotive performance.
- Download the final dialogue as a high-quality audio file.
Generate "Neutral" Video in Kling:
- In Kling's Image-to-Video function, upload a close-up or medium shot of your character from the Midjourney reference set.
- Write a prompt describing the character's emotion and the scene, but explicitly state they are not speaking (e.g., "His mouth is closed"). This prevents random mouth movements.
- Example Prompt: A cinematic close-up of the detective, Jack. He listens intently to someone off-screen, his expression is serious and thoughtful. His mouth is closed. The scene is dimly lit, subtle ambient motion.
- Generate a 5-10 second clip.
Apply Lip-Sync in Kling:
- Select the neutral video you just generated. Use the "Lip Sync" or "Add Audio" feature.
- Upload the corresponding dialogue track from ElevenLabs.
- Kling will process the video, mapping the audio phonemes to the character's face to create a highly accurate lip-synced performance.
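
The voice track can also be generated programmatically. The sketch below builds a request for the ElevenLabs text-to-speech REST endpoint as I understand it; treat the endpoint path, header name, model ID, and `voice_settings` fields as assumptions to verify against the current API reference. The API key and voice ID are placeholders, and nothing is sent until `synthesize_to_file` is called.

```python
import json
import urllib.request

API_ROOT = "https://api.elevenlabs.io/v1/text-to-speech"  # assumed endpoint root


def build_tts_request(api_key, voice_id, text, stability=0.5, similarity_boost=0.75):
    """Build (but do not send) a POST request for one line of dialogue."""
    body = {
        "text": text,
        "model_id": "eleven_multilingual_v2",  # assumed model ID
        "voice_settings": {"stability": stability, "similarity_boost": similarity_boost},
    }
    return urllib.request.Request(
        f"{API_ROOT}/{voice_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        method="POST",
    )


def synthesize_to_file(request, out_path):
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as handle:
        handle.write(response.read())


request = build_tts_request("YOUR_API_KEY", "YOUR_VOICE_ID", "I've seen this case before.")
```

The saved audio file is what gets uploaded to Kling's lip-sync step.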

2.4. Step 3: Create Cinematic B-Roll with Runway Gen-4

Use Runway's advanced tools for scenes without dialogue.

Process:

Use Advanced Camera Control:
- To create a specific camera move, upload a reference image from your Midjourney set to Runway.
- Use the Director Mode camera controls to set a precise "Dolly," "Pan," or other movement. This gives you more deterministic control than text-based motion prompts.
Use the Multi-Motion Brush:
- This tool adds subtle life to static shots. Upload a static image (e.g., a car parked on a street).
- Use up to five brushes to animate different parts of the image independently.
  - Brush 1: Paint over tree leaves and set a low "Ambient Motion" to simulate a breeze.
  - Brush 2: Paint over a puddle and set a different subtle "Ambient Motion" for a ripple effect.
  - Brush 3: Paint over distant trees with a very low "Ambient Motion" to create a parallax effect.
- This technique adds layers of realism and production value.
Maintain Consistency: Always use a corresponding image from your Midjourney reference set as the source for every shot generated in Runway.

2.5. Step 4: Assemble and Color Grade in DaVinci Resolve

Combine all assets and apply a professional finish.

Process:

Import Assets: In Resolve, import all generated assets: lip-synced clips from Kling, B-roll from Runway, high-quality audio from ElevenLabs, and any music or SFX.
Edit the Narrative: Assemble the clips on the Edit page timeline. Place the clean audio tracks from ElevenLabs on separate audio layers, ensuring they are perfectly synced.
Perform a Professional Color Grade:
- On the Color page, use Resolve's node-based tools and scopes to match the color and contrast of clips from Kling and Runway.
- Once shots are matched, apply a final creative "look." For a noir film, you might desaturate colors, crush blacks, and add a cool blue tint to the shadows.
Export: On the Deliver page, choose an export preset. A high-bitrate H.264 or H.265 is good for web distribution, while a professional codec like Apple ProRes is better for festival submission or archival.

III. Professional Studio Workflow: Full Control with ComfyUI and Fusion

3.1. Overview and Tools

This workflow is for professional VFX artists and studios working on high-budget projects. The goals are complete control over every pixel, photorealistic quality, perfect actor likeness, and integration into a standard, non-destructive VFX pipeline.

The challenge with mainstream tools (Veo, Kling, Runway) is that they are "black boxes" that output compressed formats (like MP4) unsuitable for professional compositing. This workflow avoids them, instead using a modular, open-source pipeline in ComfyUI. This approach treats AI as a controllable render engine, outputting multi-layer OpenEXR files that are composited in a professional VFX application like DaVinci Resolve's Fusion page.

Toolchain:

Core Engine: Stable Diffusion Ecosystem: The only platform with the necessary open, modular architecture. This workflow uses models like SD3 or FLUX within the ComfyUI interface.
Control Layers (The Stack): Precision is achieved by layering multiple independent controls:
1. Custom LoRA Training: A LoRA (Low-Rank Adaptation) is trained on an actor's face or costume to embed their specific visual data, ensuring true likeness.
2. Multiple ControlNets: A stack of ControlNets dictates scene structure. For example, OpenPose controls the character's skeleton, a Depth map controls 3D layout, and Lineart preserves details.
3. IPAdapter-FaceID: After the LoRA and ControlNets, IPAdapter-FaceID (specifically the Plus v2 model) is applied as a final reinforcement layer to lock in facial identity and prevent drift during animation.
4. AnimateDiff + Motion LoRAs: Instead of text prompts for motion, this workflow uses the AnimateDiff module with specific Motion LoRAs trained on precise camera movements (e.g., "pan_left," "dolly_zoom") for deterministic, controllable cinematography.
VFX Compositing: DaVinci Resolve (Fusion Page): Resolve 20 and later versions have strong native support for multi-layer EXR compositing in the Fusion page. Its node-based architecture is ideal for this pipeline.

Studio Requirement	Core Technology	Functionality	Rationale
Perfect Actor Likeness	Custom LoRA Training	Fine-tunes a model on a specific actor's face/costume.	Achieves true, contract-level likeness for commercial work.
Bulletproof Identity Lock	IPAdapter-FaceID Plus v2	Reinforces facial identity on every frame, post-LoRA.	Final quality check to prevent identity drift in animation.
Precise Pose & Scene Layout	Multiple ControlNets (Pose, Depth)	Conditions generation on skeletal pose and 3D depth.	Allows artists to direct the scene with precision.
Specific Camera Movement	AnimateDiff + Motion LoRAs	Applies pre-defined motion vectors for pans, dollies, zooms.	Replaces vague motion prompts with controllable camera work.
Professional Post-Production	Multi-Layer EXR Export	Renders separate passes (diffuse, specular, matte) into one file.	Enables non-destructive editing of the final image.
Final Shot Assembly	DaVinci Resolve (Fusion)	Node-based compositing of EXR layers.	Integrates AI assets into a standard VFX pipeline.

3.2. Step 1: Train a Character LoRA in ComfyUI

To achieve an identical likeness, a custom LoRA must be trained on the subject.

Process:

Prepare the Dataset:
- Gather 15-30 high-quality images of your subject in various lighting, expressions, and angles.
- Crop and resize all images to a consistent square resolution (e.g., 1024x1024).
Automate Captioning:
- Training images must be captioned with descriptions that do not include the trigger word you will use to activate the LoRA.
- In ComfyUI, use a workflow with a BLIP Captioning Node to automatically generate a descriptive .txt file for each image.
Train the LoRA in ComfyUI:
- Use a dedicated LoRA training workflow, like the one in the ComfyUI-FluxTrainer custom node pack by Kijai.
- Configure the Trainer Node:
  - Image Path: Point to your image dataset directory.
  - Trigger Word: Define a unique token to activate the LoRA (e.g., character_ohwx).
  - Output Name: Name your LoRA file (e.g., MyCharacter_v1).
  - Training Parameters: Set epochs (full passes over the dataset) between 50 and 100. Adjust the learning_rate and network_dim/alpha as needed.
- Execute Training: Queue the prompt. The final .safetensors LoRA file will be saved to your ComfyUI/models/loras folder, ready for use.
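
Before queueing a long training run, it is worth checking the dataset against the rules above. The following is a minimal sketch; `check_dataset` is a hypothetical helper, and it assumes each caption is a .txt file that shares its image's base name.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def check_dataset(folder: Path, trigger_word: str) -> list:
    """Return a list of human-readable problems found in a LoRA training folder."""
    problems = []
    images = sorted(p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)
    if not 15 <= len(images) <= 30:
        problems.append(f"expected 15-30 images, found {len(images)}")
    for image in images:
        caption = image.with_suffix(".txt")
        if not caption.exists():
            problems.append(f"{image.name}: missing caption file")
        elif trigger_word.lower() in caption.read_text(encoding="utf-8").lower():
            problems.append(f"{image.name}: caption contains the trigger word")
    return problems
```

An empty list means the folder has the expected image count, every image has a caption, and no caption leaks the trigger word.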

3.3. Step 2: Build the ComfyUI Video Pipeline

This is the core generation process, structured as a logical node graph where each control system feeds into the next.

Node Chain Walkthrough:

Loaders: Load all assets first.
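- Load Checkpoint: Loads the base model (e.g., flux1-dev-fp8.safetensors). Every downstream component must match the base model's architecture: the ControlNet, IPAdapter, and AnimateDiff files named in this walkthrough are SD 1.5 files, so they require an SD 1.5 checkpoint, while a FLUX checkpoint needs FLUX-specific equivalents.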
- Load LoRA: Loads your custom LoRA (MyCharacter_v1.safetensors).
- CLIP Text Encode (Prompt): Create two nodes, one for the positive prompt (which must include your trigger word, e.g., photograph of character_ohwx) and one for the negative prompt.
The ControlNet Stack: Chain multiple Apply ControlNet nodes together to define the scene's physical structure.
- The CONDITIONING output from the prompt node is fed into the first Apply ControlNet node. This node is conditioned on a reference pose image (using control_v11p_sd15_openpose.pth) and a preprocessor like DWPreprocessor.
- The CONDITIONING output of this first node is then piped into a second Apply ControlNet node, which could be conditioned on a Depth map. This layering creates a highly specific structural guide.
IPAdapter-FaceID:
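- Pass the MODEL output from the LoRA loader into an IPAdapterFaceID node. IPAdapter patches the model rather than the prompt conditioning, so the CONDITIONING output from the ControlNet stack goes directly to the KSampler.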
- This node also takes a clean reference photo of the actor's face and uses the ip-adapter-faceid-plusv2_sd15.bin model to lock the facial identity before animation.
AnimateDiff:
- Pass the fully conditioned MODEL to an AnimateDiff Loader node.
- Load a base motion model (e.g., mm_sd_v15_v2.ckpt) and a specific Motion LoRA Loader (e.g., v2_lora_PanLeft.ckpt) to apply a controllable camera motion.
KSampler:
- Feed the final conditioned model, prompts, and an Empty Latent Image node into the KSampler to generate the image sequence.
VAE Decode and Output:
- Pass the output LATENT from the KSampler to a VAE Decode node to convert it to pixel space.
- Pipe the final IMAGE output to the EXR saving nodes.
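
The chaining described above can be sketched in ComfyUI's API-format workflow, where each node is an entry with a `class_type` and `inputs`, and a link is written as `[source_node_id, output_index]`. This sketch uses only core nodes as I understand them and leaves out the IPAdapter and AnimateDiff custom nodes, whose class names vary by node pack. It assumes an SD 1.5 checkpoint to match the ControlNet files, and the checkpoint and hint image file names are hypothetical.

```python
class Graph:
    """Tiny builder for an API-format ComfyUI workflow dictionary."""

    def __init__(self):
        self.nodes = {}

    def add(self, class_type, **inputs):
        node_id = str(len(self.nodes) + 1)
        self.nodes[node_id] = {"class_type": class_type, "inputs": inputs}
        return node_id


def build_graph(checkpoint, lora, positive, negative, controlnets, frames=16):
    """controlnets is a list of (controlnet_file, preprocessed_hint_image) pairs."""
    g = Graph()
    ckpt = g.add("CheckpointLoaderSimple", ckpt_name=checkpoint)
    lora_id = g.add("LoraLoader", model=[ckpt, 0], clip=[ckpt, 1], lora_name=lora,
                    strength_model=1.0, strength_clip=1.0)
    pos = g.add("CLIPTextEncode", text=positive, clip=[lora_id, 1])
    neg = g.add("CLIPTextEncode", text=negative, clip=[lora_id, 1])

    conditioning = [pos, 0]
    for cn_file, hint_image in controlnets:
        loader = g.add("ControlNetLoader", control_net_name=cn_file)
        hint = g.add("LoadImage", image=hint_image)
        # Each Apply ControlNet node consumes the previous node's CONDITIONING.
        applied = g.add("ControlNetApply", conditioning=conditioning,
                        control_net=[loader, 0], image=[hint, 0], strength=1.0)
        conditioning = [applied, 0]

    latent = g.add("EmptyLatentImage", width=512, height=512, batch_size=frames)
    sampler = g.add("KSampler", model=[lora_id, 0], positive=conditioning,
                    negative=[neg, 0], latent_image=[latent, 0], seed=0, steps=20,
                    cfg=7.0, sampler_name="euler", scheduler="normal", denoise=1.0)
    g.add("VAEDecode", samples=[sampler, 0], vae=[ckpt, 2])
    return g.nodes


workflow = build_graph(
    checkpoint="sd15_base.safetensors",  # hypothetical SD 1.5 checkpoint name
    lora="MyCharacter_v1.safetensors",
    positive="photograph of character_ohwx",
    negative="blurry, low quality",
    controlnets=[
        ("control_v11p_sd15_openpose.pth", "pose_hint.png"),    # hint images are placeholders
        ("control_v11f1p_sd15_depth.pth", "depth_hint.png"),
    ],
)
```

The resulting dictionary has the shape ComfyUI expects when a workflow is queued through its HTTP API; the MODEL link into the KSampler is where the IPAdapter and AnimateDiff nodes would be spliced in.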

3.4. Step 3: Export Multi-Layer EXR Sequences for VFX

Render the output as a sequence of OpenEXR files, which contain multiple layers of data for post-production.

Process:

Isolate Render Passes: To separate elements for compositing, either run the generation multiple times with slight changes (e.g., rendering once with a green screen background to create a character matte) or use custom nodes (like AIO_Layered_Output) to split the generation into passes (diffuse, specular, mask) in one workflow.
Use a Dedicated SaveEXR Node:
- Pipe the final IMAGE output from the VAE Decode node into a saving node like SaveEXR (from the ComfyUI-HQ-Image-Save node pack) or mrv2SaveEXRImage.
- Configuration:
  - filepath: Set an image sequence path (e.g., D:/Renders/Shot_01/Shot_01_####.exr).
  - Compression: Choose lossless compression (PIZ or ZIP).
  - sRGB_to_linear: Set to True. Professional VFX pipelines use a linear color space for correct lighting math.
  - bit_depth: Set to 32-bit float to preserve maximum color and luminance data (HDR).
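
Two of these settings are simple to illustrate outside ComfyUI. The sketch below shows the standard sRGB-to-linear transfer function applied to one normalized channel value, and how a `####` placeholder expands to a zero-padded frame number. Both functions are illustrative helpers, not the save node's actual implementation.

```python
def srgb_to_linear(value: float) -> float:
    """Convert one sRGB-encoded channel value in [0, 1] to linear light."""
    if value <= 0.04045:
        return value / 12.92
    return ((value + 0.055) / 1.055) ** 2.4


def frame_path(pattern: str, frame: int) -> str:
    """Replace a single run of '#' characters with the zero-padded frame number."""
    pad = pattern.count("#")
    if pad == 0:
        raise ValueError("pattern has no '#' placeholder")
    return pattern.replace("#" * pad, f"{frame:0{pad}d}")


mid_grey = srgb_to_linear(0.5)
path = frame_path("D:/Renders/Shot_01/Shot_01_####.exr", 12)
```

Mid-grey in sRGB (0.5) maps to roughly 0.214 in linear light, which is why skipping the conversion makes composited lighting look wrong.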

3.5. Step 4: Composite the Shot in DaVinci Resolve Fusion

Assemble the AI-generated EXR sequences into a final shot.

Process:

Import EXR Sequence: In Resolve's Fusion page, add a Loader or MediaIn node and import the EXR sequence.
Access Layers: With Resolve 20+, a single MediaIn node can contain all render passes. Connect a tool (e.g., a Color Corrector) to the MediaIn node, then use the inspector's dropdown menu to select which layer (diffuse, specular, matte) the tool should affect.
Re-Assemble the Shot with Nodes:
- Use a Channel Booleans node to extract the character matte.
- Pipe the diffuse pass into its own Color Corrector node to adjust colors.
- Pipe the specular pass into a separate Color Corrector to tweak highlights.
- Combine them with a Merge node set to "Plus" or "Add" mode.
- Use the extracted character matte as a mask for other effects, like applying a glow only to the character.
Final Integration: Merge the re-assembled AI character over a background plate. Add final touches like camera shake, lens distortion, and a unifying color grade to the entire composite. Connect the final node to the MediaOut node to send the finished shot to the Resolve timeline. This provides full creative control and integrates generative assets into a standard professional VFX pipeline.
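
The per-pixel math behind this node tree is short. The sketch below shows it conceptually for a single RGB pixel: the passes are summed, then the character is placed over the plate using the matte. It assumes unpremultiplied foreground values and is not Fusion's actual implementation.

```python
def plus(a, b):
    """Additive recombination of two render passes (e.g., diffuse + specular)."""
    return tuple(x + y for x, y in zip(a, b))


def over(foreground, matte, background):
    """Place the foreground over the background, weighted by the matte (0 to 1)."""
    return tuple(f * matte + b * (1.0 - matte) for f, b in zip(foreground, background))


diffuse = (0.20, 0.10, 0.05)
specular = (0.10, 0.10, 0.10)
plate = (0.0, 0.0, 0.5)

beauty = plus(diffuse, specular)      # re-assembled character pixel
edge_pixel = over(beauty, 0.5, plate)  # half-covered pixel on the character's edge
```

Because the passes stay separate until `plus`, each one can be graded on its own before recombination, which is the point of the multi-layer export.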
