A unified framework for generating human videos from text, image, and audio. HUMO AI maintains subject identity, follows prompts, and aligns motion with sound using a progressive training strategy and time-adaptive guidance.
Demonstration video of HUMO AI multimodal-conditioned human video generation.
HUMO AI focuses on people in motion. It blends text, reference images, and audio into a controlled generation process designed for prompt following, identity consistency, and audio-visual sync.
Generate people-centric videos that maintain identity and appearance across frames and scenes.
Balance text, image, and audio inputs for prompt following, subject consistency, and motion sync.
Adjust frames, resolution, and guidance scales to fit creative and technical goals.
Minimal-invasive image injection preserves the subject’s identity while retaining the model’s core generation and prompt-following abilities.
Alongside audio cross-attention, a focus-by-predicting strategy helps the model associate audio with facial areas to improve sync.
Time-adaptive classifier-free guidance lets the weight on each condition change across denoising steps, balancing text, image, and audio at different stages of sampling.
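As a rough illustration of how time-adaptive guidance can work, the sketch below keeps separate per-step weights for the text and audio terms and combines three denoiser passes. The denoise callable, the three-pass composition, and the schedule values are illustrative assumptions, not HUMO's exact formulation.

import numpy as np

def guidance_schedule(num_steps, start, end):
    # Linearly interpolate a guidance weight across denoising steps (assumed schedule shape).
    return np.linspace(start, end, num_steps)

def time_adaptive_cfg(denoise, x_t, step, text_cond, audio_cond, scale_t_sched, scale_a_sched):
    # Three passes: unconditional, text-only, and text + audio (hypothetical decomposition).
    eps_uncond = denoise(x_t, step, text=None, audio=None)
    eps_text = denoise(x_t, step, text=text_cond, audio=None)
    eps_full = denoise(x_t, step, text=text_cond, audio=audio_cond)
    w_t = scale_t_sched[step]   # text weight at this step
    w_a = scale_a_sched[step]   # audio weight at this step
    return (eps_uncond
            + w_t * (eps_text - eps_uncond)   # text guidance term
            + w_a * (eps_full - eps_text))    # audio guidance term on top of text

# Illustrative schedules over 50 sampling steps, centered on the config values shown later.
scale_t_sched = guidance_schedule(50, 7.5, 7.5)   # constant text weight
scale_a_sched = guidance_schedule(50, 1.0, 3.0)   # audio weight changing across steps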
Multimodal processing overview
Customize appearance using a reference image and guide it with a text prompt for scene, clothing, or actions.
Generate audio-synced motion driven by speech or music, with a text description setting the content and scene.
Combine identity control from images with prompt control and audio-driven motion for the most fine-grained output.
Typical setup: 97 frames at 25 FPS, 720p for quality. Adjust guidance scales for text and audio to balance prompt adherence and motion sync.
Prepare text, reference images, and audio as described in test_case.json; a rough sketch of the input layout follows this list.
For videos longer than 97 frames, expect reduced quality until checkpoints trained on longer sequences are released.
Multi-GPU inference is supported using FSDP and sequence parallelism.
480p and 720p are available. 720p usually gives better detail.
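The exact input schema lives in the repository's test_case.json; the sketch below only illustrates the general shape of one text+image+audio (TIA) case and writes it to disk. The field names (prompt, ref_img, audio) and the file paths are assumptions for illustration.

import json

# Hypothetical structure; check the repository's test_case.json for the actual field names.
cases = {
    "case_1": {
        "prompt": "A woman in a red coat sings on a rainy street at night.",
        "ref_img": "assets/ref/woman_red_coat.png",   # omit the reference image for TA mode
        "audio": "assets/audio/singing_clip.wav",
    }
}

with open("test_case.json", "w") as f:
    json.dump(cases, f, indent=2)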
Configure HUMO AI with a simple YAML file to set frame count, resolution, and guidance weights.
Frames, height, width, and timesteps shape output length and quality. Guidance scales balance prompt adherence and motion sync.
Modes: TA for text+audio and TIA for text+image+audio. Choose based on your needs for identity control and motion control.
For better detail, prefer 720p. For faster iteration, start with fewer steps and 480p, then scale up.
generation:
  frames: 97
  scale_a: 2.0
  scale_t: 7.5
  mode: "TIA"
  height: 720
  width: 1280
diffusion:
  timesteps:
    sampling:
      steps: 50
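As a quick sanity check, here is a minimal sketch of reading such a config with PyYAML and computing the implied clip length; the file name generate.yaml is assumed, and the 25 FPS constant comes from the surrounding text rather than the config itself.

import yaml  # pip install pyyaml

with open("generate.yaml") as f:   # assumed file name for the config above
    cfg = yaml.safe_load(f)

gen = cfg["generation"]
fps = 25                            # checkpoints are trained at 25 FPS
duration_s = gen["frames"] / fps    # 97 / 25 = 3.88 seconds
steps = cfg["diffusion"]["timesteps"]["sampling"]["steps"]

print(f"{gen['mode']} clip: {gen['frames']} frames ({duration_s:.2f} s) "
      f"at {gen['width']}x{gen['height']}, scale_t={gen['scale_t']}, "
      f"scale_a={gen['scale_a']}, {steps} sampling steps")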
Start at lower resolution for drafts. Increase steps and resolution for final output. Keep audio clean, or separate vocals if needed (a quick conversion sketch follows these tips).
Keep prompts specific. Provide a clear reference image for subject consistency.
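One simple way to keep the audio input clean and uniform is to convert the source clip to mono WAV at a fixed sample rate with ffmpeg; the 16 kHz target below is an assumption, so check the repository for the rate the audio encoder actually expects.

import subprocess

def prepare_audio(src, dst="audio_16k.wav", sample_rate=16000):
    # Convert any input clip to mono WAV at a fixed sample rate using ffmpeg.
    # The 16 kHz default is an assumption; adjust to whatever the audio encoder expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst],
        check=True,
    )
    return dst

prepare_audio("raw_interview.mp3")  # hypothetical input file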
Key ideas include minimal-invasive image injection, focus-by-predicting for audio regions, progressive training, and time-adaptive classifier-free guidance.
Subjects: cs.CV and cs.MM. Example scripts support TA and TIA.
Go to the installation page and prepare your text, image, and audio inputs.
Create people-focused short videos guided by prompts and audio. Keep identity consistent across shots.
Study audio-driven motion and prompt control. Compare settings and observe the effect on sync and identity.
Try prompts and references quickly at lower resolution. Scale to 720p for final results.
Align gestures or facial regions with audio rhythm or speech using TA or TIA modes.
Use the same reference image and prompt variants to generate consistent multi-shot sequences.
Compare different audio clips with the same prompt to study gesture and lip sync.
The current checkpoints are trained on 97 frames at 25 FPS. Longer outputs may lose quality until checkpoints trained on longer sequences are released.
480p and 720p are supported. 720p usually gives better quality with more compute.
Yes. FSDP with sequence parallelism is supported for multi-GPU setups.
Follow the test_case.json format. Provide a clear text prompt, a reference image if needed, and audio for TA/TIA.
Yes. Time-adaptive classifier-free guidance allows different weights across denoising steps.
Prompt specificity, clarity of reference images, and audio quality strongly influence generation. Guidance scales control emphasis on text vs audio.
Longer videos and higher resolution require more compute. Consider batch testing prompts at 480p before final 720p renders (see the draft-then-final sketch below).
Audio-only motion cues may focus on facial regions. The focus-by-predicting strategy helps guide attention without explicit landmarks.
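Following the batch-testing tip above, here is a sketch of a draft-then-final loop: render each candidate prompt at 480p with fewer steps, review, then rerun the chosen prompt at 720p with the full step count. The run_inference stub and the 832x480 draft size are assumptions; wire it to however you actually invoke the model.

draft_cfg = {"height": 480, "width": 832, "frames": 97, "steps": 30}    # assumed 480p draft size
final_cfg = {"height": 720, "width": 1280, "frames": 97, "steps": 50}

prompts = [
    "A chef narrates while plating a dessert in a bright kitchen.",
    "A dancer moves to an upbeat track on a rooftop at sunset.",
]

def run_inference(prompt, cfg):
    # Placeholder: replace with however you invoke the model (script, CLI, or API).
    print(f"render {cfg['width']}x{cfg['height']}, {cfg['steps']} steps: {prompt}")

# Draft pass: cheap 480p renders to compare prompts quickly.
for p in prompts:
    run_inference(p, draft_cfg)

# Final pass: rerun the chosen prompt at 720p with the full step count.
best_prompt = prompts[0]   # chosen after reviewing the drafts
run_inference(best_prompt, final_cfg)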
A simple overview site summarizing methods, usage modes, and installation.