A unified framework for generating human videos from text, image, and audio. HUMO AI maintains subject identity, follows prompts, and aligns motion with sound using a progressive training strategy and time-adaptive guidance.
Demonstration video of HUMO AI multimodal-conditioned human video generation.
HUMO AI focuses on people in motion. It blends text, reference images, and audio into a controlled generation process designed for prompt following, identity consistency, and audio-visual sync.
Generate people-centric videos that maintain identity and appearance across frames and scenes.
Balance text, image, and audio inputs for prompt following, subject consistency, and motion sync.
Adjust frames, resolution, and guidance scales to fit creative and technical goals.
Minimal-invasive image injection preserves the subject’s identity while retaining the model’s core generation and prompt-following abilities.
Alongside audio cross-attention, a focus-by-predicting strategy helps the model associate audio with facial areas to improve sync.
Time-adaptive classifier-free guidance lets the weight on each condition change across denoising steps, balancing text, image, and audio at different stages of sampling.
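As a rough illustration of how time-adaptive guidance can work, the sketch below keeps separate per-step weights for the text and audio terms and combines three denoiser passes. The denoise callable, the three-pass composition, and the schedule values are illustrative assumptions, not HUMO's exact formulation.

import numpy as np

def guidance_schedule(num_steps, start, end):
    # Linearly interpolate a guidance weight across denoising steps (assumed schedule shape).
    return np.linspace(start, end, num_steps)

def time_adaptive_cfg(denoise, x_t, step, text_cond, audio_cond, scale_t_sched, scale_a_sched):
    # Three passes: unconditional, text-only, and text + audio (hypothetical decomposition).
    eps_uncond = denoise(x_t, step, text=None, audio=None)
    eps_text = denoise(x_t, step, text=text_cond, audio=None)
    eps_full = denoise(x_t, step, text=text_cond, audio=audio_cond)
    w_t = scale_t_sched[step]   # text weight at this step
    w_a = scale_a_sched[step]   # audio weight at this step
    return (eps_uncond
            + w_t * (eps_text - eps_uncond)   # text guidance term
            + w_a * (eps_full - eps_text))    # audio guidance term on top of text

# Illustrative schedules over 50 sampling steps, centered on the config values shown later.
scale_t_sched = guidance_schedule(50, 7.5, 7.5)   # constant text weight
scale_a_sched = guidance_schedule(50, 1.0, 3.0)   # audio weight changing across steps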
Multimodal processing overview
Customize appearance using a reference image and guide it with a text prompt for scene, clothing, or actions.
Generate audio-synced motion driven by speech or music, with a text description setting the content and scene.
Combine identity control from images with prompt control and audio-driven motion for the most fine-grained output.
Typical setup: 97 frames at 25 FPS, 720p for quality. Adjust guidance scales for text and audio to balance prompt adherence and motion sync.
Prepare text, reference images, and audio as described in test_case.json; a rough sketch of the input layout follows this list.
For videos longer than 97 frames, expect reduced quality until checkpoints trained on longer sequences are released.
Multi-GPU inference is supported using FSDP and sequence parallelism.
480p and 720p are available. 720p usually gives better detail.
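The exact input schema lives in the repository's test_case.json; the sketch below only illustrates the general shape of one text+image+audio (TIA) case and writes it to disk. The field names (prompt, ref_img, audio) and the file paths are assumptions for illustration.

import json

# Hypothetical structure; check the repository's test_case.json for the actual field names.
cases = {
    "case_1": {
        "prompt": "A woman in a red coat sings on a rainy street at night.",
        "ref_img": "assets/ref/woman_red_coat.png",   # omit the reference image for TA mode
        "audio": "assets/audio/singing_clip.wav",
    }
}

with open("test_case.json", "w") as f:
    json.dump(cases, f, indent=2)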
Configure HUMO AI with a simple YAML file to set frame count, resolution, and guidance weights.
Frames, height, width, and timesteps shape output length and quality. Guidance scales balance prompt adherence and motion sync.
Modes: TA for text+audio and TIA for text+image+audio. Choose based on your needs for identity control and motion control.
For better detail, prefer 720p. For faster iteration, start with fewer steps and 480p, then scale up.
generation:
  frames: 97
  scale_a: 2.0
  scale_t: 7.5
  mode: "TIA"
  height: 720
  width: 1280
diffusion:
  timesteps:
    sampling:
      steps: 50
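As a quick sanity check, here is a minimal sketch of reading such a config with PyYAML and computing the implied clip length; the file name generate.yaml is assumed, and the 25 FPS constant comes from the surrounding text rather than the config itself.

import yaml  # pip install pyyaml

with open("generate.yaml") as f:   # assumed file name for the config above
    cfg = yaml.safe_load(f)

gen = cfg["generation"]
fps = 25                            # checkpoints are trained at 25 FPS
duration_s = gen["frames"] / fps    # 97 / 25 = 3.88 seconds
steps = cfg["diffusion"]["timesteps"]["sampling"]["steps"]

print(f"{gen['mode']} clip: {gen['frames']} frames ({duration_s:.2f} s) "
      f"at {gen['width']}x{gen['height']}, scale_t={gen['scale_t']}, "
      f"scale_a={gen['scale_a']}, {steps} sampling steps")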
Start at lower resolution for drafts. Increase steps and resolution for final output. Keep audio clean, or separate vocals if needed (a quick conversion sketch follows these tips).
Keep prompts specific. Provide a clear reference image for subject consistency.
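One simple way to keep the audio input clean and uniform is to convert the source clip to mono WAV at a fixed sample rate with ffmpeg; the 16 kHz target below is an assumption, so check the repository for the rate the audio encoder actually expects.

import subprocess

def prepare_audio(src, dst="audio_16k.wav", sample_rate=16000):
    # Convert any input clip to mono WAV at a fixed sample rate using ffmpeg.
    # The 16 kHz default is an assumption; adjust to whatever the audio encoder expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst],
        check=True,
    )
    return dst

prepare_audio("raw_interview.mp3")  # hypothetical input file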
Key ideas include minimal-invasive image injection, focus-by-predicting for audio regions, progressive training, and time-adaptive classifier-free guidance.
Subjects: cs.CV and cs.MM. Example scripts support TA and TIA.
Go to the installation page and prepare your text, image, and audio inputs.
Create people-focused short videos guided by prompts and audio. Keep identity consistent across shots.
Study audio-driven motion and prompt control. Compare settings and observe the effect on sync and identity.
Try prompts and references quickly at lower resolution. Scale to 720p for final results.
Align gestures or facial regions with audio rhythm or speech using TA or TIA modes.
Use the same reference image and prompt variants to generate consistent multi-shot sequences.
Compare different audio clips with the same prompt to study gesture and lip sync.
The current checkpoints are trained on 97 frames at 25 FPS. Longer outputs may lose quality until checkpoints trained on longer sequences are released.
480p and 720p are supported. 720p usually gives better quality with more compute.
Yes. FSDP with sequence parallelism is supported for multi-GPU setups.
Follow the test_case.json format. Provide a clear text prompt, a reference image if needed, and audio for TA/TIA.
Yes. Time-adaptive classifier-free guidance allows different weights across denoising steps.
Prompt specificity, clarity of reference images, and audio quality strongly influence generation. Guidance scales control emphasis on text vs audio.
Longer videos and higher resolution require more compute. Consider batch testing prompts at 480p before final 720p renders (see the draft-then-final sketch below).
Audio-only motion cues may focus on facial regions. The focus-by-predicting strategy helps guide attention without explicit landmarks.
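Following the batch-testing tip above, here is a sketch of a draft-then-final loop: render each candidate prompt at 480p with fewer steps, review, then rerun the chosen prompt at 720p with the full step count. The run_inference stub and the 832x480 draft size are assumptions; wire it to however you actually invoke the model.

draft_cfg = {"height": 480, "width": 832, "frames": 97, "steps": 30}    # assumed 480p draft size
final_cfg = {"height": 720, "width": 1280, "frames": 97, "steps": 50}

prompts = [
    "A chef narrates while plating a dessert in a bright kitchen.",
    "A dancer moves to an upbeat track on a rooftop at sunset.",
]

def run_inference(prompt, cfg):
    # Placeholder: replace with however you invoke the model (script, CLI, or API).
    print(f"render {cfg['width']}x{cfg['height']}, {cfg['steps']} steps: {prompt}")

# Draft pass: cheap 480p renders to compare prompts quickly.
for p in prompts:
    run_inference(p, draft_cfg)

# Final pass: rerun the chosen prompt at 720p with the full step count.
best_prompt = prompts[0]   # chosen after reviewing the drafts
run_inference(best_prompt, final_cfg)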
A simple overview site summarizing methods, usage modes, and installation.