Set up a local environment to run HuMo for human-centric video generation guided by text, image, and audio. This guide mirrors the configuration shown on the homepage.
conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
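Optionally, confirm that the CUDA build of PyTorch and FlashAttention installed correctly before moving on. This quick check is not part of the official instructions, but it catches mismatched wheels early:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"

Both commands should run without import errors, and the first should print True on a machine whose driver supports the cu124 wheels.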
Prepare model weights locally. Typical components include a HuMo checkpoint, the Wan 2.1 VAE and text encoder, Whisper-large-v3 for audio, and an optional audio separator. Store them under a local weights directory.

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
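After the downloads finish, the weights directory should contain one folder per component; the exact files inside each folder come from the upstream repositories listed above.

./weights/
  HuMo/
  Wan2.1-T2V-1.3B/
  whisper-large-v3/
  audio_separator/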
Adjust generation settings in a YAML file. These defaults reflect common practice and match the homepage examples.

generation:
  frames: 97
  scale_a: 2.0
  scale_t: 7.5
  mode: "TIA"  # TA or TIA
  height: 720
  width: 1280

diffusion:
  timesteps:
    sampling:
      steps: 50

Run inference with the provided scripts once the weights and config are in place: infer_ta.sh for text- and audio-guided (TA) generation, or infer_tia.sh for text-, image-, and audio-guided (TIA) generation.

bash infer_ta.sh
bash infer_tia.sh
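Before launching either script, you can optionally check that an edited YAML parses cleanly, which catches indentation mistakes before a long generation run. The file name generate.yaml below is a placeholder for whichever config file the inference scripts read, and the check assumes PyYAML is available in the environment:

python -c "import yaml; cfg = yaml.safe_load(open('generate.yaml')); print(cfg['generation'], cfg['diffusion'])"

The printed dictionaries should match the frames, guidance scales, mode, resolution, and sampling steps you set above.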