Install HUMO AI

Set up a local environment to run HUMO AI for human-centric video generation guided by text, image, and audio. This guide mirrors the configuration shown on the homepage.

Requirements

  • Python 3.11 (Conda recommended)
  • PyTorch 2.5.1 with CUDA 12.4, torchvision 0.20.1, torchaudio 2.5.1
  • flash_attn 2.6.3
  • ffmpeg (via conda-forge)
  • GPU with sufficient VRAM; multi-GPU inference is supported through FSDP plus sequence parallelism

Environment setup

conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
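
Once the commands above have run, a quick check that the pinned versions actually landed can save a failed first run. The sketch below is not part of HuMo. It reads installed package metadata with the Python standard library and looks for ffmpeg on PATH. The names EXPECTED, installed_version, and check_environment are illustrative, and the lookup name flash_attn is assumed to resolve to the pip package installed above.

```python
"""Sanity-check the humo environment (illustrative helper, not part of HuMo)."""
import shutil
from importlib import metadata

# Pins copied from the Requirements list above.
EXPECTED = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "flash_attn": "2.6.3",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED, lookup=installed_version, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, wanted in expected.items():
        found = lookup(name)
        if found is None:
            problems.append(f"{name} is not installed (expected {wanted})")
        # CUDA wheels report versions such as "2.5.1+cu124", so compare the part before "+".
        elif found.split("+")[0] != wanted:
            problems.append(f"{name} is {found}, expected {wanted}")
    if which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("\n".join(issues) if issues else "Environment matches the pinned versions.")
```

Run it inside the activated humo environment. It only compares version strings and does not confirm that CUDA is usable on the machine.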

Models

Download the model weights before running inference. Typical components are a HuMo checkpoint, the Wan 2.1 VAE and text encoder, Whisper-large-v3 for audio, and an optional audio separator. The commands below store each one in its own folder under ./weights.

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
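
Large downloads are easy to interrupt, so it can be worth confirming that each folder exists before launching a run. The following is a minimal sketch, assuming the folder names from the --local-dir arguments above. WEIGHT_DIRS and missing_weights are hypothetical names, and the check only detects missing or empty folders; it does not verify file integrity.

```python
"""Check that the downloaded weight folders exist (illustrative helper, not part of HuMo)."""
from pathlib import Path

# Folder names taken from the --local-dir arguments above.
WEIGHT_DIRS = [
    "Wan2.1-T2V-1.3B",
    "HuMo",
    "whisper-large-v3",
    "audio_separator",  # optional: only needed for vocal separation
]


def missing_weights(root="./weights", names=WEIGHT_DIRS):
    """Return the folder names that are absent or empty under root."""
    root = Path(root)
    missing = []
    for name in names:
        folder = root / name
        # Flag folders that do not exist or contain no entries at all.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_weights()
    print("Missing or empty:", ", ".join(gaps) if gaps else "none")
```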

Configure and run

Adjust generation settings in a YAML file. The values below match the homepage examples.

generation:
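  frames: 97        # number of frames to generate
  scale_a: 2.0      # audio guidance strength
  scale_t: 7.5      # text guidance strength
  mode: "TIA"       # "TA" = text + audio, "TIA" = text + image + audio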
  height: 720
  width: 1280

diffusion:
  timesteps:
    sampling:
      steps: 50     # number of denoising steps
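
A malformed setting tends to surface only after the models have loaded, so it can help to check values up front. The sketch below validates a dictionary shaped like the YAML above. validate_generation is a hypothetical helper, and its checks (positive sizes, mode restricted to TA or TIA) are assumptions about sensible inputs rather than rules taken from HuMo's code. Loading the file itself would need a YAML parser, which is left out to keep the example dependency-free.

```python
"""Validate generation settings before launching a run (illustrative, not part of HuMo)."""

# Same values as the YAML above, written as a plain dictionary.
CONFIG = {
    "generation": {
        "frames": 97,
        "scale_a": 2.0,
        "scale_t": 7.5,
        "mode": "TIA",
        "height": 720,
        "width": 1280,
    },
    "diffusion": {"timesteps": {"sampling": {"steps": 50}}},
}

VALID_MODES = {"TA", "TIA"}
MAX_RECOMMENDED_FRAMES = 97  # longer outputs are expected to lose quality (see Practical tips)


def validate_generation(config):
    """Return (errors, warnings) as two lists of strings."""
    gen = config["generation"]
    steps = config["diffusion"]["timesteps"]["sampling"]["steps"]
    errors, warnings = [], []
    if gen["mode"] not in VALID_MODES:
        errors.append(f"mode must be one of {sorted(VALID_MODES)}, got {gen['mode']!r}")
    for key in ("frames", "height", "width"):
        if not isinstance(gen[key], int) or gen[key] <= 0:
            errors.append(f"{key} must be a positive integer, got {gen[key]!r}")
    if not isinstance(steps, int) or steps <= 0:
        errors.append(f"steps must be a positive integer, got {steps!r}")
    if isinstance(gen["frames"], int) and gen["frames"] > MAX_RECOMMENDED_FRAMES:
        warnings.append(f"frames={gen['frames']} exceeds {MAX_RECOMMENDED_FRAMES}; expect lower quality")
    return errors, warnings


if __name__ == "__main__":
    errs, warns = validate_generation(CONFIG)
    print("errors:", errs)
    print("warnings:", warns)
```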

Inference scripts

  • Text + Audio: run your TA script (e.g., bash infer_ta.sh)
  • Text + Image + Audio: run your TIA script (e.g., bash infer_tia.sh)
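
If runs are launched from Python rather than the shell, the choice between the two scripts can be tied to the mode value. This is a rough sketch, assuming the example script names above sit in the working directory; build_command and run_inference are hypothetical names, and the paths will likely need adjusting to match your checkout.

```python
"""Pick the inference script for a mode (illustrative wrapper, not part of HuMo)."""
import subprocess

# Script names are the examples given above; adjust the paths to match your checkout.
SCRIPTS = {
    "TA": "infer_ta.sh",    # text + audio
    "TIA": "infer_tia.sh",  # text + image + audio
}


def build_command(mode):
    """Return the argv list for the given mode, or raise ValueError for unknown modes."""
    try:
        return ["bash", SCRIPTS[mode]]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SCRIPTS)}") from None


def run_inference(mode):
    """Run the script and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_command(mode), check=True)


if __name__ == "__main__":
    # Printing the command first is a cheap dry run before calling run_inference.
    print(build_command("TIA"))
```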

Practical tips

  • Keep audio clean; separate vocals if needed.
  • Start at 480p with fewer sampling steps to iterate quickly, then switch to 720p and about 50 steps for the final render.
  • Provide clear prompts and a representative reference image for subject consistency.
  • For outputs longer than 97 frames, expect a quality drop until checkpoints trained on longer videos are available.
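
The draft-then-final workflow from the tips can be captured as two presets applied to the same base config. The sketch below is one way to do it, not something HuMo ships. The final preset follows the 720p and 50-step guidance above, while the draft width (832) and step count (30) are assumptions chosen for illustration. PRESETS and apply_preset are hypothetical names.

```python
"""Draft and final presets for iterating on a generation (illustrative, not part of HuMo)."""
import copy

# The final preset follows the tip above (720p, about 50 steps).
# The draft width and step count are assumptions chosen for illustration.
PRESETS = {
    "draft": {"height": 480, "width": 832, "steps": 30},
    "final": {"height": 720, "width": 1280, "steps": 50},
}


def apply_preset(config, name):
    """Return a copy of config with the preset's resolution and step count applied."""
    preset = PRESETS[name]
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated["generation"]["height"] = preset["height"]
    updated["generation"]["width"] = preset["width"]
    updated["diffusion"]["timesteps"]["sampling"]["steps"] = preset["steps"]
    return updated


if __name__ == "__main__":
    base = {
        "generation": {"frames": 97, "mode": "TA", "height": 720, "width": 1280},
        "diffusion": {"timesteps": {"sampling": {"steps": 50}}},
    }
    print(apply_preset(base, "draft"))
```

Keeping prompts, audio, and reference image fixed while switching presets makes it easier to attribute differences to resolution and step count alone.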