
AI Video Avatar Setup

Complete setup for real-time AI video avatars with voice cloning and lip sync.

Features

  • Voice Cloning: Clone any voice from a short audio sample using StyleTTS2
  • Lip Sync: High-quality lip synchronization using MuseTalk V1.5
  • Real-Time Capable: RTF < 1 (faster than real-time) on RTX 3090+
  • One-Click Install: Automated setup script for cloud GPUs

Performance (RTX 3090)

Component            RTF        Speed
StyleTTS2 (steps=5)  0.04       22x real-time
MuseTalk V1.5        0.25-0.67  1.5-4x real-time
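
RTF (real-time factor) is processing time divided by the duration of the generated media, so values below 1.0 mean faster than real time. A quick worked example using the numbers above (a sketch, not part of the scripts):

# RTF = processing_time / media_duration, so for a 10-second clip:
clip_seconds = 10.0
for name, rtf in [("StyleTTS2 (steps=5)", 0.04), ("MuseTalk V1.5 (worst case)", 0.67)]:
    print(f"{name}: roughly {clip_seconds * rtf:.1f}s to process {clip_seconds:.0f}s of output")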

Requirements

  • NVIDIA GPU with CUDA 12.x (tested on RTX 3090, V100)
  • Ubuntu 22.04+ or similar
  • Python 3.10+
  • ~20GB disk space for models
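
A quick way to sanity-check these requirements from Python (assumes PyTorch is already installed, e.g. after running install.sh):

import shutil
import sys
import torch

print("Python:", sys.version.split()[0])                        # want 3.10+
print("CUDA available:", torch.cuda.is_available())             # want True
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA version:", torch.version.cuda)                  # want 12.x
print("Free disk (GB):", shutil.disk_usage(".").free // 2**30)  # want roughly 20+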

Quick Start

# Clone and install
git clone https://github.com/yourusername/ai-video-setup.git
cd ai-video-setup
chmod +x install.sh
./install.sh

# Generate audio with cloned voice
python scripts/generate_audio.py --text "Hello world" --voice voice_ref.wav -o output.wav

# Run lip sync
python scripts/run_lipsync.py --video avatar.mp4 --audio output.wav -o ./output

# Or run the full pipeline
python scripts/full_pipeline.py --youtube-url "https://..." --text "Your text" -o ./output

Critical Package Versions

These versions are required and tested to work together:

accelerate==0.25.0
diffusers==0.21.0
huggingface-hub==0.25.0

Warning: Newer versions of these packages cause a "cannot import clear_device_cache" error (see Troubleshooting).
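
To confirm that the pinned versions are the ones actually installed, a small check using importlib.metadata (package names as pinned above):

from importlib.metadata import version

pins = {"accelerate": "0.25.0", "diffusers": "0.21.0", "huggingface-hub": "0.25.0"}
for pkg, want in pins.items():
    have = version(pkg)
    print(f"{pkg}=={have}", "ok" if have == want else f"MISMATCH (expected {want})")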

Scripts

Script                        Description
install.sh                    Complete installation (PyTorch, MuseTalk, StyleTTS2)
scripts/generate_audio.py     Generate audio with voice cloning
scripts/run_lipsync.py        Run lip sync on video
scripts/extract_voice_ref.py  Extract voice reference from YouTube/video
scripts/full_pipeline.py      Complete pipeline (YouTube -> lip sync video)
scripts/realtime_avatar.py    Real-time avatar with pre-loaded models

Video Input Recommendations

For best lip sync results:

  • Duration: 15-30 seconds (more frames = better variety)
  • Resolution: 640x360 to 720p (larger is slower, not better)
  • FPS: 24-30 fps
  • Content: Face centered, good lighting, neutral expression
  • Format: MP4 with H.264 codec

Preprocessing Example

# Convert 4K video to optimal format
ffmpeg -i input_4k.mp4 -vf "scale=640:-2,fps=24" -c:a copy avatar.mp4
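
To check that a clip matches the recommendations above, you can inspect it with ffprobe; a minimal sketch (requires ffprobe on PATH, and avatar.mp4 is just an example path):

import json
import subprocess

# Ask ffprobe for the video stream's resolution, frame rate and duration
probe = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "stream=width,height,avg_frame_rate,duration",
     "-of", "json", "avatar.mp4"],
    capture_output=True, text=True, check=True)
stream = json.loads(probe.stdout)["streams"][0]

num, den = (int(x) for x in stream["avg_frame_rate"].split("/"))
print("resolution:", f"{stream['width']}x{stream['height']}")  # aim for 640x360 to 720p
print("fps:", round(num / den, 2))                             # aim for 24-30
print("duration (s):", round(float(stream["duration"]), 1))    # aim for 15-30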

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Input Text    │────▢│   StyleTTS2     │────▢│   Audio WAV     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  (Voice Clone)  β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                      β”‚
β”‚  Avatar Video   │───────────────────────────────────────
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                      β”‚
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
                        β”‚  MuseTalk V1.5  β”‚β—€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚   (Lip Sync)    β”‚
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚  Output Video   β”‚
                        β”‚  (with audio)   β”‚
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
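
The same flow can be driven from Python by chaining the two scripts from the Quick Start (arguments are the example values used there):

import subprocess

# Text -> cloned-voice audio (StyleTTS2)
subprocess.run(["python", "scripts/generate_audio.py",
                "--text", "Hello world", "--voice", "voice_ref.wav", "-o", "output.wav"],
               check=True)

# Avatar video + generated audio -> lip-synced video (MuseTalk V1.5)
subprocess.run(["python", "scripts/run_lipsync.py",
                "--video", "avatar.mp4", "--audio", "output.wav", "-o", "./output"],
               check=True)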

Troubleshooting

"cannot import clear_device_cache"

pip install accelerate==0.25.0 diffusers==0.21.0 huggingface-hub==0.25.0

PyTorch 2.6 pickle error

PyTorch 2.6 changed the default of torch.load's weights_only parameter to True, which rejects checkpoints that contain pickled Python objects. The scripts include a patch for this; if you still see pickle errors, make sure it is applied:

import torch

# PyTorch 2.6 defaults weights_only=True; restore the old behaviour so
# legacy checkpoints load. Only do this for checkpoints you trust.
original_load = torch.load

def patched_load(*args, **kwargs):
    kwargs['weights_only'] = False
    return original_load(*args, **kwargs)

torch.load = patched_load

NLTK punkt error

python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

License

This project integrates:

  • StyleTTS2 (voice cloning / TTS)
  • MuseTalk V1.5 (lip sync)

Each component remains under its own upstream license; see the respective repositories for terms.

Acknowledgments

  • StyleTTS2 team for the amazing TTS model
  • TMElyralab for MuseTalk V1.5
  • Tested on vast.ai GPU instances