# AI Video Avatar Setup

Complete setup for real-time AI video avatars with voice cloning and lip sync.

## Features

- **Voice Cloning**: Clone any voice from a short audio sample using StyleTTS2
- **Lip Sync**: High-quality lip synchronization using MuseTalk V1.5
- **Real-Time Capable**: RTF < 1 (faster than real-time) on RTX 3090+
- **One-Click Install**: Automated setup script for cloud GPUs
## Performance (RTX 3090)

| Component | RTF | Speed |
|---|---|---|
| StyleTTS2 (steps=5) | 0.04 | 22x real-time |
| MuseTalk V1.5 | 0.25-0.67 | 1.5-4x real-time |
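RTF is the real-time factor: processing time divided by output audio duration, so values below 1 mean generation is faster than playback. A minimal illustration (not part of the repo's scripts):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: < 1 means faster than real time."""
    return processing_seconds / audio_seconds

# e.g. synthesizing 10 s of speech in 0.4 s:
print(rtf(0.4, 10.0))  # 0.04
```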
## Requirements

- NVIDIA GPU with CUDA 12.x (tested on RTX 3090, V100)
- Ubuntu 22.04+ or similar
- Python 3.10+
- ~20 GB disk space for models
## Quick Start

```bash
# Clone and install
git clone https://github.com/yourusername/ai-video-setup.git
cd ai-video-setup
chmod +x install.sh
./install.sh

# Generate audio with cloned voice
python scripts/generate_audio.py --text "Hello world" --voice voice_ref.wav -o output.wav

# Run lip sync
python scripts/run_lipsync.py --video avatar.mp4 --audio output.wav -o ./output

# Or run the full pipeline
python scripts/full_pipeline.py --youtube-url "https://..." --text "Your text" -o ./output
```
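Once `install.sh` finishes, a quick sanity check that PyTorch can see the GPU (a minimal sketch, assuming the script installed a CUDA-enabled PyTorch build):

```python
import torch

# Expect True, your GPU's name, and a 12.x CUDA runtime version
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)
```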
## Critical Package Versions

These versions are required and tested to work together:

```
accelerate==0.25.0
diffusers==0.21.0
huggingface-hub==0.25.0
```

**Warning:** Newer versions cause a `cannot import clear_device_cache` error.
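To confirm the pins at runtime, a small check (a hypothetical helper, not part of the repo's scripts):

```python
from importlib.metadata import version

# Compare installed versions against the tested pins above
pins = {"accelerate": "0.25.0", "diffusers": "0.21.0", "huggingface-hub": "0.25.0"}
for pkg, want in pins.items():
    have = version(pkg)
    assert have == want, f"{pkg}: expected {want}, got {have}"
print("all pinned versions OK")
```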
## Scripts

| Script | Description |
|---|---|
| `install.sh` | Complete installation (PyTorch, MuseTalk, StyleTTS2) |
| `scripts/generate_audio.py` | Generate audio with voice cloning |
| `scripts/run_lipsync.py` | Run lip sync on video |
| `scripts/extract_voice_ref.py` | Extract voice reference from YouTube/video |
| `scripts/full_pipeline.py` | Complete pipeline (YouTube -> lip sync video) |
| `scripts/realtime_avatar.py` | Real-time avatar with pre-loaded models |
## Video Input Recommendations

For best lip sync results:

- **Duration**: 15-30 seconds (more frames = better variety)
- **Resolution**: 640x360 to 720p (larger is slower, not better)
- **FPS**: 24-30 fps
- **Content**: Face centered, good lighting, neutral expression
- **Format**: MP4 with H.264 codec
### Preprocessing Example

```bash
# Convert 4K video to optimal format
ffmpeg -i input_4k.mp4 -vf "scale=640:-2,fps=24" -c:a copy avatar.mp4
```
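To check whether a clip already matches the recommendations above, you can read its properties with ffprobe (bundled with FFmpeg); a small sketch:

```python
import json
import subprocess

def probe(path: str) -> dict:
    """Width, height, and fps of the first video stream, via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height,r_frame_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True).stdout
    s = json.loads(out)["streams"][0]
    num, den = map(int, s["r_frame_rate"].split("/"))
    return {"width": s["width"], "height": s["height"], "fps": num / den}

print(probe("avatar.mp4"))  # aim for <= 720p and 24-30 fps
```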
## Architecture

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  Input Text   │────▶│   StyleTTS2   │────▶│   Audio WAV   │
└───────────────┘     │ (Voice Clone) │     └───────┬───────┘
                      └───────────────┘             │
┌───────────────┐                                   │
│ Avatar Video  │───────────────────────────────────┤
└───────────────┘                                   │
                      ┌───────────────┐             │
                      │ MuseTalk V1.5 │◀────────────┘
                      │  (Lip Sync)   │
                      └───────┬───────┘
                              │
                      ┌───────▼───────┐
                      │ Output Video  │
                      │ (with audio)  │
                      └───────────────┘
```
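In script terms, the diagram amounts to chaining the two repo scripts; a minimal sketch using the same flags as in Quick Start (the intermediate filename `speech.wav` is arbitrary):

```python
import subprocess

def run_pipeline(text: str, voice_ref: str, avatar: str, out_dir: str) -> None:
    # Step 1: text -> cloned-voice audio (StyleTTS2)
    subprocess.run(["python", "scripts/generate_audio.py",
                    "--text", text, "--voice", voice_ref, "-o", "speech.wav"],
                   check=True)
    # Step 2: audio + avatar video -> lip-synced video (MuseTalk V1.5)
    subprocess.run(["python", "scripts/run_lipsync.py",
                    "--video", avatar, "--audio", "speech.wav", "-o", out_dir],
                   check=True)

run_pipeline("Hello world", "voice_ref.wav", "avatar.mp4", "./output")
```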
## Troubleshooting

### `cannot import clear_device_cache`

```bash
pip install accelerate==0.25.0 diffusers==0.21.0 huggingface-hub==0.25.0
```
### PyTorch 2.6 pickle error

PyTorch 2.6 changed the default of `torch.load` to `weights_only=True`, which breaks loading checkpoints that contain pickled Python objects. The scripts include a fix for the `weights_only` parameter; if you see pickle errors, ensure the patch is applied:

```python
import torch

# Force weights_only=False so checkpoints with pickled objects still load
original_load = torch.load

def patched_load(*args, **kwargs):
    kwargs['weights_only'] = False
    return original_load(*args, **kwargs)

torch.load = patched_load
```
### NLTK punkt error

```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
```
## License

This project integrates StyleTTS2 and MuseTalk V1.5; see each project's repository for its license terms.
## Acknowledgments

- StyleTTS2 team for the amazing TTS model
- TMElyralab for MuseTalk V1.5
- Tested on vast.ai GPU instances