# AI Video Avatar Setup

Complete setup for real-time AI video avatars with voice cloning and lip sync.

## Features

- **Voice Cloning**: Clone any voice from a short audio sample using StyleTTS2
- **Lip Sync**: High-quality lip synchronization using MuseTalk V1.5
- **Real-Time Capable**: RTF < 1 (faster than real-time) on RTX 3090+
- **One-Click Install**: Automated setup script for cloud GPUs
## Performance (RTX 3090)

| Component | RTF | Speed |
|---|---|---|
| StyleTTS2 (steps=5) | 0.04 | 22x real-time |
| MuseTalk V1.5 | 0.25-0.67 | 1.5-4x real-time |
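RTF is the real-time factor: processing time divided by output audio duration, so values below 1 mean generation is faster than playback. A minimal illustration (not part of the repo's scripts):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: < 1 means faster than real time."""
    return processing_seconds / audio_seconds

# e.g. synthesizing 10 s of speech in 0.4 s:
print(rtf(0.4, 10.0))  # 0.04
```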
## Requirements

- NVIDIA GPU with CUDA 12.x (tested on RTX 3090, V100)
- Ubuntu 22.04+ or similar
- Python 3.10+
- ~20 GB disk space for models
## Quick Start

```bash
# Clone and install
git clone https://github.com/yourusername/ai-video-setup.git
cd ai-video-setup
chmod +x install.sh
./install.sh

# Generate audio with cloned voice
python scripts/generate_audio.py --text "Hello world" --voice voice_ref.wav -o output.wav

# Run lip sync
python scripts/run_lipsync.py --video avatar.mp4 --audio output.wav -o ./output

# Or run the full pipeline
python scripts/full_pipeline.py --youtube-url "https://..." --text "Your text" -o ./output
```
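Once `install.sh` finishes, a quick sanity check that PyTorch can see the GPU (a minimal sketch, assuming the script installed a CUDA-enabled PyTorch build):

```python
import torch

# Expect True, your GPU's name, and a 12.x CUDA runtime version
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)
```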
## Critical Package Versions

These versions are required and tested to work together:

```
accelerate==0.25.0
diffusers==0.21.0
huggingface-hub==0.25.0
```

**Warning:** Newer versions cause a `cannot import clear_device_cache` error.
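To confirm the pins at runtime, a small check (a hypothetical helper, not part of the repo's scripts):

```python
from importlib.metadata import version

# Compare installed versions against the tested pins above
pins = {"accelerate": "0.25.0", "diffusers": "0.21.0", "huggingface-hub": "0.25.0"}
for pkg, want in pins.items():
    have = version(pkg)
    assert have == want, f"{pkg}: expected {want}, got {have}"
print("all pinned versions OK")
```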
## Scripts

| Script | Description |
|---|---|
| `install.sh` | Complete installation (PyTorch, MuseTalk, StyleTTS2) |
| `scripts/generate_audio.py` | Generate audio with voice cloning |
| `scripts/run_lipsync.py` | Run lip sync on video |
| `scripts/extract_voice_ref.py` | Extract voice reference from YouTube/video |
| `scripts/full_pipeline.py` | Complete pipeline (YouTube -> lip sync video) |
| `scripts/realtime_avatar.py` | Real-time avatar with pre-loaded models |
## Video Input Recommendations

For best lip sync results:

- **Duration**: 15-30 seconds (more frames = better variety)
- **Resolution**: 640x360 to 720p (larger is slower, not better)
- **FPS**: 24-30 fps
- **Content**: Face centered, good lighting, neutral expression
- **Format**: MP4 with H.264 codec
### Preprocessing Example

```bash
# Convert 4K video to optimal format
ffmpeg -i input_4k.mp4 -vf "scale=640:-2,fps=24" -c:a copy avatar.mp4
```
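To check whether a clip already matches the recommendations above, you can read its properties with ffprobe (bundled with FFmpeg); a small sketch:

```python
import json
import subprocess

def probe(path: str) -> dict:
    """Width, height, and fps of the first video stream, via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height,r_frame_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True).stdout
    s = json.loads(out)["streams"][0]
    num, den = map(int, s["r_frame_rate"].split("/"))
    return {"width": s["width"], "height": s["height"], "fps": num / den}

print(probe("avatar.mp4"))  # aim for <= 720p and 24-30 fps
```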
## Architecture

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  Input Text   │────▶│   StyleTTS2   │────▶│   Audio WAV   │
└───────────────┘     │ (Voice Clone) │     └───────┬───────┘
                      └───────────────┘             │
┌───────────────┐                                   │
│ Avatar Video  │───────────────────────────────────┤
└───────────────┘                                   │
                      ┌───────────────┐             │
                      │ MuseTalk V1.5 │◀────────────┘
                      │  (Lip Sync)   │
                      └───────┬───────┘
                              │
                      ┌───────▼───────┐
                      │ Output Video  │
                      │ (with audio)  │
                      └───────────────┘
```
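In script terms, the diagram amounts to chaining the two repo scripts; a minimal sketch using the same flags as in Quick Start (the intermediate filename `speech.wav` is arbitrary):

```python
import subprocess

def run_pipeline(text: str, voice_ref: str, avatar: str, out_dir: str) -> None:
    # Step 1: text -> cloned-voice audio (StyleTTS2)
    subprocess.run(["python", "scripts/generate_audio.py",
                    "--text", text, "--voice", voice_ref, "-o", "speech.wav"],
                   check=True)
    # Step 2: audio + avatar video -> lip-synced video (MuseTalk V1.5)
    subprocess.run(["python", "scripts/run_lipsync.py",
                    "--video", avatar, "--audio", "speech.wav", "-o", out_dir],
                   check=True)

run_pipeline("Hello world", "voice_ref.wav", "avatar.mp4", "./output")
```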
## Troubleshooting

### `cannot import clear_device_cache`

```bash
pip install accelerate==0.25.0 diffusers==0.21.0 huggingface-hub==0.25.0
```
### PyTorch 2.6 pickle error

PyTorch 2.6 changed the default of `torch.load` to `weights_only=True`, which breaks loading checkpoints that contain pickled Python objects. The scripts include a fix for the `weights_only` parameter; if you see pickle errors, ensure the patch is applied:

```python
import torch

# Force weights_only=False so checkpoints with pickled objects still load
original_load = torch.load

def patched_load(*args, **kwargs):
    kwargs['weights_only'] = False
    return original_load(*args, **kwargs)

torch.load = patched_load
```
### NLTK punkt error

```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
```
## License

This project integrates StyleTTS2 and MuseTalk V1.5; see each project's repository for its license terms.
## Acknowledgments

- StyleTTS2 team for the amazing TTS model
- TMElyralab for MuseTalk V1.5
- Tested on vast.ai GPU instances