# LAM Audio-to-Expression (ONNX)
ONNX export and quantization of LAM-Audio2Expression by aigc3d/3DAIGC (SIGGRAPH 2025).
Converts raw 16kHz audio into 52 ARKit blendshapes for real-time facial animation and lip sync. Built on a fine-tuned Wav2Vec2 architecture, it runs entirely in the browser via ONNX Runtime Web.
## Attribution
- Original model: LAM-Audio2Expression by aigc3d
- Paper: LAM: Large Avatar Model for One-Shot Animatable Gaussian Head (SIGGRAPH 2025)
- Architecture: Wav2Vec2-base fine-tuned for audio-to-expression with dual-head output (A2E + CTC ASR)
- License: Apache 2.0 (inherited from upstream)
- Export by: omote-ai (quantization, external data format, browser optimization)
## Variants
| Variant | Graph | Weights | Total | Use Case |
|---|---|---|---|---|
| fp16 | 411 KB | 192 MB | ~192 MB | Desktop WebGPU (recommended default) |
| fp32 | 301 KB | 384 MB | ~384 MB | Reference / max quality |
| int8 | 541 KB | 96 MB | ~97 MB | Mobile / low-bandwidth (WASM only) |
| single-file | – | – | 384 MB | Legacy backwards compatibility (fp32) |
All variants use external data format (small .onnx graph + large .onnx.data weights file), which enables:
- iOS URL pass-through (ORT loads weights directly into WASM memory, bypassing JS heap)
- Efficient caching (graph and weights cached separately)
- Streaming weight loading
Note: The int8 variant uses dynamic quantization (QInt8 weights, fp32 activations). It is recommended for WASM/CPU only; WebGPU has limited int8 operator support and may silently fall back to CPU.
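Because the graph and weights are separate files, the large `.onnx.data` file can be cached on its own and handed to ONNX Runtime Web as raw bytes. A minimal sketch, assuming the fp16 variant on desktop; the Cache API wrapper is illustrative and not part of any SDK:

```typescript
import * as ort from 'onnxruntime-web/webgpu';

const BASE = 'https://huggingface.co/omote-ai/lam-a2e/resolve/main/fp16';

// Fetch the weights through the Cache API so repeat visits skip the ~192 MB download.
async function cachedBytes(url: string): Promise<Uint8Array> {
  const cache = await caches.open('lam-a2e');
  let res = await cache.match(url);
  if (!res) {
    await cache.add(url);
    res = (await cache.match(url))!;
  }
  return new Uint8Array(await res.arrayBuffer());
}

const session = await ort.InferenceSession.create(`${BASE}/model.onnx`, {
  executionProviders: ['webgpu'],
  // Hand the cached weight bytes to ORT instead of letting it re-download them.
  externalData: [{ path: 'model.onnx.data', data: await cachedBytes(`${BASE}/model.onnx.data`) }],
});
```

Note that passing bytes goes through the JS heap; on iOS the plain URL form shown in the Quick Start below is what enables pass-through loading.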
## Quick Start
### TypeScript (Omote SDK)

```typescript
import { createA2E } from '@omote/core';

// Auto-detects platform: WebGPU (desktop) or wav2arkit_cpu (Safari/iOS)
const a2e = createA2E();
await a2e.load();

const { blendshapes } = await a2e.infer(audioSamples); // Float32Array[16000]
// blendshapes: Float32Array[], 30 frames × 52 ARKit weights
```
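The 30 returned frames cover the one second of input audio, so they can be played back at 30 fps. A minimal sketch; the `applyToAvatar` callback is hypothetical and stands in for whatever drives your rig's blendshapes:

```typescript
// Play one second of inferred frames at 30 fps.
// `applyToAvatar` is a hypothetical callback that receives 52 ARKit weights
// in the index order listed under "ARKit Blendshapes (52)" below.
function playFrames(
  blendshapes: Float32Array[],                     // 30 frames × 52 weights
  applyToAvatar: (weights: Float32Array) => void,
): void {
  let frame = 0;
  const timer = setInterval(() => {
    if (frame >= blendshapes.length) {
      clearInterval(timer);
      return;
    }
    applyToAvatar(blendshapes[frame++]);
  }, 1000 / 30);
}
```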
### Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("fp16/model.onnx", providers=["CPUExecutionProvider"])

audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second at 16kHz
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0  # neutral identity

outputs = session.run(None, {"audio": audio, "identity": identity})
# outputs[0]: asr_logits [1, 49, 32]
# outputs[1]: blendshapes [1, 30, 52]
```
### Browser (ONNX Runtime Web)

```typescript
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create(
  'https://huggingface.co/omote-ai/lam-a2e/resolve/main/fp16/model.onnx',
  {
    executionProviders: ['webgpu'],
    externalData: [{
      path: 'model.onnx.data',
      data: 'https://huggingface.co/omote-ai/lam-a2e/resolve/main/fp16/model.onnx.data',
    }],
  }
);

const audio = new ort.Tensor('float32', new Float32Array(16000), [1, 16000]);

const identityData = new Float32Array(12);
identityData[0] = 1; // neutral identity (see Inputs table)
const identity = new ort.Tensor('float32', identityData, [1, 12]);

const results = await session.run({ audio, identity });
const blendshapes = results.blendshapes; // [1, 30, 52]
```
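The model expects 16 kHz mono samples. In the browser, an `OfflineAudioContext` can resample decoded audio to that rate before it is wrapped in the `audio` tensor. A minimal sketch; the audio URL is a placeholder:

```typescript
// Decode an audio file and resample it to the 16 kHz mono format the model expects.
async function loadAudio16k(url: string): Promise<Float32Array> {
  const bytes = await (await fetch(url)).arrayBuffer();
  const decoded = await new AudioContext().decodeAudioData(bytes);

  // Render the decoded buffer through a mono 16 kHz OfflineAudioContext to resample/downmix.
  const offline = new OfflineAudioContext(1, Math.ceil(decoded.duration * 16000), 16000);
  const src = offline.createBufferSource();
  src.buffer = decoded;
  src.connect(offline.destination);
  src.start();

  const resampled = await offline.startRendering();
  return resampled.getChannelData(0); // Float32Array at 16 kHz
}
```

The first 16000 samples of the returned array correspond to one second of audio and can be used directly as the tensor data above.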
## Input / Output Specification

### Inputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `audio` | `[batch, samples]` | float32 | Raw audio at 16kHz. Use 16000 samples (1 second) for 30fps output. |
| `identity` | `[batch, 12]` | float32 | One-hot identity vector. Use `[1, 0, ..., 0]` for neutral. |
### Outputs

| Name | Shape | Type | Description |
|---|---|---|---|
| `blendshapes` | `[batch, time_a2e, 52]` | float32 | ARKit blendshape weights (30fps). Values in [0, 1]. |
| `asr_logits` | `[batch, time_asr, 32]` | float32 | CTC ASR logits (50fps). Small head, ~24K params. |
## ARKit Blendshapes (52)
Output order matches the Apple ARKit standard:
```
 0: eyeBlinkLeft         1: eyeLookDownLeft      2: eyeLookInLeft
 3: eyeLookOutLeft       4: eyeLookUpLeft        5: eyeSquintLeft
 6: eyeWideLeft          7: eyeBlinkRight        8: eyeLookDownRight
 9: eyeLookInRight      10: eyeLookOutRight     11: eyeLookUpRight
12: eyeSquintRight      13: eyeWideRight        14: jawForward
15: jawLeft             16: jawRight            17: jawOpen
18: mouthClose          19: mouthFunnel         20: mouthPucker
21: mouthLeft           22: mouthRight          23: mouthSmileLeft
24: mouthSmileRight     25: mouthFrownLeft      26: mouthFrownRight
27: mouthDimpleLeft     28: mouthDimpleRight    29: mouthStretchLeft
30: mouthStretchRight   31: mouthRollLower      32: mouthRollUpper
33: mouthShrugLower     34: mouthShrugUpper     35: mouthPressLeft
36: mouthPressRight     37: mouthLowerDownLeft  38: mouthLowerDownRight
39: mouthUpperUpLeft    40: mouthUpperUpRight   41: browDownLeft
42: browDownRight       43: browInnerUp         44: browOuterUpLeft
45: browOuterUpRight    46: cheekPuff           47: cheekSquintLeft
48: cheekSquintRight    49: noseSneerLeft       50: noseSneerRight
51: tongueOut
```
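When working with the raw ONNX output rather than the SDK, the `[1, 30, 52]` tensor is flat, so a given frame's weight for a named blendshape sits at `frame * 52 + index`, with indices as listed above. A minimal sketch reading `jawOpen` from the `results` tensor of the browser example:

```typescript
// Read jawOpen (index 17) for every frame of a [1, 30, 52] output tensor.
const JAW_OPEN = 17;
const data = results.blendshapes.data as Float32Array; // flat [1, 30, 52]
const [, frames, weights] = results.blendshapes.dims;  // 30, 52
for (let f = 0; f < frames; f++) {
  const jawOpen = data[f * weights + JAW_OPEN];
  console.log(`frame ${f}: jawOpen = ${jawOpen.toFixed(3)}`);
}
```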
## Platform Recommendations
| Platform | Variant | Backend | Notes |
|---|---|---|---|
| Chrome/Edge (Desktop) | fp16 | WebGPU | Recommended default. 50% smaller, same quality. |
| Chrome (Android) | fp16 | WebGPU | Same as desktop. |
| Firefox | fp32 or fp16 | WASM | WebGPU behind flag. |
| Safari (macOS/iOS) | Use wav2arkit_cpu | WASM | LAM's graph optimization exceeds iOS memory limits. |
| Low-bandwidth | int8 | WASM | 75% smaller. WASM only; WebGPU int8 support is limited. |
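A minimal variant/backend picker following the table above; the user-agent sniff, the `lowBandwidth` flag, and the wav2arkit_cpu marker are illustrative heuristics, not SDK behavior:

```typescript
// Pick a variant and execution provider per the platform table above.
type Choice = { variant: 'fp16' | 'fp32' | 'int8'; ep: 'webgpu' | 'wasm' };

function pickVariant(lowBandwidth = false): Choice | 'use-wav2arkit_cpu' {
  // Safari/iOS: the LAM graph exceeds iOS memory limits, so use wav2arkit_cpu instead.
  const isSafari = /^((?!chrome|android).)*safari/i.test(navigator.userAgent);
  if (isSafari) return 'use-wav2arkit_cpu';

  if (lowBandwidth) return { variant: 'int8', ep: 'wasm' };         // 75% smaller, WASM only
  if ('gpu' in navigator) return { variant: 'fp16', ep: 'webgpu' }; // Chrome/Edge desktop, Chrome Android
  return { variant: 'fp32', ep: 'wasm' };                           // e.g. Firefox without the WebGPU flag
}
```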
## Model Details
| Property | Value |
|---|---|
| Parameters | 100.5M |
| Architecture | Wav2Vec2-base + dual-head (A2E + CTC ASR) |
| ONNX Opset | 14 |
| Sample Rate | 16kHz |
| Output FPS | 30 (A2E) / 50 (ASR) |
| Blendshape Standard | Apple ARKit (52) |
| Training Data | Not disclosed by upstream |
| Min ORT Version | 1.17.0 (external data support) |
## License

Apache 2.0, inherited from aigc3d/LAM_Audio2Expression.
This repository contains only ONNX export artifacts (quantization, external data format conversion). No model weights were modified beyond precision conversion. All credit for the model architecture and training belongs to the original authors.