---
license: apache-2.0
library_name: onnxruntime
tags:
- onnx
- wav2vec2-a2e
- audio-to-expression
- lip-sync
- blendshapes
- arkit
- wav2vec2
- avatar
- face-animation
- webgpu
- browser
- real-time
pipeline_tag: audio-classification
---
# LAM Audio-to-Expression (ONNX)
ONNX export and quantization of LAM-Audio2Expression by aigc3d/3DAIGC (SIGGRAPH 2025).

Converts raw 16 kHz audio into 52 ARKit blendshapes for real-time facial animation and lip sync. The model is a fine-tuned Wav2Vec2 architecture that runs entirely in the browser via ONNX Runtime Web (WebGPU or WASM).
## Attribution
| | |
|---|---|
| Original model | LAM-Audio2Expression by aigc3d |
| Paper | LAM: Large Avatar Model for One-Shot Animatable Gaussian Head (SIGGRAPH 2025) |
| Architecture | Wav2Vec2-base fine-tuned for A2E with dual-head output (A2E + CTC ASR) |
| License | Apache 2.0 (inherited from upstream) |
| ONNX export | omote-ai (surgical fp16 conversion, external data format) |
## Model Variants
### Recommended: `model_fp16.onnx` (root)

The recommended default for all platforms: a surgical fp16 conversion that preserves all 32 decomposed LayerNorm subgraphs in fp32 for numerical stability, achieving cosine similarity >0.9999 vs the fp32 reference.
| File | Size | Description |
|---|---|---|
| `model_fp16.onnx` | 385 KB | ONNX graph (operators, topology) |
| `model_fp16.onnx.data` | 192 MB | Weights (fp16, external data) |
This is the variant used by the Omote SDK (`createA2E()` default).
### All Variants
| Variant | Path | Graph | Weights | Total | Quality vs fp32 | Use Case |
|---|---|---|---|---|---|---|
| fp16 (surgical) | `model_fp16.onnx` | 385 KB | 192 MB | ~192 MB | cosine >0.9999 | **Recommended.** Desktop WebGPU. |
| fp32 | `fp32/model.onnx` | 301 KB | 384 MB | ~384 MB | reference | Max quality / debugging |
| fp32 (single-file) | `model.onnx` | – | – | 384 MB | reference | Legacy backwards-compat |
| fp16 (naive) | `fp16/model.onnx` | 411 KB | 192 MB | ~192 MB | cosine ~0.999 | Superseded by root fp16 |
| int8 | `int8/model.onnx` | 541 KB | 97 MB | ~97 MB | degraded | Not recommended (see below) |
**Why not int8?** Dynamic int8 quantization reduces size by 75% but produces visibly degraded output for this architecture. Every strategy tested (selective quantization, MatMul-only, QUInt8) showed cosine similarity <0.99 and magnitude ratios of 0.45–1.21 vs fp32. The Wav2Vec2 transformer weights do not survive 8-bit quantization; fp16 at 192 MB is the quality floor.
All external-data variants use a small `.onnx` graph plus a large `.onnx.data` weights file, which enables:

- **Fast initial load** – the 385 KB graph loads instantly; the heavy weights stream separately
- **Efficient caching** – graph and weights are cached independently in IndexedDB
- **iOS compatibility** – ORT loads weights directly into WASM memory via URL pass-through, bypassing the JS heap entirely
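The split is easy to verify with the `onnx` Python package: the graph file loads without its weights, and every initializer points at the external data file. A quick inspection sketch (not required for inference):

```python
# Load only the 385 KB graph file and count initializers whose payload
# lives in the external model_fp16.onnx.data file.
import onnx
from onnx.external_data_helper import uses_external_data

model = onnx.load("model_fp16.onnx", load_external_data=False)
external = [init.name for init in model.graph.initializer if uses_external_data(init)]
print(f"{len(external)} initializers stored in model_fp16.onnx.data")
```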
## Quick Start

### TypeScript (Omote SDK)
```typescript
import { createA2E } from '@omote/core';

// Zero-config: auto-detects platform, fetches from this HF repo
const a2e = createA2E();
await a2e.load();

const { blendshapes } = await a2e.infer(audioSamples); // Float32Array[16000] @ 16kHz
// blendshapes: Float32Array[] – 30 frames × 52 ARKit weights
```
Self-host the model files for faster/more reliable delivery:
```typescript
import { configureModelUrls, createA2E } from '@omote/core';

configureModelUrls({
  lam: 'https://cdn.example.com/models/model_fp16.onnx',
  // SDK auto-derives model_fp16.onnx.data from the URL
});

const a2e = createA2E();
await a2e.load();
```
### Python (ONNX Runtime)
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CPUExecutionProvider"],
)

audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second at 16 kHz
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0  # neutral identity

outputs = session.run(None, {"audio": audio, "identity": identity})
blendshapes = outputs[1]  # [1, 30, 52] – 30 frames × 52 ARKit weights
asr_logits = outputs[0]   # [1, 49, 32] – CTC ASR logits
```
### Browser (ONNX Runtime Web, no framework)
```js
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create(
  'https://huggingface.co/omote-ai/lam-a2e/resolve/main/model_fp16.onnx',
  {
    executionProviders: ['webgpu'],
    externalData: [{
      path: 'model_fp16.onnx.data',
      data: 'https://huggingface.co/omote-ai/lam-a2e/resolve/main/model_fp16.onnx.data',
    }],
  },
);

const audio = new ort.Tensor('float32', new Float32Array(16000), [1, 16000]);
const identity = new ort.Tensor('float32', new Float32Array(12), [1, 12]);
identity.data[0] = 1.0;

const results = await session.run({ audio, identity });
const blendshapes = results.blendshapes; // [1, 30, 52]
```
## Input / Output Specification

### Inputs
| Name | Shape | Type | Description |
|---|---|---|---|
| `audio` | `[batch, samples]` | float32 | Raw audio at 16 kHz. Use 16000 samples (1 second) for 30 fps output. |
| `identity` | `[batch, 12]` | float32 | One-hot identity vector. Use `[1, 0, ..., 0]` for neutral. 12 identity classes. |
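The model expects mono float32 audio at exactly 16 kHz. A hedged preparation sketch, assuming `soundfile` and `scipy` are available (any loader/resampler works; `speech.wav` is a hypothetical input):

```python
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

samples, sr = sf.read("speech.wav", dtype="float32")
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # downmix to mono
if sr != 16000:
    samples = resample_poly(samples, 16000, sr).astype(np.float32)

# Take one second (16000 samples) for one 30-frame output window
audio = samples[np.newaxis, :16000]  # shape [1, 16000]
```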
### Outputs
| Name | Shape | Type | Description |
|---|---|---|---|
| `blendshapes` | `[batch, 30, 52]` | float32 | ARKit blendshape weights at 30 fps. Values in [0, 1]. |
| `asr_logits` | `[batch, 49, 32]` | float32 | CTC ASR logits at 50 fps. Small auxiliary head (~24K params). |
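The auxiliary ASR head can be sanity-checked with standard greedy CTC decoding. The 32-token vocabulary mapping is not published in this card, so the sketch below yields token IDs rather than text, and `blank_id = 0` is an assumption:

```python
import numpy as np

def ctc_greedy_ids(logits: np.ndarray, blank_id: int = 0) -> list[int]:
    """Standard greedy CTC over [frames, vocab]: collapse repeats, drop blanks."""
    ids = logits.argmax(axis=-1)
    collapsed, prev = [], None
    for t in ids:
        if t != prev and t != blank_id:
            collapsed.append(int(t))
        prev = t
    return collapsed

# asr_logits: [1, 49, 32] from the Python example above
token_ids = ctc_greedy_ids(asr_logits[0])
```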
### ARKit Blendshapes (52)

Output order matches the Apple ARKit standard:
```
 0: eyeBlinkLeft        1: eyeLookDownLeft      2: eyeLookInLeft
 3: eyeLookOutLeft      4: eyeLookUpLeft        5: eyeSquintLeft
 6: eyeWideLeft         7: eyeBlinkRight        8: eyeLookDownRight
 9: eyeLookInRight     10: eyeLookOutRight     11: eyeLookUpRight
12: eyeSquintRight     13: eyeWideRight        14: jawForward
15: jawLeft            16: jawRight            17: jawOpen
18: mouthClose         19: mouthFunnel         20: mouthPucker
21: mouthLeft          22: mouthRight          23: mouthSmileLeft
24: mouthSmileRight    25: mouthFrownLeft      26: mouthFrownRight
27: mouthDimpleLeft    28: mouthDimpleRight    29: mouthStretchLeft
30: mouthStretchRight  31: mouthRollLower      32: mouthRollUpper
33: mouthShrugLower    34: mouthShrugUpper     35: mouthPressLeft
36: mouthPressRight    37: mouthLowerDownLeft  38: mouthLowerDownRight
39: mouthUpperUpLeft   40: mouthUpperUpRight   41: browDownLeft
42: browDownRight      43: browInnerUp         44: browOuterUpLeft
45: browOuterUpRight   46: cheekPuff           47: cheekSquintLeft
48: cheekSquintRight   49: noseSneerLeft       50: noseSneerRight
51: tongueOut
```
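For Python post-processing, the list above can be zipped against the model output. A convenience sketch (`ARKIT_NAMES` is simply the 52 names in the order shown, not a constant exported by this repo):

```python
ARKIT_NAMES = [
    "eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft", "eyeLookOutLeft",
    "eyeLookUpLeft", "eyeSquintLeft", "eyeWideLeft", "eyeBlinkRight",
    "eyeLookDownRight", "eyeLookInRight", "eyeLookOutRight", "eyeLookUpRight",
    "eyeSquintRight", "eyeWideRight", "jawForward", "jawLeft",
    "jawRight", "jawOpen", "mouthClose", "mouthFunnel",
    "mouthPucker", "mouthLeft", "mouthRight", "mouthSmileLeft",
    "mouthSmileRight", "mouthFrownLeft", "mouthFrownRight", "mouthDimpleLeft",
    "mouthDimpleRight", "mouthStretchLeft", "mouthStretchRight", "mouthRollLower",
    "mouthRollUpper", "mouthShrugLower", "mouthShrugUpper", "mouthPressLeft",
    "mouthPressRight", "mouthLowerDownLeft", "mouthLowerDownRight", "mouthUpperUpLeft",
    "mouthUpperUpRight", "browDownLeft", "browDownRight", "browInnerUp",
    "browOuterUpLeft", "browOuterUpRight", "cheekPuff", "cheekSquintLeft",
    "cheekSquintRight", "noseSneerLeft", "noseSneerRight", "tongueOut",
]

# blendshapes: [1, 30, 52] from the Python example above
frame = dict(zip(ARKIT_NAMES, blendshapes[0, 0].tolist()))
print(frame["jawOpen"], frame["mouthSmileLeft"])
```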
## Platform Recommendations
| Platform | Recommended Variant | Backend | Notes |
|---|---|---|---|
| Chrome / Edge (Desktop) | `model_fp16.onnx` | WebGPU | Best performance. 192 MB download. |
| Chrome (Android) | `model_fp16.onnx` | WebGPU | Same as desktop. |
| Firefox | `model_fp16.onnx` | WASM | WebGPU behind a flag; WASM works well. |
| Safari / iOS | Use `wav2arkit_cpu` | WASM | LAM's graph optimization exceeds iOS memory limits. The Omote SDK handles this automatically via `createA2E()`. |
**Safari/iOS note:** The LAM Wav2Vec2 graph requires ~750–950 MB of peak memory during ONNX Runtime session creation (graph optimization), which exceeds iOS WebKit's ~1–1.5 GB tab memory limit. The `wav2arkit_cpu` model (1.86 MB graph + 402 MB weights, external data format) was designed specifically for this constraint: a simpler graph architecture, lower peak optimization memory, and an external-data split so ORT streams weights directly into WASM memory.
## Technical Details

### Surgical fp16 Conversion
The standard onnxconverter-common fp16 conversion produces visibly subdued blendshape output on WebGPU, despite passing cosine similarity checks. This is because the model has zero `LayerNormalization` ops – all 32 instances are decomposed into primitive operations:

```
ReduceMean → Sub → Pow → ReduceMean → Add(epsilon) → Sqrt → Div → Mul(gamma) → Add(beta)
```
Standard fp16 converters offer `op_block_list=["LayerNormalization"]` to keep LayerNorm in fp32, but this matched nothing in the decomposed graph. Our surgical conversion instead:

- Pattern-matches all 32 decomposed LayerNorm subgraphs in the ONNX graph
- Keeps the entire LN computation chain (9 ops each, 288 total) plus gamma/beta weights in fp32
- Converts everything else (attention, feed-forward, convolutions) to fp16
- Inserts explicit `Cast` nodes at fp32↔fp16 boundaries for correct type propagation
Result: 192 MB (same as naive fp16) but with cosine >0.9999 and magnitude ratio 0.998–1.002 vs the fp32 reference across all test inputs.
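For illustration, here is a minimal sketch of the pattern-matching step using the `onnx` package. It is a reconstruction under stated assumptions, not the actual omote-ai conversion tooling; one way to consume the result is onnxconverter-common's `convert_float_to_float16` with its `node_block_list` argument.

```python
import onnx

# Op sequence of a decomposed LayerNorm, as described above
LN_PATTERN = ["ReduceMean", "Sub", "Pow", "ReduceMean", "Add", "Sqrt", "Div", "Mul", "Add"]

def find_decomposed_layernorms(graph: onnx.GraphProto) -> list[list[onnx.NodeProto]]:
    """Trace ReduceMean→…→Add chains that form a decomposed LayerNorm."""
    # Map each tensor name to the nodes that consume it
    consumers: dict[str, list[onnx.NodeProto]] = {}
    for node in graph.node:
        for name in node.input:
            consumers.setdefault(name, []).append(node)

    chains = []
    for start in graph.node:
        if start.op_type != "ReduceMean":
            continue
        chain, current = [start], start
        for expected in LN_PATTERN[1:]:
            nxt = [n for out in current.output for n in consumers.get(out, [])
                   if n.op_type == expected]
            if not nxt:
                break
            current = nxt[0]
            chain.append(current)
        if len(chain) == len(LN_PATTERN):
            chains.append(chain)
    return chains

model = onnx.load("fp32/model.onnx")
chains = find_decomposed_layernorms(model.graph)
block_list = [n.name for chain in chains for n in chain]
print(f"{len(chains)} LayerNorm chains, {len(block_list)} ops to keep in fp32")
```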
### Model Architecture
| Property | Value |
|---|---|
| Parameters | 100.5M |
| Architecture | Wav2Vec2-base + dual-head (A2E + CTC ASR) |
| ONNX Opset | 14 (ai.onnx) |
| Sample Rate | 16 kHz |
| A2E Output FPS | 30 |
| ASR Output FPS | 50 |
| Blendshape Standard | Apple ARKit (52 shapes) |
| Identity Classes | 12 |
| ASR Head | ~24K params (< 0.1% of model, CTC vocabulary size 32) |
| Min ORT Version | 1.17.0 (external data support) |
### Quantization Results
Comprehensive testing of int8 quantization strategies on this architecture:
| Strategy | Size | Cosine vs fp32 | Mag Ratio | Verdict |
|---|---|---|---|---|
| fp16 surgical (Tier 2) | 192 MB | >0.9999 | 0.998–1.002 | Pass |
| int8 naive | 97 MB | 0.937–0.973 | 0.45–1.21 | Fail |
| int8 exclude LayerNorm | 97 MB | 0.937–0.973 | 0.45–1.21 | Fail |
| int8 MatMul-only | 139 MB | <0.99 | variable | Fail |
| int8 QUInt8 | 97 MB | <0.99 | variable | Fail |
The Wav2Vec2 transformer weights are highly sensitive to quantization. fp16 at 192MB is the minimum size that preserves output quality.
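For reference, the two quality columns can be computed as follows. This is a sketch assuming "Mag Ratio" means the L2-norm ratio of a variant's output against the fp32 reference; `ref` and `test` are illustrative names for paired blendshape outputs:

```python
import numpy as np

def compare(ref: np.ndarray, test: np.ndarray) -> tuple[float, float]:
    """Cosine similarity and magnitude ratio of a candidate variant vs fp32."""
    r, t = ref.ravel(), test.ravel()
    cosine = float(np.dot(r, t) / (np.linalg.norm(r) * np.linalg.norm(t)))
    magnitude_ratio = float(np.linalg.norm(t) / np.linalg.norm(r))
    return cosine, magnitude_ratio
```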
## Related Models
| Model | Repo | Use Case |
|---|---|---|
| wav2arkit_cpu | `myned-ai/wav2arkit_cpu` | Safari/iOS A2E fallback (WASM, 404 MB) |
| SenseVoice ASR | `omote-ai/sensevoice-asr` | Speech recognition + emotion (WASM, 228 MB int8) |
| Silero VAD | `deepghs/silero-vad-onnx` | Voice activity detection (~2 MB) |
## License

Apache 2.0, inherited from aigc3d/LAM_Audio2Expression.
This repository contains ONNX export artifacts (surgical fp16 conversion, external data format). No model weights were modified beyond precision conversion. All credit for the model architecture and training belongs to the original authors.