LAM Audio-to-Expression (ONNX)

ONNX export and quantization of LAM-Audio2Expression by aigc3d/3DAIGC (SIGGRAPH 2025).

Converts raw 16kHz audio into 52 ARKit blendshapes for real-time facial animation and lip sync. The model is a fine-tuned Wav2Vec2 architecture and runs entirely in the browser via ONNX Runtime Web.

Attribution

  • Original model: LAM-Audio2Expression by aigc3d
  • Paper: LAM: Large Avatar Model for One-Shot Animatable Gaussian Head (SIGGRAPH 2025)
  • Architecture: Wav2Vec2-base fine-tuned for audio-to-expression with dual-head output (A2E + CTC ASR)
  • License: Apache 2.0 (inherited from upstream)
  • Export by: omote-ai (quantization, external data format, browser optimization)

Variants

Variant      Graph    Weights  Total    Use Case
fp16         411 KB   192 MB   ~192 MB  Desktop WebGPU (recommended default)
fp32         301 KB   384 MB   ~384 MB  Reference / max quality
int8         541 KB   96 MB    ~97 MB   Mobile / low-bandwidth (WASM only)
single-file  -        -        384 MB   Legacy backwards-compat (fp32)

All variants use external data format (small .onnx graph + large .onnx.data weights file), which enables:

  • iOS URL pass-through (ORT loads weights directly into WASM memory, bypassing JS heap)
  • Efficient caching (graph and weights cached separately)
  • Streaming weight loading
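
For local use, both files can be fetched from the Hub so the weights sit next to the graph. A minimal sketch using huggingface_hub (repo id taken from the browser example further down; the variant choice is illustrative):

from huggingface_hub import hf_hub_download
import onnxruntime as ort

# Both files land in the same cache snapshot directory, which is where
# ONNX Runtime resolves the relative model.onnx.data reference from.
graph_path = hf_hub_download("omote-ai/lam-a2e", "fp16/model.onnx")
hf_hub_download("omote-ai/lam-a2e", "fp16/model.onnx.data")

session = ort.InferenceSession(graph_path, providers=["CPUExecutionProvider"])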

Note: The int8 variant uses dynamic quantization (QInt8 weights, fp32 activations). It is recommended for WASM/CPU only; WebGPU has limited int8 operator support and may fall back to CPU silently.
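
The int8 variant can in principle be reproduced from the fp32 export with ONNX Runtime's dynamic quantizer; a minimal sketch (paths are illustrative, and the exact settings used for the published artifact are not documented here):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization: weights are stored as QInt8, activations stay fp32 at runtime.
quantize_dynamic(
    "fp32/model.onnx",
    "int8/model.onnx",
    weight_type=QuantType.QInt8,
)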

Quick Start

TypeScript (Omote SDK)

import { createA2E } from '@omote/core';

// Auto-detects platform: WebGPU (desktop) or wav2arkit_cpu (Safari/iOS)
const a2e = createA2E();
await a2e.load();

const { blendshapes } = await a2e.infer(audioSamples); // Float32Array[16000]
// blendshapes: Float32Array[], 30 frames × 52 ARKit weights

Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("fp16/model.onnx", providers=["CPUExecutionProvider"])

audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second at 16kHz
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0  # neutral identity

outputs = session.run(None, {"audio": audio, "identity": identity})
# outputs[0]: asr_logits [1, 49, 32]
# outputs[1]: blendshapes [1, 30, 52]
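
Real recordings need to be converted to mono 16kHz float32 before inference. A minimal preprocessing sketch, reusing the session and identity from the block above (librosa and the speech.wav path are illustrative choices; feeding a full clip in one call assumes the dynamic samples axis accepts arbitrary durations):

import librosa

# Decode and resample to mono 16 kHz; librosa returns float32 samples in [-1, 1].
samples, _ = librosa.load("speech.wav", sr=16000, mono=True)
audio = samples[np.newaxis, :]  # [1, num_samples]

asr_logits, blendshapes = session.run(None, {"audio": audio, "identity": identity})
# blendshapes: [1, ~num_samples / 16000 * 30, 52] at 30fps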

Browser (ONNX Runtime Web)

import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create(
  'https://huggingface.co/omote-ai/lam-a2e/resolve/main/fp16/model.onnx',
  {
    executionProviders: ['webgpu'],
    externalData: [{
      path: 'model.onnx.data',
      data: 'https://huggingface.co/omote-ai/lam-a2e/resolve/main/fp16/model.onnx.data',
    }],
  }
);

const audio = new ort.Tensor('float32', new Float32Array(16000), [1, 16000]);
const identity = new ort.Tensor('float32', new Float32Array(12), [1, 12]);
const results = await session.run({ audio, identity });
const blendshapes = results.blendshapes; // [1, 30, 52]
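// blendshapes is an ort.Tensor: .dims is [1, 30, 52] and .data is a flat Float32Array
// in row-major order (element index = frame * 52 + blendshapeIndex)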

Input / Output Specification

Inputs

Name      Shape             Type     Description
audio     [batch, samples]  float32  Raw audio at 16kHz. Use 16000 samples (1 second) for 30fps output.
identity  [batch, 12]       float32  One-hot identity vector. Use [1, 0, ..., 0] for neutral.

Outputs

Name         Shape                  Type     Description
blendshapes  [batch, time_a2e, 52]  float32  ARKit blendshape weights (30fps). Values in [0, 1].
asr_logits   [batch, time_asr, 32]  float32  CTC ASR logits (50fps). Small head, ~24K params.
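
For playback, a renderer typically samples the 30fps blendshape track by wall-clock time. A small helper sketch, reusing the blendshapes array from the Python examples above (frame_at and the clipping step are ours, not part of the export):

import numpy as np

def frame_at(blendshapes, t_seconds, fps=30):
    """Return the 52 ARKit weights nearest to playback timestamp t_seconds."""
    frames = blendshapes[0]                                # [time_a2e, 52]
    idx = min(int(round(t_seconds * fps)), len(frames) - 1)
    return np.clip(frames[idx], 0.0, 1.0)                  # clip guards against small numerical excursions

weights = frame_at(blendshapes, t_seconds=0.5)             # roughly frame 15 of a 1-second clip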

ARKit Blendshapes (52)

Output order matches the Apple ARKit standard:

 0: eyeBlinkLeft        1: eyeLookDownLeft     2: eyeLookInLeft
 3: eyeLookOutLeft       4: eyeLookUpLeft       5: eyeSquintLeft
 6: eyeWideLeft          7: eyeBlinkRight       8: eyeLookDownRight
 9: eyeLookInRight      10: eyeLookOutRight    11: eyeLookUpRight
12: eyeSquintRight      13: eyeWideRight       14: jawForward
15: jawLeft             16: jawRight           17: jawOpen
18: mouthClose          19: mouthFunnel        20: mouthPucker
21: mouthLeft           22: mouthRight         23: mouthSmileLeft
24: mouthSmileRight     25: mouthFrownLeft     26: mouthFrownRight
27: mouthDimpleLeft     28: mouthDimpleRight   29: mouthStretchLeft
30: mouthStretchRight   31: mouthRollLower     32: mouthRollUpper
33: mouthShrugLower     34: mouthShrugUpper    35: mouthPressLeft
36: mouthPressRight     37: mouthLowerDownLeft 38: mouthLowerDownRight
39: mouthUpperUpLeft    40: mouthUpperUpRight  41: browDownLeft
42: browDownRight       43: browInnerUp        44: browOuterUpLeft
45: browOuterUpRight    46: cheekPuff          47: cheekSquintLeft
48: cheekSquintRight    49: noseSneerLeft      50: noseSneerRight
51: tongueOut
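
The index-to-name mapping above can be turned into a lookup table when driving an ARKit-style rig from Python. A convenience sketch (the name list is copied verbatim from the table; frames_to_dicts is a hypothetical helper, not part of the export):

ARKIT_NAMES = [
    "eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft", "eyeLookOutLeft",
    "eyeLookUpLeft", "eyeSquintLeft", "eyeWideLeft", "eyeBlinkRight",
    "eyeLookDownRight", "eyeLookInRight", "eyeLookOutRight", "eyeLookUpRight",
    "eyeSquintRight", "eyeWideRight", "jawForward", "jawLeft", "jawRight",
    "jawOpen", "mouthClose", "mouthFunnel", "mouthPucker", "mouthLeft",
    "mouthRight", "mouthSmileLeft", "mouthSmileRight", "mouthFrownLeft",
    "mouthFrownRight", "mouthDimpleLeft", "mouthDimpleRight", "mouthStretchLeft",
    "mouthStretchRight", "mouthRollLower", "mouthRollUpper", "mouthShrugLower",
    "mouthShrugUpper", "mouthPressLeft", "mouthPressRight", "mouthLowerDownLeft",
    "mouthLowerDownRight", "mouthUpperUpLeft", "mouthUpperUpRight", "browDownLeft",
    "browDownRight", "browInnerUp", "browOuterUpLeft", "browOuterUpRight",
    "cheekPuff", "cheekSquintLeft", "cheekSquintRight", "noseSneerLeft",
    "noseSneerRight", "tongueOut",
]
assert len(ARKIT_NAMES) == 52

def frames_to_dicts(blendshapes):
    """Convert a [1, time_a2e, 52] output array into per-frame {name: weight} dicts."""
    return [dict(zip(ARKIT_NAMES, frame.tolist())) for frame in blendshapes[0]]

# Example: the jawOpen curve across all frames of a clip.
jaw_open = blendshapes[0][:, ARKIT_NAMES.index("jawOpen")]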

Platform Recommendations

Platform               Variant            Backend  Notes
Chrome/Edge (Desktop)  fp16               WebGPU   Recommended default. 50% smaller, same quality.
Chrome (Android)       fp16               WebGPU   Same as desktop.
Firefox                fp32 or fp16       WASM     WebGPU behind a flag.
Safari (macOS/iOS)     Use wav2arkit_cpu  WASM     LAM's graph optimization exceeds iOS memory limits.
Low-bandwidth          int8               WASM     75% smaller. WASM only; limited WebGPU int8 support.

Model Details

Property             Value
Parameters           100.5M
Architecture         Wav2Vec2-base + dual-head (A2E + CTC ASR)
ONNX Opset           14
Sample Rate          16kHz
Output FPS           30 (A2E) / 50 (ASR)
Blendshape Standard  Apple ARKit (52)
Training Data        Not disclosed by upstream
Min ORT Version      1.17.0 (external data support)
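
These properties can be checked against a local copy of the export; a minimal sketch using onnx and onnxruntime (paths assume the fp16 graph and weights sit next to each other, as described under Variants):

import onnx
import onnxruntime as ort

# Opset check without loading the 192 MB weights file.
model = onnx.load("fp16/model.onnx", load_external_data=False)
print({imp.domain or "ai.onnx": imp.version for imp in model.opset_import})  # expect 14 for the default domain

# I/O signature check; this does require model.onnx.data to be present next to the graph.
session = ort.InferenceSession("fp16/model.onnx", providers=["CPUExecutionProvider"])
for arg in session.get_inputs():
    print("input ", arg.name, arg.shape, arg.type)
for arg in session.get_outputs():
    print("output", arg.name, arg.shape, arg.type)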

License

Apache 2.0, inherited from aigc3d/LAM_Audio2Expression.

This repository contains only ONNX export artifacts (quantization, external data format conversion). No model weights were modified beyond precision conversion. All credit for the model architecture and training belongs to the original authors.
