LAM Audio-to-Expression (ONNX)

ONNX export and quantization of LAM-Audio2Expression by aigc3d/3DAIGC (SIGGRAPH 2025).

Converts raw 16kHz audio into 52 ARKit blendshapes for real-time facial animation and lip sync. The model is a fine-tuned Wav2Vec2 architecture and runs entirely in the browser via ONNX Runtime Web.

Attribution

  • Original model: LAM-Audio2Expression by aigc3d
  • Paper: LAM: Large Avatar Model for One-Shot Animatable Gaussian Head (SIGGRAPH 2025)
  • Architecture: Wav2Vec2-base fine-tuned for audio-to-expression with dual-head output (A2E + CTC ASR)
  • License: Apache 2.0 (inherited from upstream)
  • Export by: omote-ai (quantization, external data format, browser optimization)

Variants

Variant      Graph    Weights  Total    Use Case
fp16         411 KB   192 MB   ~192 MB  Desktop WebGPU (recommended default)
fp32         301 KB   384 MB   ~384 MB  Reference / max quality
int8         541 KB   96 MB    ~97 MB   Mobile / low-bandwidth (WASM only)
single-file  -        -        384 MB   Legacy backwards-compat (fp32)

All variants use external data format (small .onnx graph + large .onnx.data weights file), which enables:

  • iOS URL pass-through (ORT loads weights directly into WASM memory, bypassing JS heap)
  • Efficient caching (graph and weights cached separately)
  • Streaming weight loading
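
For local use, both files can be fetched from the Hub so the weights sit next to the graph. A minimal sketch using huggingface_hub (repo id taken from the browser example further down; the variant choice is illustrative):

from huggingface_hub import hf_hub_download
import onnxruntime as ort

# Both files land in the same cache snapshot directory, which is where
# ONNX Runtime resolves the relative model.onnx.data reference from.
graph_path = hf_hub_download("omote-ai/lam-a2e", "fp16/model.onnx")
hf_hub_download("omote-ai/lam-a2e", "fp16/model.onnx.data")

session = ort.InferenceSession(graph_path, providers=["CPUExecutionProvider"])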

Note: The int8 variant uses dynamic quantization (QInt8 weights, fp32 activations). It is recommended for WASM/CPU only; WebGPU has limited int8 operator support and may fall back to CPU silently.
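
The int8 variant can in principle be reproduced from the fp32 export with ONNX Runtime's dynamic quantizer; a minimal sketch (paths are illustrative, and the exact settings used for the published artifact are not documented here):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization: weights are stored as QInt8, activations stay fp32 at runtime.
quantize_dynamic(
    "fp32/model.onnx",
    "int8/model.onnx",
    weight_type=QuantType.QInt8,
)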

Quick Start

TypeScript (Omote SDK)

import { createA2E } from '@omote/core';

// Auto-detects platform: WebGPU (desktop) or wav2arkit_cpu (Safari/iOS)
const a2e = createA2E();
await a2e.load();

const { blendshapes } = await a2e.infer(audioSamples); // Float32Array[16000]
// blendshapes: Float32Array[], 30 frames × 52 ARKit weights

Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("fp16/model.onnx", providers=["CPUExecutionProvider"])

audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second at 16kHz
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0  # neutral identity

outputs = session.run(None, {"audio": audio, "identity": identity})
# outputs[0]: asr_logits [1, 49, 32]
# outputs[1]: blendshapes [1, 30, 52]
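
Real recordings need to be converted to mono 16kHz float32 before inference. A minimal preprocessing sketch, reusing the session and identity from the block above (librosa and the speech.wav path are illustrative choices; feeding a full clip in one call assumes the dynamic samples axis accepts arbitrary durations):

import librosa

# Decode and resample to mono 16 kHz; librosa returns float32 samples in [-1, 1].
samples, _ = librosa.load("speech.wav", sr=16000, mono=True)
audio = samples[np.newaxis, :]  # [1, num_samples]

asr_logits, blendshapes = session.run(None, {"audio": audio, "identity": identity})
# blendshapes: [1, ~num_samples / 16000 * 30, 52] at 30fps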

Browser (ONNX Runtime Web)

import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create(
  'https://huggingface.co/omote-ai/lam-a2e/resolve/main/fp16/model.onnx',
  {
    executionProviders: ['webgpu'],
    externalData: [{
      path: 'model.onnx.data',
      data: 'https://huggingface.co/omote-ai/lam-a2e/resolve/main/fp16/model.onnx.data',
    }],
  }
);

const audio = new ort.Tensor('float32', new Float32Array(16000), [1, 16000]);
const identity = new ort.Tensor('float32', new Float32Array(12), [1, 12]);
const results = await session.run({ audio, identity });
const blendshapes = results.blendshapes; // [1, 30, 52]
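// blendshapes is an ort.Tensor: .dims is [1, 30, 52] and .data is a flat Float32Array
// in row-major order (element index = frame * 52 + blendshapeIndex)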

Input / Output Specification

Inputs

Name      Shape             Type     Description
audio     [batch, samples]  float32  Raw audio at 16kHz. Use 16000 samples (1 second) for 30fps output.
identity  [batch, 12]       float32  One-hot identity vector. Use [1, 0, ..., 0] for neutral.

Outputs

Name         Shape                  Type     Description
blendshapes  [batch, time_a2e, 52]  float32  ARKit blendshape weights (30fps). Values in [0, 1].
asr_logits   [batch, time_asr, 32]  float32  CTC ASR logits (50fps). Small head, ~24K params.
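
For playback, a renderer typically samples the 30fps blendshape track by wall-clock time. A small helper sketch, reusing the blendshapes array from the Python examples above (frame_at and the clipping step are ours, not part of the export):

import numpy as np

def frame_at(blendshapes, t_seconds, fps=30):
    """Return the 52 ARKit weights nearest to playback timestamp t_seconds."""
    frames = blendshapes[0]                                # [time_a2e, 52]
    idx = min(int(round(t_seconds * fps)), len(frames) - 1)
    return np.clip(frames[idx], 0.0, 1.0)                  # clip guards against small numerical excursions

weights = frame_at(blendshapes, t_seconds=0.5)             # roughly frame 15 of a 1-second clip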

ARKit Blendshapes (52)

Output order matches the Apple ARKit standard:

 0: eyeBlinkLeft        1: eyeLookDownLeft     2: eyeLookInLeft
 3: eyeLookOutLeft       4: eyeLookUpLeft       5: eyeSquintLeft
 6: eyeWideLeft          7: eyeBlinkRight       8: eyeLookDownRight
 9: eyeLookInRight      10: eyeLookOutRight    11: eyeLookUpRight
12: eyeSquintRight      13: eyeWideRight       14: jawForward
15: jawLeft             16: jawRight           17: jawOpen
18: mouthClose          19: mouthFunnel        20: mouthPucker
21: mouthLeft           22: mouthRight         23: mouthSmileLeft
24: mouthSmileRight     25: mouthFrownLeft     26: mouthFrownRight
27: mouthDimpleLeft     28: mouthDimpleRight   29: mouthStretchLeft
30: mouthStretchRight   31: mouthRollLower     32: mouthRollUpper
33: mouthShrugLower     34: mouthShrugUpper    35: mouthPressLeft
36: mouthPressRight     37: mouthLowerDownLeft 38: mouthLowerDownRight
39: mouthUpperUpLeft    40: mouthUpperUpRight  41: browDownLeft
42: browDownRight       43: browInnerUp        44: browOuterUpLeft
45: browOuterUpRight    46: cheekPuff          47: cheekSquintLeft
48: cheekSquintRight    49: noseSneerLeft      50: noseSneerRight
51: tongueOut
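
The index-to-name mapping above can be turned into a lookup table when driving an ARKit-style rig from Python. A convenience sketch (the name list is copied verbatim from the table; frames_to_dicts is a hypothetical helper, not part of the export):

ARKIT_NAMES = [
    "eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft", "eyeLookOutLeft",
    "eyeLookUpLeft", "eyeSquintLeft", "eyeWideLeft", "eyeBlinkRight",
    "eyeLookDownRight", "eyeLookInRight", "eyeLookOutRight", "eyeLookUpRight",
    "eyeSquintRight", "eyeWideRight", "jawForward", "jawLeft", "jawRight",
    "jawOpen", "mouthClose", "mouthFunnel", "mouthPucker", "mouthLeft",
    "mouthRight", "mouthSmileLeft", "mouthSmileRight", "mouthFrownLeft",
    "mouthFrownRight", "mouthDimpleLeft", "mouthDimpleRight", "mouthStretchLeft",
    "mouthStretchRight", "mouthRollLower", "mouthRollUpper", "mouthShrugLower",
    "mouthShrugUpper", "mouthPressLeft", "mouthPressRight", "mouthLowerDownLeft",
    "mouthLowerDownRight", "mouthUpperUpLeft", "mouthUpperUpRight", "browDownLeft",
    "browDownRight", "browInnerUp", "browOuterUpLeft", "browOuterUpRight",
    "cheekPuff", "cheekSquintLeft", "cheekSquintRight", "noseSneerLeft",
    "noseSneerRight", "tongueOut",
]
assert len(ARKIT_NAMES) == 52

def frames_to_dicts(blendshapes):
    """Convert a [1, time_a2e, 52] output array into per-frame {name: weight} dicts."""
    return [dict(zip(ARKIT_NAMES, frame.tolist())) for frame in blendshapes[0]]

# Example: the jawOpen curve across all frames of a clip.
jaw_open = blendshapes[0][:, ARKIT_NAMES.index("jawOpen")]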

Platform Recommendations

Platform               Variant            Backend  Notes
Chrome/Edge (Desktop)  fp16               WebGPU   Recommended default. 50% smaller, same quality.
Chrome (Android)       fp16               WebGPU   Same as desktop.
Firefox                fp32 or fp16       WASM     WebGPU behind a flag.
Safari (macOS/iOS)     Use wav2arkit_cpu  WASM     LAM's graph optimization exceeds iOS memory limits.
Low-bandwidth          int8               WASM     75% smaller. WASM only; limited WebGPU int8 support.

Model Details

Property             Value
Parameters           100.5M
Architecture         Wav2Vec2-base + dual-head (A2E + CTC ASR)
ONNX Opset           14
Sample Rate          16kHz
Output FPS           30 (A2E) / 50 (ASR)
Blendshape Standard  Apple ARKit (52)
Training Data        Not disclosed by upstream
Min ORT Version      1.17.0 (external data support)
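
These properties can be checked against a local copy of the export; a minimal sketch using onnx and onnxruntime (paths assume the fp16 graph and weights sit next to each other, as described under Variants):

import onnx
import onnxruntime as ort

# Opset check without loading the 192 MB weights file.
model = onnx.load("fp16/model.onnx", load_external_data=False)
print({imp.domain or "ai.onnx": imp.version for imp in model.opset_import})  # expect 14 for the default domain

# I/O signature check; this does require model.onnx.data to be present next to the graph.
session = ort.InferenceSession("fp16/model.onnx", providers=["CPUExecutionProvider"])
for arg in session.get_inputs():
    print("input ", arg.name, arg.shape, arg.type)
for arg in session.get_outputs():
    print("output", arg.name, arg.shape, arg.type)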

License

Apache 2.0, inherited from aigc3d/LAM_Audio2Expression.

This repository contains only ONNX export artifacts (quantization, external data format conversion). No model weights were modified beyond precision conversion. All credit for the model architecture and training belongs to the original authors.
