---
license: apache-2.0
library_name: onnxruntime
tags:
  - onnx
  - wav2vec2-a2e
  - audio-to-expression
  - lip-sync
  - blendshapes
  - arkit
  - wav2vec2
  - avatar
  - face-animation
  - webgpu
  - browser
  - real-time
pipeline_tag: audio-classification
---

# LAM Audio-to-Expression (ONNX)

ONNX export and quantization of LAM-Audio2Expression by aigc3d/3DAIGC (SIGGRAPH 2025).

The model converts raw 16 kHz audio into 52 ARKit blendshapes for real-time facial animation and lip sync. It is a fine-tuned Wav2Vec2 architecture that runs entirely in the browser via ONNX Runtime Web (WebGPU or WASM).

## Attribution

- **Original model:** LAM-Audio2Expression by aigc3d
- **Paper:** LAM: Large Avatar Model for One-Shot Animatable Gaussian Head (SIGGRAPH 2025)
- **Architecture:** Wav2Vec2-base fine-tuned for A2E with dual-head output (A2E + CTC ASR)
- **License:** Apache 2.0 (inherited from upstream)
- **ONNX export:** omote-ai (surgical fp16 conversion, external data format)

## Model Variants

### Recommended: model_fp16.onnx (root)

The recommended default for all platforms. A surgical fp16 conversion that preserves all 32 decomposed LayerNorm subgraphs in fp32 for numerical stability, achieving cosine similarity >0.9999 vs the fp32 reference.

| File | Size | Description |
|------|------|-------------|
| `model_fp16.onnx` | 385 KB | ONNX graph (operators, topology) |
| `model_fp16.onnx.data` | 192 MB | Weights (fp16, external data) |

This is the variant used by the Omote SDK (the `createA2E()` default).

### All Variants

| Variant | Path | Graph | Weights | Total | Quality vs fp32 | Use Case |
|---------|------|-------|---------|-------|-----------------|----------|
| fp16 (surgical) | `model_fp16.onnx` | 385 KB | 192 MB | ~192 MB | cosine >0.9999 | **Recommended.** Desktop WebGPU. |
| fp32 | `fp32/model.onnx` | 301 KB | 384 MB | ~384 MB | reference | Max quality / debugging |
| fp32 (single-file) | `model.onnx` | – | – | 384 MB | reference | Legacy backwards compatibility |
| fp16 (naive) | `fp16/model.onnx` | 411 KB | 192 MB | ~192 MB | cosine ~0.999 | Superseded by root fp16 |
| int8 | `int8/model.onnx` | 541 KB | 97 MB | ~97 MB | degraded | Not recommended (see below) |

**Why not int8?** Dynamic int8 quantization reduces size by 75% but produces visibly degraded output for this architecture. Every strategy tested (selective quantization, MatMul-only, QUInt8) showed cosine similarity <0.99 and magnitude ratios of 0.45–1.21 vs fp32. The Wav2Vec2 transformer weights do not survive 8-bit quantization; fp16 at 192 MB is the quality floor.
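Quality numbers like these can be reproduced by comparing a variant's blendshape output against the fp32 reference. A minimal comparison sketch (assumed methodology, not the exact benchmark script; it relies on output index 1 being `blendshapes`, as in the Python quick start below):

```python
import numpy as np
import onnxruntime as ort

def blendshape_output(model_path, audio, identity):
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return sess.run(None, {"audio": audio, "identity": identity})[1].ravel()

rng = np.random.default_rng(0)
audio = rng.standard_normal((1, 16000)).astype(np.float32)
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0

ref = blendshape_output("fp32/model.onnx", audio, identity)   # fp32 reference
test = blendshape_output("int8/model.onnx", audio, identity)  # or model_fp16.onnx

cosine = float(np.dot(ref, test) / (np.linalg.norm(ref) * np.linalg.norm(test)))
magnitude_ratio = float(np.linalg.norm(test) / np.linalg.norm(ref))
print(f"cosine={cosine:.5f}  magnitude_ratio={magnitude_ratio:.3f}")
```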

All external-data variants use a small `.onnx` graph plus a large `.onnx.data` weights file, which enables:

- **Fast initial load**: the 385 KB graph loads instantly while the heavy weights stream separately
- **Efficient caching**: graph and weights are cached independently in IndexedDB
- **iOS compatibility**: ORT loads weights directly into WASM memory via URL pass-through, bypassing the JS heap entirely
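This split is the standard ONNX external-data format; a small sketch of producing it with the `onnx` Python API (file names are placeholders):

```python
import onnx

model = onnx.load("model_single_file.onnx")  # placeholder: any single-file export
onnx.save_model(
    model,
    "model_external.onnx",                # small graph file
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model_external.onnx.data",  # weights file, referenced by relative path
    size_threshold=1024,                  # tensors above 1 KB go to the external file
)
# ONNX Runtime finds the .data file automatically when it sits next to the .onnx.
```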

## Quick Start

### TypeScript (Omote SDK)

```ts
import { createA2E } from '@omote/core';

// Zero-config: auto-detects platform, fetches from this HF repo
const a2e = createA2E();
await a2e.load();

const { blendshapes } = await a2e.infer(audioSamples); // Float32Array[16000] @ 16kHz
// blendshapes: Float32Array[] - 30 frames × 52 ARKit weights
```

Self-host the model files for faster, more reliable delivery:

```ts
import { configureModelUrls, createA2E } from '@omote/core';

configureModelUrls({
  lam: 'https://cdn.example.com/models/model_fp16.onnx',
  // SDK auto-derives model_fp16.onnx.data from the URL
});

const a2e = createA2E();
await a2e.load();
```

### Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CPUExecutionProvider"]
)

audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second at 16kHz
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0  # neutral identity

outputs = session.run(None, {"audio": audio, "identity": identity})
blendshapes = outputs[1]  # [1, 30, 52] - 30 frames × 52 ARKit weights
asr_logits = outputs[0]   # [1, 49, 32] - CTC ASR logits
```

### Browser (ONNX Runtime Web, no framework)

```ts
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create(
  'https://huggingface.co/omote-ai/lam-a2e/resolve/main/model_fp16.onnx',
  {
    executionProviders: ['webgpu'],
    externalData: [{
      path: 'model_fp16.onnx.data',
      data: 'https://huggingface.co/omote-ai/lam-a2e/resolve/main/model_fp16.onnx.data',
    }],
  }
);

const audio = new ort.Tensor('float32', new Float32Array(16000), [1, 16000]);
const identity = new ort.Tensor('float32', new Float32Array(12), [1, 12]);
identity.data[0] = 1.0;

const results = await session.run({ audio, identity });
const blendshapes = results.blendshapes; // [1, 30, 52]
```

## Input / Output Specification

### Inputs

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| `audio` | [batch, samples] | float32 | Raw audio at 16 kHz. Use 16000 samples (1 second) for 30fps output. |
| `identity` | [batch, 12] | float32 | One-hot identity vector (12 identity classes). Use `[1, 0, ..., 0]` for neutral. |
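Any source audio should be converted to mono float32 at 16 kHz before inference. A minimal prep sketch, assuming `librosa` is available and `speech.wav` is a placeholder path:

```python
import librosa
import numpy as np

# Resample to 16 kHz mono float32 ("speech.wav" is a placeholder path).
waveform, _ = librosa.load("speech.wav", sr=16000, mono=True)
audio = waveform[np.newaxis, :].astype(np.float32)  # shape [1, samples]

# One-hot identity vector over the 12 identity classes (index 0 = neutral).
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0
```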

### Outputs

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| `blendshapes` | [batch, 30, 52] | float32 | ARKit blendshape weights at 30fps. Values in [0, 1]. |
| `asr_logits` | [batch, 49, 32] | float32 | CTC ASR logits at 50fps. Small auxiliary head (~24K params). |
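Since each 16000-sample call yields 30 blendshape frames, longer clips can be processed in 1-second windows and the frames concatenated. A naive sketch that reuses `session`, `audio`, and `identity` from the snippets above (it ignores audio context across window boundaries):

```python
import numpy as np

CHUNK = 16000  # 1 second at 16 kHz -> 30 blendshape frames per run
frames = []
for start in range(0, audio.shape[1], CHUNK):
    window = audio[:, start:start + CHUNK]
    if window.shape[1] < CHUNK:
        # Zero-pad the final partial window to a full second.
        window = np.pad(window, ((0, 0), (0, CHUNK - window.shape[1])))
    asr_logits, blendshapes = session.run(None, {"audio": window, "identity": identity})
    frames.append(blendshapes[0])  # [30, 52]

animation = np.concatenate(frames, axis=0)  # [num_windows * 30, 52] at 30 fps
```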

## ARKit Blendshapes (52)

Output order matches the Apple ARKit standard:

```
 0: eyeBlinkLeft          1: eyeLookDownLeft      2: eyeLookInLeft
 3: eyeLookOutLeft        4: eyeLookUpLeft        5: eyeSquintLeft
 6: eyeWideLeft           7: eyeBlinkRight        8: eyeLookDownRight
 9: eyeLookInRight       10: eyeLookOutRight     11: eyeLookUpRight
12: eyeSquintRight       13: eyeWideRight        14: jawForward
15: jawLeft              16: jawRight            17: jawOpen
18: mouthClose           19: mouthFunnel         20: mouthPucker
21: mouthLeft            22: mouthRight          23: mouthSmileLeft
24: mouthSmileRight      25: mouthFrownLeft      26: mouthFrownRight
27: mouthDimpleLeft      28: mouthDimpleRight    29: mouthStretchLeft
30: mouthStretchRight    31: mouthRollLower      32: mouthRollUpper
33: mouthShrugLower      34: mouthShrugUpper     35: mouthPressLeft
36: mouthPressRight      37: mouthLowerDownLeft  38: mouthLowerDownRight
39: mouthUpperUpLeft     40: mouthUpperUpRight   41: browDownLeft
42: browDownRight        43: browInnerUp         44: browOuterUpLeft
45: browOuterUpRight     46: cheekPuff           47: cheekSquintLeft
48: cheekSquintRight     49: noseSneerLeft       50: noseSneerRight
51: tongueOut
```
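For downstream code, the same order can be written out as a Python list and zipped against a single output frame; `blendshapes` here is the `[1, 30, 52]` output from the Python example above:

```python
ARKIT_BLENDSHAPES = [
    "eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft", "eyeLookOutLeft",
    "eyeLookUpLeft", "eyeSquintLeft", "eyeWideLeft", "eyeBlinkRight",
    "eyeLookDownRight", "eyeLookInRight", "eyeLookOutRight", "eyeLookUpRight",
    "eyeSquintRight", "eyeWideRight", "jawForward", "jawLeft",
    "jawRight", "jawOpen", "mouthClose", "mouthFunnel",
    "mouthPucker", "mouthLeft", "mouthRight", "mouthSmileLeft",
    "mouthSmileRight", "mouthFrownLeft", "mouthFrownRight", "mouthDimpleLeft",
    "mouthDimpleRight", "mouthStretchLeft", "mouthStretchRight", "mouthRollLower",
    "mouthRollUpper", "mouthShrugLower", "mouthShrugUpper", "mouthPressLeft",
    "mouthPressRight", "mouthLowerDownLeft", "mouthLowerDownRight", "mouthUpperUpLeft",
    "mouthUpperUpRight", "browDownLeft", "browDownRight", "browInnerUp",
    "browOuterUpLeft", "browOuterUpRight", "cheekPuff", "cheekSquintLeft",
    "cheekSquintRight", "noseSneerLeft", "noseSneerRight", "tongueOut",
]

first_frame = dict(zip(ARKIT_BLENDSHAPES, blendshapes[0, 0]))  # name -> weight
print(first_frame["jawOpen"])
```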

## Platform Recommendations

| Platform | Recommended Variant | Backend | Notes |
|----------|---------------------|---------|-------|
| Chrome / Edge (Desktop) | `model_fp16.onnx` | WebGPU | Best performance. 192 MB download. |
| Chrome (Android) | `model_fp16.onnx` | WebGPU | Same as desktop. |
| Firefox | `model_fp16.onnx` | WASM | WebGPU is behind a flag; WASM works well. |
| Safari / iOS | Use `wav2arkit_cpu` | WASM | LAM's graph optimization exceeds iOS memory limits. The Omote SDK handles this automatically via `createA2E()`. |

Safari/iOS note: The LAM Wav2Vec2 graph requires ~750–950MB peak memory during ONNX Runtime session creation (graph optimization). This exceeds iOS WebKit's ~1–1.5GB tab memory limit. The wav2arkit_cpu model (1.86MB graph + 402MB weights, external data format) was designed specifically for this constraint: simpler graph architecture, lower peak optimization memory, external data split so ORT streams weights directly into WASM memory.

## Technical Details

### Surgical fp16 Conversion

The standard `onnxconverter-common` fp16 conversion produces visibly subdued blendshape output on WebGPU, despite passing cosine similarity checks. This is because the model has zero LayerNormalization ops; all 32 instances are decomposed into primitive operations:

```
ReduceMean → Sub → Pow → ReduceMean → Add(epsilon) → Sqrt → Div → Mul(gamma) → Add(beta)
```

Standard fp16 converters offer `op_block_list=["LayerNormalization"]` to keep LayerNorm in fp32, but this matched nothing in the decomposed graph. Our surgical conversion:

  1. Pattern-matches all 32 decomposed LayerNorm subgraphs in the ONNX graph
  2. Keeps the entire LN computation chain (9 ops each, 288 total) plus gamma/beta weights in fp32
  3. Converts everything else (attention, feed-forward, convolutions) to fp16
  4. Inserts explicit Cast nodes at fp32↔fp16 boundaries for correct type propagation

Result: 192MB (same as naive fp16) but with cosine >0.9999 and magnitude ratio 0.998–1.002 vs fp32 reference across all test inputs.
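A rough sketch of this approach with the public `onnx` and `onnxconverter_common` APIs (illustrative only: the pattern matching, chain handling, and conversion settings below are simplified assumptions, not the actual export script, and it assumes every node in the graph is named):

```python
import onnx
from onnxconverter_common import float16

# Load the fp32 export (onnx.load also pulls in its external .data file, if any).
model = onnx.load("fp32/model.onnx")
graph = model.graph

# tensor name -> producing node, for walking the decomposed LayerNorm chains
producer = {out: node for node in graph.node for out in node.output}

def consumers(tensor_name):
    return [n for n in graph.node if tensor_name in n.input]

blocked = set()
for pow_node in graph.node:
    # Anchor on the variance branch: Sub -> Pow -> ReduceMean -> Add(eps) -> Sqrt -> Div
    if pow_node.op_type != "Pow":
        continue
    sub = producer.get(pow_node.input[0])
    if sub is None or sub.op_type != "Sub":
        continue
    chain = [sub, pow_node]
    cur = pow_node
    for expected in ("ReduceMean", "Add", "Sqrt", "Div"):
        nxt = next((c for c in consumers(cur.output[0]) if c.op_type == expected), None)
        if nxt is None:
            break
        chain.append(nxt)
        cur = nxt
    else:
        # Full chain found: also keep the mean ReduceMean and the gamma/beta affine ops.
        mean = producer.get(sub.input[1])
        if mean is not None and mean.op_type == "ReduceMean":
            chain.append(mean)
        for mul in consumers(cur.output[0]):  # Mul(gamma) on the Div output
            if mul.op_type == "Mul":
                chain.append(mul)
                chain += [a for a in consumers(mul.output[0]) if a.op_type == "Add"]  # Add(beta)
        blocked.update(n.name for n in chain)

# Convert everything except the blocked LayerNorm chains; the converter inserts
# Cast nodes at the fp32/fp16 boundaries of the blocked nodes.
model_fp16 = float16.convert_float_to_float16(
    model, keep_io_types=True, node_block_list=sorted(blocked)
)

onnx.save_model(
    model_fp16, "model_fp16.onnx",
    save_as_external_data=True, all_tensors_to_one_file=True,
    location="model_fp16.onnx.data",
)
```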

### Model Architecture

| Property | Value |
|----------|-------|
| Parameters | 100.5M |
| Architecture | Wav2Vec2-base + dual-head (A2E + CTC ASR) |
| ONNX Opset | 14 (ai.onnx) |
| Sample Rate | 16 kHz |
| A2E Output FPS | 30 |
| ASR Output FPS | 50 |
| Blendshape Standard | Apple ARKit (52 shapes) |
| Identity Classes | 12 |
| ASR Head | ~24K params (<0.1% of model, CTC vocabulary size 32) |
| Min ORT Version | 1.17.0 (external data support) |
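The opset and parameter count above can be sanity-checked locally; a quick sketch, assuming the model files sit next to the script:

```python
import numpy as np
import onnx

model = onnx.load("model_fp16.onnx")  # also reads model_fp16.onnx.data
print({imp.domain or "ai.onnx": imp.version for imp in model.opset_import})

# Rough parameter count: total elements across all initializers.
n_params = sum(int(np.prod(init.dims)) for init in model.graph.initializer)
print(f"{n_params / 1e6:.1f}M parameters")
```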

### Quantization Results

Comprehensive testing of int8 quantization strategies on this architecture:

| Strategy | Size | Cosine vs fp32 | Mag Ratio | Verdict |
|----------|------|----------------|-----------|---------|
| fp16 surgical (Tier 2) | 192 MB | >0.9999 | 0.998–1.002 | Pass |
| int8 naive | 97 MB | 0.937–0.973 | 0.45–1.21 | Fail |
| int8 exclude LayerNorm | 97 MB | 0.937–0.973 | 0.45–1.21 | Fail |
| int8 MatMul-only | 139 MB | <0.99 | variable | Fail |
| int8 QUInt8 | 97 MB | <0.99 | variable | Fail |

The Wav2Vec2 transformer weights are highly sensitive to quantization. fp16 at 192MB is the minimum size that preserves output quality.
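For reference, the int8 strategies above map onto standard `onnxruntime.quantization` dynamic-quantization calls along these lines (a sketch with placeholder output paths, not the exact benchmark script):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Naive dynamic int8 over the whole graph.
quantize_dynamic(
    "fp32/model.onnx", "int8_naive/model.onnx",
    weight_type=QuantType.QInt8,
    use_external_data_format=True,
)

# MatMul-only: quantize MatMul weights, leave everything else in fp32.
quantize_dynamic(
    "fp32/model.onnx", "int8_matmul/model.onnx",
    op_types_to_quantize=["MatMul"],
    weight_type=QuantType.QInt8,
    use_external_data_format=True,
)

# Unsigned (QUInt8) weights instead of signed.
quantize_dynamic(
    "fp32/model.onnx", "int8_quint8/model.onnx",
    weight_type=QuantType.QUInt8,
    use_external_data_format=True,
)

# The "exclude LayerNorm" run would additionally pass nodes_to_exclude=[...]
# with the names of the decomposed LayerNorm nodes.
```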

## Related Models

| Model | Repo | Use Case |
|-------|------|----------|
| wav2arkit_cpu | myned-ai/wav2arkit_cpu | Safari/iOS A2E fallback (WASM, 404 MB) |
| SenseVoice ASR | omote-ai/sensevoice-asr | Speech recognition + emotion (WASM, 228 MB int8) |
| Silero VAD | deepghs/silero-vad-onnx | Voice activity detection (~2 MB) |

## License

Apache 2.0, inherited from aigc3d/LAM_Audio2Expression.

This repository contains ONNX export artifacts (surgical fp16 conversion, external data format). No model weights were modified beyond precision conversion. All credit for the model architecture and training belongs to the original authors.