---
license: apache-2.0
library_name: onnxruntime
tags:
  - onnx
  - wav2vec2-a2e
  - audio-to-expression
  - lip-sync
  - blendshapes
  - arkit
  - wav2vec2
  - avatar
  - face-animation
  - webgpu
  - browser
  - real-time
pipeline_tag: audio-classification
---

# LAM Audio-to-Expression (ONNX)

ONNX export and quantization of LAM-Audio2Expression by aigc3d/3DAIGC (SIGGRAPH 2025).

The model converts raw 16 kHz audio into 52 ARKit blendshapes for real-time facial animation and lip sync. It is a fine-tuned Wav2Vec2 architecture that runs entirely in the browser via ONNX Runtime Web (WebGPU or WASM).

## Attribution

- **Original model:** LAM-Audio2Expression by aigc3d
- **Paper:** LAM: Large Avatar Model for One-Shot Animatable Gaussian Head (SIGGRAPH 2025)
- **Architecture:** Wav2Vec2-base fine-tuned for A2E with dual-head output (A2E + CTC ASR)
- **License:** Apache 2.0 (inherited from upstream)
- **ONNX export:** omote-ai (surgical fp16 conversion, external data format)

## Model Variants

### Recommended: model_fp16.onnx (root)

The recommended default for all platforms. A surgical fp16 conversion that preserves all 32 decomposed LayerNorm subgraphs in fp32 for numerical stability, achieving cosine similarity >0.9999 vs the fp32 reference.

| File | Size | Description |
|------|------|-------------|
| `model_fp16.onnx` | 385 KB | ONNX graph (operators, topology) |
| `model_fp16.onnx.data` | 192 MB | Weights (fp16, external data) |

This is the variant used by the Omote SDK (the `createA2E()` default).

### All Variants

| Variant | Path | Graph | Weights | Total | Quality vs fp32 | Use Case |
|---------|------|-------|---------|-------|-----------------|----------|
| fp16 (surgical) | `model_fp16.onnx` | 385 KB | 192 MB | ~192 MB | cosine >0.9999 | **Recommended.** Desktop WebGPU. |
| fp32 | `fp32/model.onnx` | 301 KB | 384 MB | ~384 MB | reference | Max quality / debugging |
| fp32 (single-file) | `model.onnx` | – | – | 384 MB | reference | Legacy backwards compatibility |
| fp16 (naive) | `fp16/model.onnx` | 411 KB | 192 MB | ~192 MB | cosine ~0.999 | Superseded by root fp16 |
| int8 | `int8/model.onnx` | 541 KB | 97 MB | ~97 MB | degraded | Not recommended (see below) |

**Why not int8?** Dynamic int8 quantization reduces size by 75% but produces visibly degraded output for this architecture. Every strategy tested (selective quantization, MatMul-only, QUInt8) showed cosine similarity <0.99 and magnitude ratios of 0.45–1.21 vs fp32. The Wav2Vec2 transformer weights do not survive 8-bit quantization; fp16 at 192 MB is the quality floor.
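Quality numbers like these can be reproduced by comparing a variant's blendshape output against the fp32 reference. A minimal comparison sketch (assumed methodology, not the exact benchmark script; it relies on output index 1 being `blendshapes`, as in the Python quick start below):

```python
import numpy as np
import onnxruntime as ort

def blendshape_output(model_path, audio, identity):
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return sess.run(None, {"audio": audio, "identity": identity})[1].ravel()

rng = np.random.default_rng(0)
audio = rng.standard_normal((1, 16000)).astype(np.float32)
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0

ref = blendshape_output("fp32/model.onnx", audio, identity)   # fp32 reference
test = blendshape_output("int8/model.onnx", audio, identity)  # or model_fp16.onnx

cosine = float(np.dot(ref, test) / (np.linalg.norm(ref) * np.linalg.norm(test)))
magnitude_ratio = float(np.linalg.norm(test) / np.linalg.norm(ref))
print(f"cosine={cosine:.5f}  magnitude_ratio={magnitude_ratio:.3f}")
```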

All external-data variants use a small `.onnx` graph plus a large `.onnx.data` weights file, which enables:

- **Fast initial load**: the 385 KB graph loads instantly while the heavy weights stream separately
- **Efficient caching**: graph and weights are cached independently in IndexedDB
- **iOS compatibility**: ORT loads weights directly into WASM memory via URL pass-through, bypassing the JS heap entirely
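This split is the standard ONNX external-data format; a small sketch of producing it with the `onnx` Python API (file names are placeholders):

```python
import onnx

model = onnx.load("model_single_file.onnx")  # placeholder: any single-file export
onnx.save_model(
    model,
    "model_external.onnx",                # small graph file
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model_external.onnx.data",  # weights file, referenced by relative path
    size_threshold=1024,                  # tensors above 1 KB go to the external file
)
# ONNX Runtime finds the .data file automatically when it sits next to the .onnx.
```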

## Quick Start

### TypeScript (Omote SDK)

```ts
import { createA2E } from '@omote/core';

// Zero-config: auto-detects platform, fetches from this HF repo
const a2e = createA2E();
await a2e.load();

const { blendshapes } = await a2e.infer(audioSamples); // Float32Array[16000] @ 16kHz
// blendshapes: Float32Array[] - 30 frames × 52 ARKit weights
```

Self-host the model files for faster, more reliable delivery:

```ts
import { configureModelUrls, createA2E } from '@omote/core';

configureModelUrls({
  lam: 'https://cdn.example.com/models/model_fp16.onnx',
  // SDK auto-derives model_fp16.onnx.data from the URL
});

const a2e = createA2E();
await a2e.load();
```

### Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CPUExecutionProvider"]
)

audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second at 16kHz
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0  # neutral identity

outputs = session.run(None, {"audio": audio, "identity": identity})
blendshapes = outputs[1]  # [1, 30, 52] - 30 frames × 52 ARKit weights
asr_logits = outputs[0]   # [1, 49, 32] - CTC ASR logits
```

### Browser (ONNX Runtime Web, no framework)

```ts
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create(
  'https://huggingface.co/omote-ai/lam-a2e/resolve/main/model_fp16.onnx',
  {
    executionProviders: ['webgpu'],
    externalData: [{
      path: 'model_fp16.onnx.data',
      data: 'https://huggingface.co/omote-ai/lam-a2e/resolve/main/model_fp16.onnx.data',
    }],
  }
);

const audio = new ort.Tensor('float32', new Float32Array(16000), [1, 16000]);
const identity = new ort.Tensor('float32', new Float32Array(12), [1, 12]);
identity.data[0] = 1.0;

const results = await session.run({ audio, identity });
const blendshapes = results.blendshapes; // [1, 30, 52]
```

## Input / Output Specification

### Inputs

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| `audio` | [batch, samples] | float32 | Raw audio at 16 kHz. Use 16000 samples (1 second) for 30fps output. |
| `identity` | [batch, 12] | float32 | One-hot identity vector (12 identity classes). Use `[1, 0, ..., 0]` for neutral. |
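Any source audio should be converted to mono float32 at 16 kHz before inference. A minimal prep sketch, assuming `librosa` is available and `speech.wav` is a placeholder path:

```python
import librosa
import numpy as np

# Resample to 16 kHz mono float32 ("speech.wav" is a placeholder path).
waveform, _ = librosa.load("speech.wav", sr=16000, mono=True)
audio = waveform[np.newaxis, :].astype(np.float32)  # shape [1, samples]

# One-hot identity vector over the 12 identity classes (index 0 = neutral).
identity = np.zeros((1, 12), dtype=np.float32)
identity[0, 0] = 1.0
```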

### Outputs

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| `blendshapes` | [batch, 30, 52] | float32 | ARKit blendshape weights at 30fps. Values in [0, 1]. |
| `asr_logits` | [batch, 49, 32] | float32 | CTC ASR logits at 50fps. Small auxiliary head (~24K params). |
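Since each 16000-sample call yields 30 blendshape frames, longer clips can be processed in 1-second windows and the frames concatenated. A naive sketch that reuses `session`, `audio`, and `identity` from the snippets above (it ignores audio context across window boundaries):

```python
import numpy as np

CHUNK = 16000  # 1 second at 16 kHz -> 30 blendshape frames per run
frames = []
for start in range(0, audio.shape[1], CHUNK):
    window = audio[:, start:start + CHUNK]
    if window.shape[1] < CHUNK:
        # Zero-pad the final partial window to a full second.
        window = np.pad(window, ((0, 0), (0, CHUNK - window.shape[1])))
    asr_logits, blendshapes = session.run(None, {"audio": window, "identity": identity})
    frames.append(blendshapes[0])  # [30, 52]

animation = np.concatenate(frames, axis=0)  # [num_windows * 30, 52] at 30 fps
```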

## ARKit Blendshapes (52)

Output order matches the Apple ARKit standard:

```
 0: eyeBlinkLeft          1: eyeLookDownLeft      2: eyeLookInLeft
 3: eyeLookOutLeft        4: eyeLookUpLeft        5: eyeSquintLeft
 6: eyeWideLeft           7: eyeBlinkRight        8: eyeLookDownRight
 9: eyeLookInRight       10: eyeLookOutRight     11: eyeLookUpRight
12: eyeSquintRight       13: eyeWideRight        14: jawForward
15: jawLeft              16: jawRight            17: jawOpen
18: mouthClose           19: mouthFunnel         20: mouthPucker
21: mouthLeft            22: mouthRight          23: mouthSmileLeft
24: mouthSmileRight      25: mouthFrownLeft      26: mouthFrownRight
27: mouthDimpleLeft      28: mouthDimpleRight    29: mouthStretchLeft
30: mouthStretchRight    31: mouthRollLower      32: mouthRollUpper
33: mouthShrugLower      34: mouthShrugUpper     35: mouthPressLeft
36: mouthPressRight      37: mouthLowerDownLeft  38: mouthLowerDownRight
39: mouthUpperUpLeft     40: mouthUpperUpRight   41: browDownLeft
42: browDownRight        43: browInnerUp         44: browOuterUpLeft
45: browOuterUpRight     46: cheekPuff           47: cheekSquintLeft
48: cheekSquintRight     49: noseSneerLeft       50: noseSneerRight
51: tongueOut
```
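For downstream code, the same order can be written out as a Python list and zipped against a single output frame; `blendshapes` here is the `[1, 30, 52]` output from the Python example above:

```python
ARKIT_BLENDSHAPES = [
    "eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft", "eyeLookOutLeft",
    "eyeLookUpLeft", "eyeSquintLeft", "eyeWideLeft", "eyeBlinkRight",
    "eyeLookDownRight", "eyeLookInRight", "eyeLookOutRight", "eyeLookUpRight",
    "eyeSquintRight", "eyeWideRight", "jawForward", "jawLeft",
    "jawRight", "jawOpen", "mouthClose", "mouthFunnel",
    "mouthPucker", "mouthLeft", "mouthRight", "mouthSmileLeft",
    "mouthSmileRight", "mouthFrownLeft", "mouthFrownRight", "mouthDimpleLeft",
    "mouthDimpleRight", "mouthStretchLeft", "mouthStretchRight", "mouthRollLower",
    "mouthRollUpper", "mouthShrugLower", "mouthShrugUpper", "mouthPressLeft",
    "mouthPressRight", "mouthLowerDownLeft", "mouthLowerDownRight", "mouthUpperUpLeft",
    "mouthUpperUpRight", "browDownLeft", "browDownRight", "browInnerUp",
    "browOuterUpLeft", "browOuterUpRight", "cheekPuff", "cheekSquintLeft",
    "cheekSquintRight", "noseSneerLeft", "noseSneerRight", "tongueOut",
]

first_frame = dict(zip(ARKIT_BLENDSHAPES, blendshapes[0, 0]))  # name -> weight
print(first_frame["jawOpen"])
```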

## Platform Recommendations

| Platform | Recommended Variant | Backend | Notes |
|----------|---------------------|---------|-------|
| Chrome / Edge (Desktop) | `model_fp16.onnx` | WebGPU | Best performance. 192 MB download. |
| Chrome (Android) | `model_fp16.onnx` | WebGPU | Same as desktop. |
| Firefox | `model_fp16.onnx` | WASM | WebGPU is behind a flag; WASM works well. |
| Safari / iOS | Use `wav2arkit_cpu` | WASM | LAM's graph optimization exceeds iOS memory limits. The Omote SDK handles this automatically via `createA2E()`. |

Safari/iOS note: The LAM Wav2Vec2 graph requires ~750–950MB peak memory during ONNX Runtime session creation (graph optimization). This exceeds iOS WebKit's ~1–1.5GB tab memory limit. The wav2arkit_cpu model (1.86MB graph + 402MB weights, external data format) was designed specifically for this constraint: simpler graph architecture, lower peak optimization memory, external data split so ORT streams weights directly into WASM memory.

## Technical Details

### Surgical fp16 Conversion

The standard `onnxconverter-common` fp16 conversion produces visibly subdued blendshape output on WebGPU, despite passing cosine similarity checks. This is because the model has zero LayerNormalization ops; all 32 instances are decomposed into primitive operations:

```
ReduceMean → Sub → Pow → ReduceMean → Add(epsilon) → Sqrt → Div → Mul(gamma) → Add(beta)
```

Standard fp16 converters offer `op_block_list=["LayerNormalization"]` to keep LayerNorm in fp32, but this matched nothing in the decomposed graph. Our surgical conversion:

  1. Pattern-matches all 32 decomposed LayerNorm subgraphs in the ONNX graph
  2. Keeps the entire LN computation chain (9 ops each, 288 total) plus gamma/beta weights in fp32
  3. Converts everything else (attention, feed-forward, convolutions) to fp16
  4. Inserts explicit Cast nodes at fp32↔fp16 boundaries for correct type propagation

Result: 192MB (same as naive fp16) but with cosine >0.9999 and magnitude ratio 0.998–1.002 vs fp32 reference across all test inputs.
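A rough sketch of this approach with the public `onnx` and `onnxconverter_common` APIs (illustrative only: the pattern matching, chain handling, and conversion settings below are simplified assumptions, not the actual export script, and it assumes every node in the graph is named):

```python
import onnx
from onnxconverter_common import float16

# Load the fp32 export (onnx.load also pulls in its external .data file, if any).
model = onnx.load("fp32/model.onnx")
graph = model.graph

# tensor name -> producing node, for walking the decomposed LayerNorm chains
producer = {out: node for node in graph.node for out in node.output}

def consumers(tensor_name):
    return [n for n in graph.node if tensor_name in n.input]

blocked = set()
for pow_node in graph.node:
    # Anchor on the variance branch: Sub -> Pow -> ReduceMean -> Add(eps) -> Sqrt -> Div
    if pow_node.op_type != "Pow":
        continue
    sub = producer.get(pow_node.input[0])
    if sub is None or sub.op_type != "Sub":
        continue
    chain = [sub, pow_node]
    cur = pow_node
    for expected in ("ReduceMean", "Add", "Sqrt", "Div"):
        nxt = next((c for c in consumers(cur.output[0]) if c.op_type == expected), None)
        if nxt is None:
            break
        chain.append(nxt)
        cur = nxt
    else:
        # Full chain found: also keep the mean ReduceMean and the gamma/beta affine ops.
        mean = producer.get(sub.input[1])
        if mean is not None and mean.op_type == "ReduceMean":
            chain.append(mean)
        for mul in consumers(cur.output[0]):  # Mul(gamma) on the Div output
            if mul.op_type == "Mul":
                chain.append(mul)
                chain += [a for a in consumers(mul.output[0]) if a.op_type == "Add"]  # Add(beta)
        blocked.update(n.name for n in chain)

# Convert everything except the blocked LayerNorm chains; the converter inserts
# Cast nodes at the fp32/fp16 boundaries of the blocked nodes.
model_fp16 = float16.convert_float_to_float16(
    model, keep_io_types=True, node_block_list=sorted(blocked)
)

onnx.save_model(
    model_fp16, "model_fp16.onnx",
    save_as_external_data=True, all_tensors_to_one_file=True,
    location="model_fp16.onnx.data",
)
```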

### Model Architecture

| Property | Value |
|----------|-------|
| Parameters | 100.5M |
| Architecture | Wav2Vec2-base + dual-head (A2E + CTC ASR) |
| ONNX Opset | 14 (ai.onnx) |
| Sample Rate | 16 kHz |
| A2E Output FPS | 30 |
| ASR Output FPS | 50 |
| Blendshape Standard | Apple ARKit (52 shapes) |
| Identity Classes | 12 |
| ASR Head | ~24K params (<0.1% of model, CTC vocabulary size 32) |
| Min ORT Version | 1.17.0 (external data support) |
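The opset and parameter count above can be sanity-checked locally; a quick sketch, assuming the model files sit next to the script:

```python
import numpy as np
import onnx

model = onnx.load("model_fp16.onnx")  # also reads model_fp16.onnx.data
print({imp.domain or "ai.onnx": imp.version for imp in model.opset_import})

# Rough parameter count: total elements across all initializers.
n_params = sum(int(np.prod(init.dims)) for init in model.graph.initializer)
print(f"{n_params / 1e6:.1f}M parameters")
```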

### Quantization Results

Comprehensive testing of int8 quantization strategies on this architecture:

| Strategy | Size | Cosine vs fp32 | Mag Ratio | Verdict |
|----------|------|----------------|-----------|---------|
| fp16 surgical (Tier 2) | 192 MB | >0.9999 | 0.998–1.002 | Pass |
| int8 naive | 97 MB | 0.937–0.973 | 0.45–1.21 | Fail |
| int8 exclude LayerNorm | 97 MB | 0.937–0.973 | 0.45–1.21 | Fail |
| int8 MatMul-only | 139 MB | <0.99 | variable | Fail |
| int8 QUInt8 | 97 MB | <0.99 | variable | Fail |

The Wav2Vec2 transformer weights are highly sensitive to quantization. fp16 at 192MB is the minimum size that preserves output quality.
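For reference, the int8 strategies above map onto standard `onnxruntime.quantization` dynamic-quantization calls along these lines (a sketch with placeholder output paths, not the exact benchmark script):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Naive dynamic int8 over the whole graph.
quantize_dynamic(
    "fp32/model.onnx", "int8_naive/model.onnx",
    weight_type=QuantType.QInt8,
    use_external_data_format=True,
)

# MatMul-only: quantize MatMul weights, leave everything else in fp32.
quantize_dynamic(
    "fp32/model.onnx", "int8_matmul/model.onnx",
    op_types_to_quantize=["MatMul"],
    weight_type=QuantType.QInt8,
    use_external_data_format=True,
)

# Unsigned (QUInt8) weights instead of signed.
quantize_dynamic(
    "fp32/model.onnx", "int8_quint8/model.onnx",
    weight_type=QuantType.QUInt8,
    use_external_data_format=True,
)

# The "exclude LayerNorm" run would additionally pass nodes_to_exclude=[...]
# with the names of the decomposed LayerNorm nodes.
```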

## Related Models

| Model | Repo | Use Case |
|-------|------|----------|
| wav2arkit_cpu | myned-ai/wav2arkit_cpu | Safari/iOS A2E fallback (WASM, 404 MB) |
| SenseVoice ASR | omote-ai/sensevoice-asr | Speech recognition + emotion (WASM, 228 MB int8) |
| Silero VAD | deepghs/silero-vad-onnx | Voice activity detection (~2 MB) |

## License

Apache 2.0, inherited from aigc3d/LAM_Audio2Expression.

This repository contains ONNX export artifacts (surgical fp16 conversion, external data format). No model weights were modified beyond precision conversion. All credit for the model architecture and training belongs to the original authors.