SPEAR Large (speech)

This is the SPEAR Large single-domain (speech-only) model. It adopts a Zipformer backbone with 327M parameters, consisting of 11 Zipformer stacks, and produces 1024-dimensional representations at a frame rate of approximately 50 Hz. The model achieves state-of-the-art performance on the SUPERB benchmark.

The model was pre-trained on 84k hours of unlabelled English speech data drawn from the following datasets:

| Dataset | Duration (hours) |
|---|---|
| Libriheavy | 50,000 |
| Gigaspeech | 10,000 |
| VoxPopuli (en) | 24,000 |

Note: The model is pre-trained on speech data sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
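If your audio is stored at a different sampling rate, resample it before feeding it to the model. The snippet below is a minimal sketch using torchaudio; the file name is a placeholder:

import torch
import torchaudio

# Load a waveform from disk (replace the path with your own file).
waveform, sample_rate = torchaudio.load("example.wav")  # waveform: (channels, samples)

# Resample to the 16 kHz rate expected by the model, if necessary.
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)

# Average channels to mono and keep a batch dimension of 1: (1, samples).
audio = waveform.mean(dim=0, keepdim=True)
audio_len = torch.tensor([audio.shape[1]])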

Paper

Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland

Abstract

Self-Supervised Learning (SSL) excels at learning generic representations of acoustic signals, yet prevailing methods remain domain-specific, tailored to either speech or general audio, hindering the development of a unified representation model with a comprehensive capability over both domains. To address this, we present SPEAR (SPEech and Audio Representations), the first SSL framework to successfully learn unified speech and audio representations from a mixture of speech and audio data. SPEAR proposes a unified pre-training objective based on masked prediction of fine-grained discrete tokens for both speech and general audio. These tokens are derived from continuous speech and audio representations using a Multi-codebook Vector Quantisation (MVQ) method, retaining rich acoustic detail essential for modelling both speech and complex audio events. SPEAR is applied to pre-train both single-domain and unified speech-and-audio SSL models. Our speech-domain model establishes a new state-of-the-art on the SUPERB benchmark, a speech processing benchmark for SSL models, matching or surpassing the highly competitive WavLM Large on 12 out of 15 tasks with the same pre-training corpora and a similar model size. Crucially, our unified model learns complementary features and demonstrates comprehensive capabilities across two major benchmarks, SUPERB and HEAR, for evaluating audio representations. By further scaling up the model size and pre-training data, we present a unified model with 600M parameters that excels in both domains, establishing it as one of the most powerful and versatile open-source SSL models for auditory understanding.
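To give a rough intuition for the multi-codebook quantisation mentioned in the abstract, the toy sketch below assigns each frame-level feature one index per codebook by nearest-neighbour search, producing several discrete tokens per frame that could serve as masked-prediction targets. It is illustrative only and not the actual SPEAR tokenizer; the number of codebooks, the codebook size, and the random codebooks are all assumptions:

import torch

# Toy illustration of multi-codebook vector quantisation (MVQ); sizes are assumptions.
num_codebooks, codebook_size, feat_dim = 8, 256, 1024

# Randomly initialised codebooks; in practice these would be learned.
codebooks = torch.randn(num_codebooks, codebook_size, feat_dim)

# Dummy frame-level features, e.g. 500 frames of 1024-dim representations.
features = torch.randn(500, feat_dim)

# For each codebook, pick the nearest codeword for every frame.
tokens = []
for cb in codebooks:
    dists = torch.cdist(features, cb)      # (T, codebook_size) pairwise distances
    tokens.append(dists.argmin(dim=-1))    # (T,) discrete index per frame
tokens = torch.stack(tokens, dim=-1)       # (T, num_codebooks) tokens per frame

print(tokens.shape)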

Usage

This model is pre-trained purely using unlabelled data. Therefore, it requires fine-tuning with labelled data for downstream tasks such as automatic speech recognition (ASR).

The model achieves the following word error rates (WER, %) when fine-tuned on LibriSpeech for ASR:

| Fine-tuning data | test-clean | test-other |
|---|---|---|
| LS960 | 1.7 | 3.3 |

You can, however, extract its top-layer features (and intermediate hidden states) without fine-tuning using the following code:

from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "marcoyang/spear-large-speech", 
    trust_remote_code=True,
    force_download=False,
)
if torch.cuda.is_available():
    model = model.to("cuda")
model.eval()

device = next(model.parameters()).device
audio = torch.randn(1, 160000).to(device) # dummy audio input of 10 seconds
audio_len = torch.tensor([160000]).to(device)

with torch.no_grad():
    outputs = model(audio, audio_len)

encoder_out = outputs["encoder_out"] # (N,T,C)
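# for this 10 s, 16 kHz dummy input, expect roughly (1, ~500, 1024): 1024-dim features at ~50 Hz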
encoder_out_lens = outputs["encoder_out_lens"] # (N)
middle_out = outputs["hidden_states"] # list of (N,T,C)

print(encoder_out)
print(encoder_out_lens)
print(len(middle_out)) # 11 layers
print(middle_out[-1].shape)
print(middle_out[-1])
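For downstream fine-tuning (e.g. ASR, as noted above), a common pattern is to place a task-specific head on top of encoder_out and train it on labelled data. The snippet below is an illustrative sketch of a linear CTC head; the vocabulary size, blank index, and dummy targets are assumptions, not the recipe used to obtain the LibriSpeech results above:

import torch
import torch.nn as nn

device = encoder_out.device

# Illustrative only: a linear CTC head over the 1024-dim SPEAR features.
vocab_size = 500                     # placeholder; use your own tokenizer's vocabulary size
ctc_head = nn.Linear(1024, vocab_size).to(device)

log_probs = ctc_head(encoder_out).log_softmax(dim=-1)  # (N, T, vocab_size)

# Dummy targets; in practice these come from your labelled data.
targets = torch.randint(1, vocab_size, (1, 20), device=device)
target_lengths = torch.tensor([20], device=device)

# nn.CTCLoss expects log-probs shaped (T, N, vocab_size).
loss = nn.CTCLoss(blank=0)(
    log_probs.transpose(0, 1), targets, encoder_out_lens, target_lengths
)
print(loss)

Note that the forward pass above was run under torch.no_grad(); for real fine-tuning, run the encoder with gradients enabled so it can be updated together with the head.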