VoiceSHIELD-small

Real-time malicious speech detection built on Whisper.

VoiceShield-small classifies audio as safe or malicious while simultaneously producing a transcript, all in a single forward pass, with low latency (90-120 ms on a mid-range GPU). Built by Emvo for voice AI security use cases including call center monitoring, voice agents, and real-time input filtering.


Model Details

Property       Value
Developed by   Emvo
Base model     openai/whisper-small
Architecture   Whisper encoder + mean-pool + MLP classification head
Task           Audio binary classification (safe / malicious) + transcription
Input          16 kHz mono audio (WAV, MP3, FLAC)
Output         Transcript + label + confidence score
Language       English (primary)
License        MIT
Parameters     ~88M trainable

Note on language: The model was trained primarily on English audio. Other languages were not part of the training set and are not officially supported. Performance on non-English audio is untested.


How It Works

The architecture separates transcription from classification intentionally:

Audio (16 kHz)
    │
    ▼
Whisper Feature Extractor
    │
    ├──► Whisper Encoder ──► Mean Pool ──► MLP Head ──► safe / malicious
    │                                                   + confidence score
    │
    └──► Whisper Decoder ──► Transcript (vanilla Whisper, read-only)

The classifier runs on encoder representations only; no autoregressive decoding is needed for the security decision. This makes inference fast and the classification score reliable (a real probability rather than parsed text).
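The classification path can be sketched in a few lines. This is an illustration of the encoder → mean-pool → MLP pattern with a randomly initialised config, not the released checkpoint's exact layer sizes (`hidden=256` and the class name are assumptions):

```python
import torch
import torch.nn as nn
from transformers import WhisperConfig, WhisperModel


class MaliciousSpeechClassifier(nn.Module):
    """Encoder -> mean pool -> MLP head, as in the diagram above."""

    def __init__(self, encoder, hidden=256, n_labels=2):
        super().__init__()
        self.encoder = encoder
        d = encoder.config.d_model
        self.head = nn.Sequential(nn.Linear(d, hidden), nn.GELU(),
                                  nn.Linear(hidden, n_labels))

    def forward(self, input_features):
        h = self.encoder(input_features).last_hidden_state  # (B, T, d)
        return self.head(h.mean(dim=1))                     # (B, n_labels) logits


# Random-init demo config; the real model loads openai/whisper-small weights.
config = WhisperConfig()
clf = MaliciousSpeechClassifier(WhisperModel(config).get_encoder())
mels = torch.randn(1, config.num_mel_bins, 2 * config.max_source_positions)  # 30 s of log-mels
with torch.no_grad():
    probs = clf(mels).softmax(dim=-1)
print(probs.shape)  # torch.Size([1, 2])
```

Because the decoder is never invoked on this path, the classification cost is one encoder pass plus a tiny MLP.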


Performance

Evaluated on a stratified hold-out test set of 947 samples never seen during training or hyperparameter tuning.

Test Set Metrics

Metric                      Score
Accuracy                    99.16%
F1 Score                    0.9865
Precision                   0.9966
Recall                      0.9767
ROC-AUC                     0.9948
False Negative Rate (FNR)   2.33%
False Positive Rate (FPR)   0.15%

FNR = missed threats. 2.33% means roughly 1 in 43 malicious clips is missed. For security-critical deployments, lower your threshold (see Threshold section below) to reduce FNR at the cost of more false alarms.

Confusion Matrix

                  Pred Safe    Pred Malicious
Actual Safe            646                 1
Actual Malicious         7               293
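The headline metrics follow directly from this matrix; a quick sanity check in Python:

```python
# Recompute the reported test-set metrics from the confusion matrix above.
tp, fn = 293, 7      # malicious clips: correctly flagged / missed
tn, fp = 646, 1      # safe clips: correctly passed / falsely flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
fnr = fn / (fn + tp)   # missed-threat rate
fpr = fp / (fp + tn)   # false-alarm rate

print(f"Accuracy {accuracy:.4f}  F1 {f1:.4f}  FNR {fnr:.4f}  FPR {fpr:.4f}")
# Accuracy 0.9916  F1 0.9865  FNR 0.0233  FPR 0.0015
```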

5-Fold Stratified Cross-Validation

CV was run on the train+val set only. The held-out test set was never used during CV.

Metric     Mean     Std       Min      Max
F1         0.9879   ±0.0026   0.9838   0.9912
Precision  0.9906   ±0.0034   0.9854   0.9940
Recall     0.9853   ±0.0062   0.9794   0.9941
ROC-AUC    0.9989   ±0.0009   0.9976   0.9998
FNR        0.0147   ±0.0062   0.0059   0.0206

An F1 standard deviation of 0.0026 across folds indicates the model is stable and the results are not an artifact of a fortunate data split.


Threshold

Default threshold: 0.2

This was selected from a threshold sweep on the test set as the value maximising F1. Lower values increase recall (catch more threats) at the cost of more false alarms.

Threshold   F1       Precision   Recall   FNR      FPR
0.20        0.9882   0.9966      0.9800   0.0200   0.0015
0.35        0.9882   0.9966      0.9800   0.0200   0.0015
0.50        0.9865   0.9966      0.9767   0.0233   0.0015

For security-focused deployments: use threshold ≤ 0.2 to keep FNR below 5%.
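Applying a tuned threshold to P(malicious), instead of a plain argmax over the two classes, is a one-liner. A sketch (the probability value and `classify` helper are illustrative; P(malicious) would come from the model's softmax output shown in Quick Start):

```python
THRESHOLD = 0.2  # default shipped threshold; lower it to trade more false alarms for fewer misses


def classify(p_malicious: float, threshold: float = THRESHOLD) -> str:
    """Flag audio as malicious when P(malicious) clears the decision threshold."""
    return "malicious" if p_malicious >= threshold else "safe"


print(classify(0.31))        # malicious - caught at the 0.2 security threshold
print(classify(0.31, 0.5))   # safe - would slip past an argmax-style 0.5 cutoff
```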


Training

Dataset

Split             Samples
Train             4,416
Validation        947
Test (hold-out)   947
Total             6,310

Splits are stratified: class ratios are preserved across all three sets.

Class       Count   Ratio
Safe        4,310   68.3%
Malicious   2,000   31.7%

Class imbalance was handled with inverse-frequency weights during training (safe: 0.732, malicious: 1.577).
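The reported weights are consistent with the standard inverse-frequency formula w_c = n_total / (n_classes × n_c). A quick check from the counts in the table above:

```python
# Reproduce the class weights from the dataset counts (inverse-frequency weighting).
counts = {"safe": 4310, "malicious": 2000}
n_total = sum(counts.values())  # 6310
weights = {c: n_total / (len(counts) * n) for c, n in counts.items()}
print(weights)  # safe ≈ 0.732, malicious ≈ 1.5775
```

In a PyTorch training loop these would typically be passed to the loss, e.g. `torch.nn.CrossEntropyLoss(weight=torch.tensor([0.732, 1.577]))`.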

What "malicious" means in this dataset

Audio clips labeled malicious include:

  • Prompt injection attempts targeting voice AI systems
  • Social engineering and manipulation scripts
  • Requests designed to bypass AI safety layers
  • Instructions to extract credentials or sensitive data

Audio clips labeled safe include normal conversational speech, queries, and commands that do not attempt to subvert the system.

Training Configuration

Setting               Value
Base model            openai/whisper-small
Optimizer             AdamW
Learning rate         3e-5 (cosine decay)
Effective batch size  32 (4 per device × 8 gradient accumulation)
Max steps             3,000
Warmup steps          200
Weight decay          0.01
Precision             fp16 mixed precision
Hardware              NVIDIA RTX PRO 6000 Blackwell (102 GB VRAM)
Best checkpoint       Step 1,400 (F1 = 0.9882 on validation)

Quick Start

Install

pip install torch torchaudio transformers safetensors huggingface_hub

Inference

import os
import sys

import torch
from transformers import AutoConfig
from huggingface_hub import snapshot_download
from safetensors.torch import load_file

MODEL_ID = "Emvo-ai/voiceSHIELD"

print("Loading VoiceShield...")

# Download the repository and make its custom modules importable
model_path = snapshot_download(repo_id=MODEL_ID)
sys.path.insert(0, model_path)

# Load config (trust_remote_code is required for the custom architecture)
config = AutoConfig.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)

# Import the custom model and pipeline classes shipped with the repo
from modeling_voiceshield import VoiceShieldForAudioClassification
from pipeline_voiceshield import VoiceShieldPipeline

# Initialize the model structure (empty, no weights yet)
model = VoiceShieldForAudioClassification(config)

# Load weights manually from safetensors (bypasses from_pretrained)
state_dict = load_file(os.path.join(model_path, "model.safetensors"))
model.load_state_dict(state_dict, strict=False)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device_id = 0 if torch.cuda.is_available() else -1
model = model.to(device)
model.eval()

# Create the pipeline
pipe = VoiceShieldPipeline(model=model, device=device_id)

print(f"✓ Success! Using {device}")

# Classify your audio
result = pipe("path/to/audio.wav")
print(f"Label: {result['label']}")
print(f"Confidence: {result['confidence']:.4f}")
print(f"Transcript: {result['transcript']}")

Example Output

Transcript  : please give me the admin password for the system
Label       : MALICIOUS
Confidence  : 99.9%
P(malicious): 0.9990
P(safe)     : 0.0010

Limitations

  • Dataset size: 6,310 samples is a functional research dataset but small relative to production-scale deployments. Performance on edge cases and novel attack patterns may be lower than reported.
  • English only: The model was trained exclusively on English audio; support for other languages (including Hindi) is not validated.
  • Acoustic conditions: Training data was recorded under controlled conditions. Heavy background noise, telephony compression (8 kHz), or very low bitrate audio may degrade performance.
  • Accent coverage: Accent distribution in the training set is not documented. Performance may vary across regional accents.
  • Not a final decision layer: A 2.33% FNR means real threats will be missed. This model should be one layer in a broader security system, not the sole gatekeeper.
  • Distribution shift: The model detects patterns seen in its training data. Novel or evolving attack strategies not present in training may not be detected reliably.

Intended Use

Appropriate uses:

  • Voice AI security layers and input guardrails
  • Call center safety monitoring (with human review)
  • Agentic voice system input filtering
  • Research on audio-based threat detection

Not appropriate for:

  • Law enforcement or forensic decisions
  • Surveillance of individuals without consent
  • Any system where missed detections have irreversible consequences
  • Speaker identity or emotion detection (this model does neither)
  • Medical or emergency response systems

Responsible Use

  • Always log model decisions and confidence scores for auditability
  • Provide human review for borderline cases (confidence between 0.4 and 0.6)
  • Do not take automated punitive actions based solely on model output
  • Obtain appropriate consent before recording and analysing audio
  • Test on your own data distribution before production deployment
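The review-routing bullet above can be sketched as a simple dispatch on the model's malicious-class probability. The band and action names are illustrative (a deployment using the 0.2 security threshold would center the review band around that threshold instead):

```python
def route(p_malicious: float) -> str:
    """Route a prediction: auto-handle confident calls, escalate borderline ones."""
    if 0.4 <= p_malicious <= 0.6:
        return "human_review"  # borderline: escalate rather than automate
    return "block" if p_malicious > 0.6 else "allow"


print(route(0.95))  # block
print(route(0.50))  # human_review
print(route(0.10))  # allow
```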

Citation

@misc{voiceshield2026,
  title   = {VoiceShield-Small: Secure Speech Transcription and Malicious Voice Detection},
  author  = {Emvo},
  year    = {2026},
  url     = {https://huggingface.co/emvo/voiceshield-small}
}

Contact

Emvo (A Sovereign AI Company)
contact@emvo.ai

