After running extensive benchmarks across ASR, TTS, and VAD on Apple Silicon, we found some results that weren't documented anywhere.
The most counterintuitive: INT8 runs 3.3x faster than INT4 on the Neural Engine. A 332 MB CoreML model allocates 1,677 MB at runtime, about 5x its size. And the right architecture uses both MLX and CoreML at the same time, not one or the other.
MLX talks to the GPU — programmable, fast for large transformer inference. CoreML talks to the Neural Engine — fixed-function silicon, 135x real-time for small feedforward models like VAD, near-zero power draw.
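A minimal sketch of that split, for illustration only. The `Backend` and `ModelKind` enums and both functions are hypothetical names, not speech-swift's API. The only real API used is Core ML's `MLModelConfiguration.computeUnits`, which expresses a preference for the Neural Engine; Core ML may still run some layers on the CPU.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

// Hypothetical types for illustration; not part of speech-swift.
enum Backend { case mlxGPU, coreMLNeuralEngine }
enum ModelKind { case largeTransformer, smallFeedforward }

/// Route by model shape, following the split described in the post:
/// large transformers go to MLX on the GPU, small feedforward models
/// (such as VAD) go to Core ML on the Neural Engine.
func preferredBackend(for kind: ModelKind) -> Backend {
    switch kind {
    case .largeTransformer: return .mlxGPU
    case .smallFeedforward: return .coreMLNeuralEngine
    }
}

#if canImport(CoreML)
/// Load a compiled Core ML model (.mlmodelc) and ask Core ML to prefer
/// the Neural Engine. This is a preference, not a guarantee.
@available(macOS 13.0, iOS 16.0, tvOS 16.0, watchOS 9.0, *)
func loadOnNeuralEngine(compiledModelAt url: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine
    return try MLModel(contentsOf: url, configuration: config)
}
#endif
```

Because the two backends target different silicon, a pipeline along these lines could keep a VAD model resident on the Neural Engine while the GPU handles ASR or TTS.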
All benchmarks are from speech-swift, our open-source Swift library for on-device speech AI: ASR, TTS, VAD, diarization, speech-to-speech — everything running locally on Apple Silicon with no API, no cloud, no data leaving the device.
Models on HF: aufklarer/Qwen3-ASR-0.6B-MLX-4bit · aufklarer/parakeet-tdt-0.6b-coreml-int8 · aufklarer/PersonaPlex-7B-MLX-4bit
Full article: https://blog.ivan.digital
Library: https://github.com/soniqo/speech-swift