patrickvonplaten committed
Commit b3de5c8 · verified · 1 Parent(s): f74a921

Update README.md

Files changed (1):
  1. README.md +11 -6
README.md CHANGED
@@ -118,7 +118,7 @@ Voxtral Mini 4B Realtime is competitive to leading offline models and shows sign
 
 The model can also be deployed with the following libraries:
 - [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
-- [`transformers (WIP)`](https://github.com/huggingface/transformers): See [here](#transformers)
+- [`transformers`](https://github.com/huggingface/transformers): See [here](#transformers)
 - *Community Contributions*: See [here](#community-contributions-untested)
 
 ### vLLM (recommended)
@@ -214,20 +214,25 @@ Make sure to have `mistral-common` installed with audio dependencies:
 pip install --upgrade "mistral-common[audio]"
 ```
 
+#### Usage
+
 ```python
-import torch
 from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
-from datasets import load_dataset
+from mistral_common.tokens.tokenizers.audio import Audio
+from huggingface_hub import hf_hub_download
 
 repo_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
 
 processor = AutoProcessor.from_pretrained(repo_id)
 model = VoxtralRealtimeForConditionalGeneration.from_pretrained(repo_id, device_map="auto")
 
-ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
-audio = ds[0]["audio"]["array"]
+repo_id = "patrickvonplaten/audio_samples"
+audio_file = hf_hub_download(repo_id=repo_id, filename="bcn_weather.mp3", repo_type="dataset")
+
+audio = Audio.from_file(audio_file, strict=False)
+audio.resample(processor.feature_extractor.sampling_rate)
 
-inputs = processor(audio, return_tensors="pt")
+inputs = processor(audio.audio_array, return_tensors="pt")
 inputs = inputs.to(model.device, dtype=model.dtype)
 
 outputs = model.generate(**inputs)
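The key new step in the updated snippet is resampling the loaded audio to `processor.feature_extractor.sampling_rate` before feature extraction. As a rough illustration of what resampling does (a toy linear-interpolation stand-in, not `mistral_common`'s actual `Audio.resample` implementation, which uses proper signal-processing filters):

```python
# Toy linear-interpolation resampler, for illustration only.
# Real resamplers low-pass filter to avoid aliasing; this one does not.
def resample(samples, src_rate, dst_rate):
    if src_rate == dst_rate:
        return list(samples)
    # Number of output samples after the rate change.
    n_out = int(round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        # Position of output sample i on the source sample axis.
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Linearly interpolate between the two nearest source samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Halving the rate (32 kHz -> 16 kHz) roughly halves the sample count.
signal = [0.0, 1.0, 0.0, -1.0]
resampled = resample(signal, 32000, 16000)
print(len(resampled))  # 2
```

This is only meant to show why the resample call sits before `processor(...)`: the feature extractor expects audio at its own sampling rate, so the waveform's sample count must be rescaled to match.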