French AI developer Mistral AI has launched two new transcription models designed to run directly on user devices, prioritizing privacy and speed. The models, Voxtral Mini Transcribe 2 and Voxtral Realtime, keep sensitive conversations off the internet while enabling quick, accurate transcription without relying on cloud servers.
Mistral AI announced its latest transcription models on Wednesday, focusing on on-device processing to enhance user privacy. These tools are particularly suited for sensitive scenarios, such as discussions with doctors, lawyers, or journalistic interviews, where data security is paramount.
Voxtral Mini Transcribe 2 is described as "super, super small" by Pierre Stock, Mistral's vice president of science operations. This compactness allows it to run on phones, laptops, or even wearables like smartwatches, eliminating the need to send audio to remote data centers. The second model, Voxtral Realtime, supports live transcription akin to closed captioning, with latency under 200 milliseconds, fast enough to keep pace with reading speed rather than lagging behind by two or three seconds.
Stock emphasized the benefits of edge computing: "What you want is the transcription to happen super, super close to you. And the closest we can find to you is any edge device, so a laptop, a phone, a wearable like a smartwatch, for instance." By processing locally, the models reduce latency and protect privacy, as conversations never leave the device.
Both models support 13 languages and are available via Mistral's API, Hugging Face, or the company's AI Studio. In testing, Voxtral Realtime quickly and accurately transcribed English mixed with some Spanish, though it occasionally stumbled on proper names, rendering "Mistral AI" as "Mr. Lay Eye" and "Voxtral" as "VoxTroll." Stock noted that users can customize the models to better handle specific jargon or names.
Mistral highlighted benchmark results showing lower error rates than competing models. As Stock explained, "It's not enough to say, OK, I'll make a small model. What you need is a small model that has the same quality as larger models, right?" That balance of size, speed, and accuracy positions the models as a step forward in accessible AI transcription.