Source URL: https://simonwillison.net/2024/Oct/1/whisper-large-v3-turbo-model/#atom-everything
Source: Simon Willison’s Weblog
Title: Whisper large-v3-turbo model
Feedly Summary: Whisper large-v3-turbo model
It’s OpenAI DevDay today. Last year they released a whole stack of new features, including GPT-4 vision and GPTs and their text-to-speech API, so I’m intrigued to see what they release today (I’ll be at the San Francisco event).
Looks like they got an early start on the releases, with the first new Whisper model since November 2023.
Whisper Turbo is a new speech-to-text model that fits the continued trend of distilled models getting smaller and faster while maintaining the same quality as larger models.
large-v3-turbo is 809M parameters – slightly larger than the 769M medium but significantly smaller than the 1550M large. OpenAI claim it's 8x faster than large and requires 6GB of VRAM compared to 10GB for the larger model.
The model file is a 1.6GB download.
It’s already supported in both Hugging Face transformers – live demo here – and in mlx-whisper on Apple Silicon, via Awni Hannun:
import mlx_whisper

print(mlx_whisper.transcribe(
    "path/to/audio",
    path_or_hf_repo="mlx-community/whisper-turbo",
)["text"])
Awni reports:
Transcribes 12 minutes in 14 seconds on an M2 Ultra (~50X faster than real time).
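The Hugging Face transformers route is similarly short. A minimal sketch, assuming the `openai/whisper-large-v3-turbo` checkpoint name and a local audio file path (both are assumptions, not taken from the post):

```python
# Sketch: transcribing with the transformers ASR pipeline.
# Assumes the openai/whisper-large-v3-turbo repo name and that
# "path/to/audio.mp3" exists locally.
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
)

result = transcriber("path/to/audio.mp3")
print(result["text"])
```

The first call downloads the model weights (the 1.6GB file mentioned above), so expect a one-time delay before transcription starts.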
Tags: openai, whisper, ai
AI Summary and Description: Yes
Summary: The text discusses the release of OpenAI’s Whisper Turbo speech-to-text model at a recent DevDay. This new model, designed to be smaller and faster while maintaining quality, highlights trends in AI model development and has significant implications for AI security and efficiency in voice data processing.
Detailed Description: The excerpt provides insight into OpenAI’s latest advancements in speech recognition technology with the Whisper Turbo model. This development is particularly relevant for professionals engaged in AI and AI security due to its implications for processing and handling voice data securely and efficiently.
– **Model Details:**
– Whisper Turbo is a speech-to-text model that represents a significant evolution in OpenAI’s technology.
– It features 809 million parameters, placing it between the 769M medium and the 1550M large models.
– The model is claimed to be 8 times faster than the large model while requiring 6GB of VRAM rather than 10GB, enhancing its utility and accessibility, particularly in resource-limited environments.
– **Performance Metrics:**
– The model’s operational efficiency is emphasized: 12 minutes of audio transcribed in just 14 seconds on high-performance hardware (an M2 Ultra), roughly 50 times faster than real time.
– This showcases not only the model’s speed but also indicates potential applications in environments needing quick turnaround times for transcribing spoken language.
– **Deployment and Integration:**
– The model is compatible with Hugging Face transformers, making it readily accessible for developers and organizations looking to integrate advanced speech recognition into their applications.
– The brevity of the example code snippet shows how little is needed to run a transcription, which should ease adoption across sectors.
Overall, the release of the Whisper Turbo model reflects ongoing trends in the AI field towards developing more efficient, faster models that retain high quality. The enhancements present several security and compliance opportunities, particularly related to AI data handling, which professionals must consider when deploying these technologies.