Speech-to-text with Whisper: timestamping, streaming, and parallelism, oh-my! - Launch Week 2 - Day 2 - Livebook.dev

When we announced Bumblebee, a collection of pre-trained models inspired by Hugging Face Transformers, the Whisper speech-to-text model quickly became one of the favorite and most used models within the Elixir community.

Thanks to advancements in the overall Numerical Elixir ecosystem, Livebook v0.11 includes a highly improved integration with Whisper, which we will detail in this article.

If you want to skip ahead and give it a try, install Livebook and start a new notebook. Then click “+ Smart cell” and choose “Neural Network task.” You will find Whisper as Speech-to-text under Audio.

New features

There are three new features in our Whisper integration:

Timestamping: we now include timestamps on audio segments.
Streaming: our previous version of Whisper was limited to 30 seconds of audio, leaving it up to users to break their audio apart. This new version is capable of streaming both inputs and outputs. You can give arbitrarily long files to the model, which will be streamed as input, and the model will proceed to merge and stream transcriptions as they arrive.
Parallelism: in addition to streaming, files longer than 30 seconds will be split into 30-second chunks and batched according to the Neural Network batch size. For example, with a batch size of 10, up to 5 minutes of audio (10 chunks of 30 seconds) can be processed in parallel. Thanks to this, we expect our models to perform inference an order of magnitude faster than OpenAI’s implementation when transcribing larger files on the GPU.
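To make the batching arithmetic concrete, here is a minimal sketch in plain Elixir. It is not Bumblebee’s actual implementation: the `ChunkSketch` module and its functions are hypothetical names made up for illustration, and the sketch ignores any overlap between neighboring chunks that a real implementation may use when merging transcriptions.

```elixir
defmodule ChunkSketch do
  # Hypothetical helper for illustration only; not part of Bumblebee or Livebook.

  # Splits a duration (in seconds) into consecutive {start, stop} windows
  # of at most `chunk_seconds` each.
  def windows(total_seconds, chunk_seconds \\ 30)
      when total_seconds > 0 and chunk_seconds > 0 do
    0
    |> Stream.iterate(&(&1 + chunk_seconds))
    |> Enum.take_while(&(&1 < total_seconds))
    |> Enum.map(fn start -> {start, min(start + chunk_seconds, total_seconds)} end)
  end

  # Groups windows into batches; each batch could be sent to the model at once.
  def batches(windows, batch_size) when batch_size > 0 do
    Enum.chunk_every(windows, batch_size)
  end
end

# A 7-minute file with a batch size of 10:
total_seconds = 7 * 60
batches = total_seconds |> ChunkSketch.windows() |> ChunkSketch.batches(10)

# The first batch holds 10 windows covering the first 5 minutes,
# the second batch holds the remaining 4 windows.
IO.inspect(Enum.map(batches, &length/1))
```

Under these assumptions, a 7-minute file yields 14 windows, so a batch size of 10 needs two passes through the model instead of 14 sequential ones.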

Of course, all of these features work together, providing a delightful experience, such as transcribing one of the Thinking Elixir episodes on the fly.

When you combine the features above with Nx’s ability to run neural networks distributed across multiple machines and GPUs, Elixir developers now have a first-class, state-of-the-art, speech-to-text model ready to run, enjoy, and scale.

What now?

Try it for yourself!

Transcribe an audio file using our built-in Neural Network Task Smart cell. Maybe start with a small file to quickly see the result. Then you can try a bigger one.
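If you prefer to work with code directly, the sketch below shows roughly how the three features map onto Bumblebee options. Treat it as an illustration under assumptions rather than the exact code the Smart cell generates: the version constraints, the `openai/whisper-tiny` checkpoint, the batch size, and the `episode.mp3` path are all placeholders, and reading from a file path assumes `ffmpeg` is installed on your machine.

```elixir
# Version constraints are assumptions; adjust them to what your setup uses.
Mix.install([
  {:bumblebee, "~> 0.4"},
  {:exla, "~> 0.6"}
])

# Placeholder checkpoint; larger Whisper checkpoints load the same way.
repo = {:hf, "openai/whisper-tiny"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Audio.speech_to_text_whisper(model_info, featurizer, tokenizer, generation_config,
    # Parallelism: split long audio into 30-second chunks, batched 10 at a time.
    chunk_num_seconds: 30,
    compile: [batch_size: 10],
    # Timestamping: attach start and end times to each segment.
    timestamps: :segments,
    # Streaming: emit transcription chunks as they become available.
    stream: true,
    defn_options: [compiler: EXLA]
  )

# "episode.mp3" is a hypothetical path; point it at your own audio file.
serving
|> Nx.Serving.run({:file, "episode.mp3"})
|> Enum.each(fn chunk ->
  IO.puts("[#{chunk.start_timestamp_seconds}-#{chunk.end_timestamp_seconds}] #{chunk.text}")
end)
```

Starting with a small checkpoint and a short file keeps the first run quick; you can then raise the batch size or switch checkpoints once everything works.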

Download the latest Livebook version and have fun!
