The Livebook Blog

releases, launch week
a year ago

Speech-to-text with Whisper: timestamping, streaming, and parallelism, oh-my! - Launch Week 2 - Day 2

When we announced Bumblebee, a collection of pre-trained models inspired by Hugging Face Transformers, the Whisper speech-to-text model quickly became one of the most popular and widely used models in the Elixir community.

Thanks to advancements across the Numerical Elixir ecosystem, Livebook v0.11 includes a much improved integration with Whisper, which we detail in this article.

If you want to skip ahead and give it a try, install Livebook and start a new notebook. Then click “+ Smart cell” and choose “Neural Network task.” You will find Whisper as Speech-to-text under Audio.


New features

There are three new features in our Whisper integration:

  1. Timestamping: we now include timestamps on audio segments.
  2. Streaming: our previous Whisper integration was limited to 30 seconds of audio, leaving it up to users to split longer audio themselves. The new version streams both inputs and outputs. You can give the model arbitrarily long files, which are streamed as input, and the model merges and streams the transcriptions as they arrive.
  3. Parallelism: in addition to streaming, files longer than 30 seconds are split into chunks and batched according to the Neural Network batch size. For example, with a batch size of 10, up to 5 minutes of audio (10 chunks of 30 seconds each) can be processed in parallel. Thanks to this, we expect our models to perform inference an order of magnitude faster than OpenAI’s implementation when transcribing larger files on the GPU.
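
To make the batching arithmetic concrete, here is a minimal sketch in Python. It is not how Bumblebee or Livebook implement chunking: the function names (`split_into_chunks`, `group_into_batches`, `seconds_per_batch`) are hypothetical, and the sketch assumes fixed, non-overlapping 30-second chunks, whereas a real implementation may overlap adjacent chunks to merge transcriptions cleanly.

```python
from typing import List, Tuple

# Assumption: 30-second windows, matching the limit mentioned in the post.
CHUNK_SECONDS = 30

Chunk = Tuple[float, float]  # (start, end) offsets in seconds


def split_into_chunks(duration_s: float, chunk_s: float = CHUNK_SECONDS) -> List[Chunk]:
    """Cover the whole audio with consecutive, non-overlapping chunks."""
    chunks: List[Chunk] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, float(duration_s))
        chunks.append((start, end))
        start = end
    return chunks


def group_into_batches(chunks: List[Chunk], batch_size: int) -> List[List[Chunk]]:
    """Group chunks so that each batch can be processed in one parallel step."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


def seconds_per_batch(batch_size: int, chunk_s: float = CHUNK_SECONDS) -> float:
    """Audio covered by one full batch, e.g. 10 * 30 s = 300 s = 5 minutes."""
    return batch_size * chunk_s


if __name__ == "__main__":
    # A hypothetical 12-minute recording with a batch size of 10.
    chunks = split_into_chunks(12 * 60)
    batches = group_into_batches(chunks, batch_size=10)
    print(len(chunks), [len(batch) for batch in batches], seconds_per_batch(10))
```

Under these assumptions, a 12-minute file yields 24 chunks, which a batch size of 10 groups into three batches (10, 10, and 4 chunks), and each full batch covers the 5 minutes of audio mentioned above.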

Of course, all of these features work together, providing a delightful experience, as you can see below, where we transcribe one of the Thinking Elixir episodes on the fly:

When you combine the features above with Nx’s ability to run neural networks distributed across multiple machines and GPUs, Elixir developers now have a first-class, state-of-the-art, speech-to-text model ready to run, enjoy, and scale.

What now?

Try it for yourself!

Transcribe an audio file using our built-in Neural Network Task Smart cell. Maybe start with a small file to quickly see the result. Then you can try a bigger one.

Download the latest Livebook version and have fun!

More of Launch Week 2

  • Day 1: Remote execution Smart cell
  • Day 3: Introducing File Integration
  • Day 4: Integration with Snowflake and Microsoft SQL Server
  • Day 5: Vim and Emacs key bindings