Speech to Text / Audio Transcription

The need: record a meeting => get a transcript => produce a summary (MAIA / BPCE, ChatGPT, Gemini, …)

https://openai.com/index/whisper/

Set up Your Environment

For this demonstration, I’m running Ubuntu under WSL on Windows. The instructions are the same on a native Ubuntu install. I have yet to try this on a Mac, but I will.

The first thing you do, of course, is update the system.

sudo apt update
sudo apt upgrade

Now, you will need one base package installed on the system for this to work: FFmpeg, which Whisper uses to read and decode audio files. You can install it with:

sudo apt install ffmpeg
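
You can confirm FFmpeg landed on your PATH before moving on:

ffmpeg -version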

You should be good to go. Let’s create a Python environment:

mkdir whispertest && cd whispertest
python3 -m venv whispertest
source whispertest/bin/activate

Remember, you should see the environment name to the left of your prompt.
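
If you want to be doubly sure, check which interpreter is active; it should point inside the whispertest folder:

which python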

Then, we’ll need setuptools-rust, which some of Whisper’s dependencies use at build time if no prebuilt wheel is available for your platform:

pip install setuptools-rust

Note: If you have an NVIDIA GPU

You’ll need the NVIDIA drivers installed for GPU acceleration to work; without them, Whisper falls back to the much slower CPU.

You can verify they’re installed correctly by typing:

nvidia-smi

If the drivers are installed correctly, you’ll see a status table with your GPU model, driver version, and CUDA version.
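
Once Whisper is installed (next section), its PyTorch dependency gives you a second way to check. A quick sketch:

import torch

# True means PyTorch can see a CUDA-capable GPU; Whisper will pick it up automatically
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # prints the GPU model name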

Install Whisper

Whisper runs as an executable within your Python environment. It’s pretty cool.

The best way to install it is:

pip install -U openai-whisper

But you can also pull the latest version straight from the repository if you like:

pip install git+https://github.com/openai/whisper.git

Either way, it will install a bunch of packages, so go get some ice water. When it’s done, the whisper executable will be installed.
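
To confirm the install worked, ask the new executable for its help text:

whisper --help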

I recorded a sample file. The general form of the command is:

whisper <one or more audio files> --model <model size>
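
A few flags come in handy here (the file name below is just a placeholder): --language skips auto-detection, --output_format controls which transcript files get written, and --output_dir says where to put them:

whisper meeting.wav --model small --language French --output_format txt --output_dir transcripts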

Here’s a list of the available models:

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x

I’ll start with the smallest model and see its accuracy, then work my way up if needed.

Here’s the command I ran to transcribe my sample file:

whisper sample-audio.wav --model tiny

And lucky for me, it was transcribed perfectly. The transcript prints to the terminal as it’s generated, and Whisper also writes transcript files (.txt, .srt, .vtt, and more) to the working directory.

Your results may vary. If you don’t like the output, you can always step up to a larger model, which will use more memory and take longer to run.

So, what else can you do with this tool?

Building a Cool Python Script

Whisper has a bunch of cool features that I don’t use, like translation! But what if we want to script this stuff, like processing 100 audio files or something? Building a Python script to run it is easy.

Here’s a script straight from the GitHub page:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

And when I run it, it shows clean text output.
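
The result dictionary holds more than plain text, by the way. Continuing the same script, you can peek at the segment-level timestamps Whisper returns:

for segment in result["segments"]:
    # each segment carries start/end offsets in seconds plus its text
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text'].strip()}")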

You can, of course, write this to its own text file:

with open("output.txt", "w", encoding="utf-8") as file:
    file.write(result["text"])
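
And to cover the 100-file scenario from earlier, here’s a minimal batch sketch. The recordings folder name and the .wav glob are just assumptions for the example, so adjust them to your layout:

import whisper
from pathlib import Path

model = whisper.load_model("base")  # load the model once and reuse it for every file

for audio_path in sorted(Path("recordings").glob("*.wav")):
    result = model.transcribe(str(audio_path))
    out_path = audio_path.with_suffix(".txt")  # write sample.wav -> sample.txt
    out_path.write_text(result["text"])
    print(f"Transcribed {audio_path} -> {out_path}")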

There are tons of options available. It can also transcribe other languages, and even translate them to English.
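
For example, for the French meeting recordings this all started with, you can pin the source language or have Whisper translate to English as it transcribes. The language and task arguments are standard transcribe() options; reunion.mp3 is a made-up file name:

import whisper

model = whisper.load_model("small")

# transcribe French audio as French text
result = model.transcribe("reunion.mp3", language="fr")
print(result["text"])

# or translate the speech into English while transcribing
result = model.transcribe("reunion.mp3", task="translate")
print(result["text"])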