{"id":5423,"date":"2025-06-05T19:04:25","date_gmt":"2025-06-05T17:04:25","guid":{"rendered":"http:\/\/www.blue-bears.com\/blog\/?p=5423"},"modified":"2025-10-16T19:20:17","modified_gmt":"2025-10-16T17:20:17","slug":"speech-to-text-transcription-audio","status":"publish","type":"post","link":"http:\/\/www.blue-bears.com\/blog\/?p=5423","title":{"rendered":"Speech to Text \/ Transcription Audio \/ Whisper"},"content":{"rendered":"<p>Le besoin =&gt; enregistrement de r\u00e9union =&gt; R\u00e9cup\u00e9rer un transcript =&gt; faire une synth\u00e8se (MAIA \/ BPCE ou\u00a0 Chat GPT ou Gemini ou &#8230;)<\/p>\n<ul>\n<li>Diff\u00e9rentes solutions :\n<ul>\n<li>Passer par de la reconnaissance vocale en ligne (Ps confidentialit\u00e9) =&gt; Open AI et Google ont des API pour cela (Token requis?)\n<ul>\n<li><a href=\"https:\/\/jeanviet.fr\/whisper\/\">https:\/\/jeanviet.fr\/whisper\/<\/a>\n<ul>\n<li><a href=\"https:\/\/colab.research.google.com\/drive\/1VE5UEn_dyH_e89Epxoph4kZeHrNvRkK5#scrollTo=MMdH4A5CQWtf\">https:\/\/colab.research.google.com\/drive\/1VE5UEn_dyH_e89Epxoph4kZeHrNvRkK5#scrollTo=MMdH4A5CQWtf<\/a><\/li>\n<\/ul>\n<\/li>\n<li><\/li>\n<\/ul>\n<\/li>\n<li>Passer par la reconnaissance en local =&gt; Installer OpenAI Whisper\n<ul>\n<li><a href=\"https:\/\/www.jeremymorgan.com\/tutorials\/generative-ai\/how-to-transcribe-audio\/\">https:\/\/www.jeremymorgan.com\/tutorials\/generative-ai\/how-to-transcribe-audio\/<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Install OPEN AI WISPER sous Linux (ce serveur)<\/li>\n<\/ul>\n<p><a href=\"https:\/\/openai.com\/index\/whisper\/\">https:\/\/openai.com\/index\/whisper\/<\/a><\/p>\n<h2 id=\"set-up-your-environment\">Set up Your Environment<\/h2>\n<p>For this demonstration, I\u2019m running Ubuntu under WSL in Windows. The instructions for setting it up in Ubuntu proper are the same. 
I have yet to try this on a Mac, but I will.<\/p>\n<p>The first thing you do, of course, is update the system.<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-fallback\" data-lang=\"fallback\"><span class=\"line\"><span class=\"cl\">sudo apt update\r\n<\/span><\/span><span class=\"line\"><span class=\"cl\">sudo apt upgrade\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>Now, you will need some base packages installed on the system for this to work. Mainly FFmpeg, which can be installed with this:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-fallback\" data-lang=\"fallback\"><span class=\"line\"><span class=\"cl\">sudo apt install ffmpeg\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>You should be good to go. Let\u2019s create a Python environment:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-fallback\" data-lang=\"fallback\"><span class=\"line\"><span class=\"cl\">mkdir whispertest &amp;&amp; cd whispertest\r\n<\/span><\/span><span class=\"line\"><span class=\"cl\">python3 -m venv whispertest\r\n<\/span><\/span><span class=\"line\"><span class=\"cl\">source whispertest\/bin\/activate\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>Remember, you should see the environment name to the left of your prompt:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/www.jeremymorgan.com\/images\/tutorials\/generative-ai\/how-to-transcribe-audio\/how-to-transcribe-audio-00.webp\" alt=\"\u201cHow to Transcribe Audio to Text Python\u201d\" width=\"545\" height=\"65\" \/><\/p>\n<p>Then, we\u2019ll need to install the Rust setup tools:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-gdscript3\" data-lang=\"gdscript3\"><span class=\"line\"><span class=\"cl\"><span class=\"n\">pip<\/span> <span class=\"n\">install<\/span> <span class=\"n\">setuptools<\/span><span 
class=\"o\">-<\/span><span class=\"n\">rust<\/span>\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p><strong>Note: If you have an NVidia GPU<\/strong><\/p>\n<p>If you have an NVIDIA GPU, you must\u00a0<a href=\"https:\/\/ubuntu.com\/server\/docs\/nvidia-drivers-installation\">install the NVIDIA drivers<\/a>\u00a0for this to work properly.<\/p>\n<p>You can verify they\u2019re installed correctly by typing:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-fallback\" data-lang=\"fallback\"><span class=\"line\"><span class=\"cl\">nvidia-smi\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>And you should see something like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/www.jeremymorgan.com\/images\/tutorials\/generative-ai\/how-to-transcribe-audio\/how-to-transcribe-audio-01.webp\" alt=\"\u201cHow to Transcribe Audio to Text Python\u201d\" width=\"474\" height=\"237\" \/><\/p>\n<h2 id=\"install-whisper\">Install Whisper<\/h2>\n<p>Whisper runs as an executable within your Python environment. It\u2019s pretty cool.<\/p>\n<p>The best way to install it is:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-fallback\" data-lang=\"fallback\"><span class=\"line\"><span class=\"cl\">pip install -U openai-whisper\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>But you can also pull the latest version straight from the repository if you like:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-fallback\" data-lang=\"fallback\"><span class=\"line\"><span class=\"cl\">pip install git+https:\/\/github.com\/openai\/whisper.git\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>Either way, it will install a bunch of packages, so go get some ice water. 
When it\u2019s done, the whisper executable will be installed.<\/p>\n<p>I recorded a sample file, and here\u2019s how we can run it.<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-bash\" data-lang=\"bash\"><span class=\"line\"><span class=\"cl\">whisper <span class=\"o\">[<\/span>audio.flac audio.mp3 audio.wav<span class=\"o\">]<\/span> --model <span class=\"o\">[<\/span>model size<span class=\"o\">]<\/span>\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>I will start with the tiny model just to see how it performs. Here\u2019s a list of available models<\/p>\n<table>\n<thead>\n<tr>\n<th>\u00a0Size<\/th>\n<th>Parameters<\/th>\n<th>English-only model<\/th>\n<th>Multilingual model<\/th>\n<th>Required VRAM<\/th>\n<th>Relative speed<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\u00a0tiny<\/td>\n<td>\u00a0 \u00a039 M<\/td>\n<td>\u00a0 \u00a0\u00a0<code>tiny.en<\/code><\/td>\n<td>\u00a0 \u00a0 \u00a0\u00a0<code>tiny<\/code><\/td>\n<td>\u00a0 \u00a0 ~1 GB<\/td>\n<td>\u00a0 \u00a0 \u00a0~32x<\/td>\n<\/tr>\n<tr>\n<td>\u00a0base<\/td>\n<td>\u00a0 \u00a074 M<\/td>\n<td>\u00a0 \u00a0\u00a0<code>base.en<\/code><\/td>\n<td>\u00a0 \u00a0 \u00a0\u00a0<code>base<\/code><\/td>\n<td>\u00a0 \u00a0 ~1 GB<\/td>\n<td>\u00a0 \u00a0 \u00a0~16x<\/td>\n<\/tr>\n<tr>\n<td>small<\/td>\n<td>\u00a0 244 M<\/td>\n<td>\u00a0 \u00a0\u00a0<code>small.en<\/code><\/td>\n<td>\u00a0 \u00a0 \u00a0<code>small<\/code><\/td>\n<td>\u00a0 \u00a0 ~2 GB<\/td>\n<td>\u00a0 \u00a0 \u00a0~6x<\/td>\n<\/tr>\n<tr>\n<td>medium<\/td>\n<td>\u00a0 769 M<\/td>\n<td>\u00a0 \u00a0<code>medium.en<\/code><\/td>\n<td>\u00a0 \u00a0 \u00a0<code>medium<\/code><\/td>\n<td>\u00a0 \u00a0 ~5 GB<\/td>\n<td>\u00a0 \u00a0 \u00a0~2x<\/td>\n<\/tr>\n<tr>\n<td>large<\/td>\n<td>\u00a0 1550 M<\/td>\n<td>\u00a0 \u00a0 \u00a0 \u00a0N\/A<\/td>\n<td>\u00a0 \u00a0 \u00a0<code>large<\/code><\/td>\n<td>\u00a0 \u00a0~10 GB<\/td>\n<td>\u00a0 \u00a0 \u00a0 1x<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>I\u2019ll 
start with the smallest model and see its accuracy, then work my way up if needed.<\/p>\n<p>Here\u2019s the command I ran to parse and extract from my sample file:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-fallback\" data-lang=\"fallback\"><span class=\"line\"><span class=\"cl\">whisper sample-audio.wav --model tiny\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>And lucky for me, it was transcribed perfectly:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.jeremymorgan.com\/images\/tutorials\/generative-ai\/how-to-transcribe-audio\/how-to-transcribe-audio-02.webp\" alt=\"\u201cHow to Transcribe Audio to Text Python\u201d\" \/><\/p>\n<p>Your results will vary. If you don\u2019t like the output you can always step it up to a larger model, which will take more memory and a longer amount of time.<\/p>\n<p>So, what else can you do with this tool?<\/p>\n<h2 id=\"building-a-cool-python-script\">Building a Cool Python Script<\/h2>\n<p>The Whisper service has a bunch of cool features that I don\u2019t use, like translation! But what if we want to script this stuff, like processing 100 audio files or something? 
Building a Python script to run it is easy.<\/p>\n<p>Here\u2019s a script straight from the GitHub page:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-gdscript3\" data-lang=\"gdscript3\"><span class=\"line\"><span class=\"cl\"><span class=\"n\">import<\/span> <span class=\"n\">whisper<\/span>\r\n<\/span><\/span>\r\n<span class=\"line\"><span class=\"cl\"><span class=\"n\">model<\/span> <span class=\"o\">=<\/span> <span class=\"n\">whisper<\/span><span class=\"o\">.<\/span><span class=\"n\">load_model<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"base\"<\/span><span class=\"p\">)<\/span>\r\n<\/span><\/span><span class=\"line\"><span class=\"cl\"><span class=\"n\">result<\/span> <span class=\"o\">=<\/span> <span class=\"n\">model<\/span><span class=\"o\">.<\/span><span class=\"n\">transcribe<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"audio.mp3\"<\/span><span class=\"p\">)<\/span>\r\n<\/span><\/span><span class=\"line\"><span class=\"cl\"><span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"n\">result<\/span><span class=\"p\">[<\/span><span class=\"s2\">\"text\"<\/span><span class=\"p\">])<\/span>\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>And when I run it, it shows clean text output.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.jeremymorgan.com\/images\/tutorials\/generative-ai\/how-to-transcribe-audio\/how-to-transcribe-audio-03.webp\" alt=\"\u201cHow to Transcribe Audio to Text Python\u201d\" \/><\/p>\n<p>You can of course, write this to its own text file:<\/p>\n<div class=\"highlight\">\n<pre class=\"chroma\" tabindex=\"0\"><code class=\"language-python\" data-lang=\"python\"><span class=\"line\"><span class=\"cl\"><span class=\"k\">with<\/span> <span class=\"nb\">open<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"output.txt\"<\/span><span class=\"p\">,<\/span> <span class=\"s2\">\"w\"<\/span><span class=\"p\">)<\/span> <span class=\"k\">as<\/span> <span 
class=\"n\">file<\/span><span class=\"p\">:<\/span>\r\n<\/span><\/span><span class=\"line\"><span class=\"cl\"> <span class=\"n\">file<\/span><span class=\"o\">.<\/span><span class=\"n\">write<\/span><span class=\"p\">(<\/span><span class=\"n\">result<\/span><span class=\"p\">[<\/span><span class=\"s2\">\"text\"<\/span><span class=\"p\">])<\/span>\r\n<\/span><\/span><\/code><\/pre>\n<\/div>\n<p>There are tons of options available. It also does transcriptions in other languages as well.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Le besoin =&gt; enregistrement de r\u00e9union =&gt; R\u00e9cup\u00e9rer un transcript =&gt; faire une synth\u00e8se (MAIA \/ BPCE ou\u00a0 Chat GPT ou Gemini [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-5423","post","type-post","status-publish","format-standard","hentry","category-non-classe"],"_links":{"self":[{"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5423","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5423"}],"version-history":[{"count":4,"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5423\/revisions"}],"predecessor-version":[{"id":5490,"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5423\/revisions\/5490"}],"wp:attachment":[{"href":"http:\/\/www.blue-bears.com\/blog\/index.php?
rest_route=%2Fwp%2Fv2%2Fmedia&parent=5423"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5423"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.blue-bears.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5423"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
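The "process 100 audio files" idea above can be sketched as a small batch script. This is a minimal sketch, not from the original post: the `recordings` folder name and the `transcript_path` helper are my own assumptions, and it writes each transcript next to its audio file.

```python
from pathlib import Path


def transcript_path(audio_path) -> Path:
    """Map an audio file to its transcript file (same name, .txt extension)."""
    return Path(audio_path).with_suffix(".txt")


def transcribe_folder(folder: str, model_name: str = "tiny") -> None:
    """Transcribe every .wav file in a folder, reusing one loaded model."""
    # Imported here so the path helper stays usable without whisper installed.
    import whisper

    model = whisper.load_model(model_name)  # load once, reuse for every file
    for audio in sorted(Path(folder).glob("*.wav")):
        result = model.transcribe(str(audio))
        transcript_path(audio).write_text(result["text"], encoding="utf-8")


# Usage (requires openai-whisper and FFmpeg installed):
# transcribe_folder("recordings", model_name="tiny")
```

Loading the model once outside the loop matters: model loading dominates the cost for short clips, so a naive per-file `load_model` would be far slower.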
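Besides `result["text"]`, whisper's `transcribe()` also returns a `result["segments"]` list whose entries carry start/end times and text, which is handy for meeting notes. A small formatting helper (my own sketch, not from the post) could look like this, shown here against a whisper-shaped dummy dict:

```python
def format_segments(result: dict) -> str:
    """Render whisper-style segments as '[mm:ss] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        start = int(seg["start"])  # segment start, in seconds
        lines.append(f"[{start // 60:02d}:{start % 60:02d}] {seg['text'].strip()}")
    return "\n".join(lines)


# Dummy data mimicking the shape of whisper's output:
fake = {"segments": [{"start": 0.0, "text": " Hello everyone."},
                     {"start": 65.4, "text": " First agenda item."}]}
print(format_segments(fake))
# [00:00] Hello everyone.
# [01:05] First agenda item.
```

Feeding a real `model.transcribe(...)` result into `format_segments` gives a timestamped transcript ready to paste into a summarization prompt.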
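On the NVIDIA point: whisper runs on PyTorch and silently falls back to CPU when no CUDA device is visible, so transcription just gets slower rather than failing. This tiny probe (my own sketch) reports which device whisper will actually use:

```python
def compute_device() -> str:
    """Report whether PyTorch (whisper's backend) can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch not installed"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(compute_device())
```

If this prints `cpu` on a machine with an NVIDIA card, revisit the driver installation step above before blaming the model size for slow runs.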