Installing OpenAI Whisper on Debian

Installation:

Update the packages:

sudo apt update
sudo apt upgrade -y
sudo apt install -y python3-pip python3-dev build-essential
sudo apt install -y python3-venv
~~python3 -m venv whisper-env~~

Install Torch (use --no-cache-dir unless more than 4 GB of memory is available)

pip install torch --no-cache-dir

Install FFmpeg

sudo apt install -y ffmpeg

Create the virtual environment: this does not work with less than 4 GB of RAM (only 3.8 GB available here!) for the Torch installation (installing Torch outside a Python environment does not work)
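To check the memory constraint before attempting the Torch install, a short Python sketch can read the total RAM from /proc/meminfo. The 4 GiB threshold is the figure reported in this post, not an official PyTorch requirement, and the function names are illustrative.

```python
from pathlib import Path

# Threshold reported in this post; an assumption, not an official requirement.
MIN_GIB_FOR_TORCH = 4.0


def total_memory_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def enough_memory_for_torch(meminfo_text: str,
                            minimum_gib: float = MIN_GIB_FOR_TORCH) -> bool:
    """True when the total RAM reaches the assumed minimum."""
    return total_memory_gib(meminfo_text) >= minimum_gib


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        text = meminfo.read_text()
        print(f"{total_memory_gib(text):.2f} GiB total")
        print("enough for Torch" if enough_memory_for_torch(text)
              else "probably too little memory for Torch")
```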

~~source whisper-env/bin/activate~~

Check:

python3 --version
pip3 --version

Install Whisper:

pip install git+https://github.com/openai/whisper.git

Using Whisper => audio file size 25 MB MAX

Otherwise ... the connection drops.
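A file can be checked against that size before launching a transcription. This is a minimal sketch: the 25 MB value is the limit observed in this post on this machine, and the helper names are made up for the example.

```python
import os

# Limit observed in this post; treat it as an assumption for other machines.
MAX_AUDIO_MB = 25


def size_mb(path: str) -> float:
    """File size in megabytes (1 MB = 1024 * 1024 bytes here)."""
    return os.path.getsize(path) / (1024 * 1024)


def is_within_limit(path: str, max_mb: float = MAX_AUDIO_MB) -> bool:
    """True when the file is small enough to try transcribing as is."""
    return size_mb(path) <= max_mb
```

For example, `is_within_limit("audio.mp3")` returns False for a 44 MB file, which is the cue to compress it first.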

Fortunately the audio can be compressed (44 MB of MP3 becomes 9 MB of Ogg) ... but the Ogg file still ends with: connection terminated 🙁

ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.ogg
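The same ffmpeg call can be scripted, for instance to compress several files in a row. The sketch below only rebuilds the command line shown above as an argument list; the function names are illustrative, and ffmpeg must already be installed.

```python
import subprocess
from typing import List


def build_opus_command(src: str, dst: str, bitrate: str = "12k") -> List[str]:
    """Rebuild the ffmpeg command above as an argument list."""
    return [
        "ffmpeg",
        "-i", src,
        "-vn",                    # drop any video stream (e.g. cover art)
        "-map_metadata", "-1",    # strip metadata
        "-ac", "1",               # downmix to mono
        "-c:a", "libopus",        # encode with Opus
        "-b:a", bitrate,          # audio bitrate
        "-application", "voip",   # tune the encoder for speech
        dst,
    ]


def compress(src: str, dst: str, bitrate: str = "12k") -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_opus_command(src, dst, bitrate), check=True)
```

Passing a list instead of a single string avoids shell quoting problems with file names that contain spaces.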

The commands:

cd /home/YSALMON/audio/

whisper --model small "audio.mp3" --language French --fp16 False --output_format srt  
whisper --model small "audio.mp3" --language French --fp16 False --output_format srt --task transcribe
whisper --model small "/home/YSALMON/audio/audio.mp3" --language French --fp16 False --output_format srt --task transcribe
whisper --model small "/home/YSALMON/audio/audio.ogg" --language French --fp16 False --output_format srt --task transcribe
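These command lines can also be launched from Python, which makes it easier to spot a crash. The sketch below uses only the flags shown in the commands above; the wrapper functions are hypothetical names for this example.

```python
import subprocess
from typing import List


def build_whisper_command(audio: str,
                          model: str = "small",
                          language: str = "French",
                          output_format: str = "srt",
                          task: str = "transcribe") -> List[str]:
    """Build the same whisper command line as above, as an argument list."""
    return [
        "whisper",
        "--model", model,
        audio,
        "--language", language,
        "--fp16", "False",       # CPU inference, as in the commands above
        "--output_format", output_format,
        "--task", task,
    ]


def run_whisper(audio: str, **options) -> int:
    """Run whisper and return its exit code.

    On Linux a negative code means the process was killed by a signal,
    which is what an out-of-memory kill would look like (assumption: the
    crashes described in this post are of that kind).
    """
    completed = subprocess.run(build_whisper_command(audio, **options))
    return completed.returncode
```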

# model = whisper.load_model("tiny.en")
# model = whisper.load_model("base.en")
# model = whisper.load_model("small")      # load the small model => OK
# model = whisper.load_model("medium.en")
# model = whisper.load_model("large")      # Crash!
# model = whisper.load_model("medium")     # Crash!
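Which model fits can be estimated from the available memory. The sketch below uses the approximate memory figures listed in the Whisper README (given there as VRAM, so only a rough guide for CPU RAM, which is an assumption here); the function name is illustrative.

```python
from typing import Optional

# Approximate memory needed per model, in GB, from the Whisper README.
# Those figures are for GPU memory; using them for RAM is an approximation.
APPROX_MEMORY_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def largest_model_that_fits(available_gb: float) -> Optional[str]:
    """Return the largest model whose estimate fits, or None if none does."""
    best = None
    for name, needed_gb in APPROX_MEMORY_GB:  # ordered from smallest to largest
        if needed_gb <= available_gb:
            best = name
    return best
```

With the 3.81 GiB available on this machine the function returns "small", which matches what was observed: small runs, medium and large crash.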

Output files are written to the current directory in the .json, .tsv, .txt, .vtt and .srt (subtitles) formats.
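For reference, the .srt output can also be produced from the Python API. The sketch below is an illustration, not the CLI's own writer: `whisper.load_model` and `model.transcribe` are the library's calls, while the two formatting helpers are written for this example.

```python
from typing import Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Turn segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "small",
                      language: str = "fr") -> str:
    """Transcribe a file and return SRT text (needs whisper installed)."""
    import whisper  # imported here so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language=language, fp16=False)
    return segments_to_srt(result["segments"])
```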

Usable in VLC => load the audio file

Subtitle menu => Add Subtitle File (.srt .txt .vtt)
Audio => Visualizations => Spectrometer (otherwise it does not work!)

Observations:

Crash => disconnection => out of memory (the models are too large for the available memory: only small fits)
1 minute of audio = 4 minutes of transcription time, measured on:
- Linux 5.10.0-34-amd64 on x86_64
- Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 4 cores
- Memory 3.81 GiB
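The 1:4 ratio gives a quick way to estimate how long a file will take. This is a rough sketch that assumes the ratio measured above (small model, this particular machine) holds for other files, which is not guaranteed.

```python
# Ratio measured in this post with the small model on this machine.
TRANSCRIPTION_RATIO = 4.0


def estimated_transcription_minutes(audio_minutes: float,
                                    ratio: float = TRANSCRIPTION_RATIO) -> float:
    """Estimate the transcription time from the audio duration."""
    return audio_minutes * ratio
```

A 30 minute recording would therefore take about `estimated_transcription_minutes(30)` = 120 minutes, i.e. two hours.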

References:

The commands:

root@vxxxx:/home/YSALMON/audio# whisper --help
usage: whisper

[-h]
[--model MODEL]
[--model_dir MODEL_DIR]
[--device DEVICE]
[--output_dir OUTPUT_DIR]
[--output_format {txt,vtt,srt,tsv,json,all}]
[--verbose VERBOSE]
[--task {transcribe,translate}]
[--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
[--temperature TEMPERATURE]
[--best_of BEST_OF]
[--beam_size BEAM_SIZE] default 5 = number of hypotheses explored => increasing it improves quality but also increases memory use and computation time.
[--patience PATIENCE] : default 1.0: how long the search may keep looking for a better hypothesis.
[--length_penalty LENGTH_PENALTY] : default None (simple length normalization) => increasing it favors longer output.
[--suppress_tokens SUPPRESS_TOKENS]
[--initial_prompt INITIAL_PROMPT] : the context, e.g. "initial_prompt=Meeting transcript:"
[--carry_initial_prompt CARRY_INITIAL_PROMPT]
[--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT]
[--fp16 FP16]
[--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
[--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
[--logprob_threshold LOGPROB_THRESHOLD]
[--no_speech_threshold NO_SPEECH_THRESHOLD]
[--word_timestamps WORD_TIMESTAMPS]
[--prepend_punctuations PREPEND_PUNCTUATIONS]
[--append_punctuations APPEND_PUNCTUATIONS]
[--highlight_words HIGHLIGHT_WORDS]
[--max_line_width MAX_LINE_WIDTH]
[--max_line_count MAX_LINE_COUNT]
[--max_words_per_line MAX_WORDS_PER_LINE]
[--threads THREADS]
[--clip_timestamps CLIP_TIMESTAMPS]
[--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]
audio

positional arguments:
audio audio file(s) to transcribe

optional arguments:
-h, --help show this help message and exit
--model MODEL name of the Whisper model to use (default: turbo)
--model_dir MODEL_DIR
the path to save model files; uses ~/.cache/whisper by default (default: None)
--device DEVICE device to use for PyTorch inference (default: cpu)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
directory to save the outputs (default: .)
--output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
format of the output file; if not specified, all available formats will be produced
(default: all)
--verbose VERBOSE whether to print out the progress and debug messages (default: True)
--task {transcribe,translate}
whether to perform X->X speech recognition ('transcribe') or X->English translation
('translate') (default: transcribe)
--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
language spoken in the audio, specify None to perform language detection (default: None)
--temperature TEMPERATURE
temperature to use for sampling (default: 0)
--best_of BEST_OF number of candidates when sampling with non-zero temperature (default: 5)
--beam_size BEAM_SIZE
number of beams in beam search, only applicable when temperature is zero (default: 5)
--patience PATIENCE optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424,
the default (1.0) is equivalent to conventional beam search (default: None)
--length_penalty LENGTH_PENALTY
optional token length penalty coefficient (alpha) as in
https://arxiv.org/abs/1609.08144, uses simple length normalization by default (default:
None)
--suppress_tokens SUPPRESS_TOKENS
comma-separated list of token ids to suppress during sampling; '-1' will suppress most
special characters except common punctuations (default: -1)
--initial_prompt INITIAL_PROMPT
optional text to provide as a prompt for the first window. (default: None)
--carry_initial_prompt CARRY_INITIAL_PROMPT
if True, prepend initial_prompt to every internal decode() call. May reduce the
effectiveness of condition_on_previous_text (default: False)
--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
if True, provide the previous output of the model as a prompt for the next window;
disabling may make the text inconsistent across windows, but the model becomes less
prone to getting stuck in a failure loop (default: True)
--fp16 FP16 whether to perform inference in fp16; True by default (default: True)
--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
temperature to increase when falling back when the decoding fails to meet either of the
thresholds below (default: 0.2)
--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
if the gzip compression ratio is higher than this value, treat the decoding as failed
(default: 2.4)
--logprob_threshold LOGPROB_THRESHOLD
if the average log probability is lower than this value, treat the decoding as failed
(default: -1.0)
--no_speech_threshold NO_SPEECH_THRESHOLD
if the probability of the <|nospeech|> token is higher than this value AND the decoding
has failed due to `logprob_threshold`, consider the segment as silence (default: 0.6)
--word_timestamps WORD_TIMESTAMPS
(experimental) extract word-level timestamps and refine the results based on them
(default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the next word (default:
"'“¿([{-)
--append_punctuations APPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the previous word
(default: "'.。,，!！?？:：”)]}、)
--highlight_words HIGHLIGHT_WORDS
(requires --word_timestamps True) underline each word as it is spoken in srt and vtt
(default: False)
--max_line_width MAX_LINE_WIDTH
(requires --word_timestamps True) the maximum number of characters in a line before
breaking the line (default: None)
--max_line_count MAX_LINE_COUNT
(requires --word_timestamps True) the maximum number of lines in a segment (default:
None)
--max_words_per_line MAX_WORDS_PER_LINE
(requires --word_timestamps True, no effect with --max_line_width) the maximum number of
words in a segment (default: None)
--threads THREADS number of threads used by torch for CPU inference; supercedes
MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)
--clip_timestamps CLIP_TIMESTAMPS
comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
process, where the last end timestamp defaults to the end of the file (default: 0)
--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
(requires --word_timestamps True) skip silent periods longer than this threshold (in
seconds) when a possible hallucination is detected (default: None)
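As an illustration of the clip_timestamps format described in the help text (start,end,start,end,... in seconds, with the last end optional), here is a small helper that builds the string from a list of clips. The helper name and its input shape are assumptions made for this sketch; only the output format comes from the help text.

```python
from typing import List, Optional, Tuple

Clip = Tuple[float, Optional[float]]


def _fmt(value: float) -> str:
    """Write whole seconds without a trailing .0 (60.0 -> '60')."""
    return str(int(value)) if float(value).is_integer() else str(float(value))


def format_clip_timestamps(clips: List[Clip]) -> str:
    """Build the value for --clip_timestamps from (start, end) pairs.

    Only the last clip may have end=None, meaning "until the end of the file".
    """
    parts = []
    for position, (start, end) in enumerate(clips):
        is_last = position == len(clips) - 1
        if end is None and not is_last:
            raise ValueError("only the last clip may omit its end")
        if end is not None and end <= start:
            raise ValueError("each clip must end after it starts")
        parts.append(_fmt(start))
        if end is not None:
            parts.append(_fmt(end))
    return ",".join(parts)
```

For instance, `format_clip_timestamps([(0, 60), (120, None)])` gives "0,60,120": the first minute, then everything from 2:00 to the end of the file.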


