Speech Recognition Notes (DeepSpeech), Part 1

Trying out DeepSpeech, published by Mozilla.
https://github.com/mozilla/DeepSpeech

Environment setup (installation)

% pip install deepspeech

% pip show deepspeech
Name: deepspeech
Version: 0.7.0
Summary: A library for running inference on a DeepSpeech model
Home-page: https://github.com/mozilla/DeepSpeech
Author: Mozilla
Author-email: None
License: MPL-2.0
Location: /home/ichou1/.pyenv/versions/3.6.8/lib/python3.6/site-packages
Requires: numpy
Required-by:

This post uses version 0.7.0 (released April 24, 2020).
Release DeepSpeech 0.7.0 · mozilla/DeepSpeech · GitHub

If an older version is already installed, upgrade it with the following command.

% pip install --upgrade deepspeech

Environment setup (acoustic model)

According to the usage message, the acoustic model ("--model" option) and the audio file ("--audio" option) are both required.

% deepspeech -h  
usage: deepspeech [-h] --model MODEL [--scorer SCORER] --audio AUDIO
                  [--beam_width BEAM_WIDTH] [--lm_alpha LM_ALPHA]
                  [--lm_beta LM_BETA] [--version] [--extended] [--json]

Running DeepSpeech inference.

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model (protocol buffer binary file)
  --scorer SCORER       Path to the external scorer file
  --audio AUDIO         Path to the audio file to run (WAV format)
  --beam_width BEAM_WIDTH
                        Beam width for the CTC decoder
  --lm_alpha LM_ALPHA   Language model weight (lm_alpha). If not specified,
                        use default from the scorer package.
  --lm_beta LM_BETA     Word insertion bonus (lm_beta). If not specified, use
                        default from the scorer package.
  --version             Print version and exits
  --extended            Output string from extended metadata
  --json                Output json from metadata with timestamp of each word

A pre-trained acoustic model is published, so we use that.
https://github.com/mozilla/DeepSpeech/releases/download/v0.7.0/deepspeech-0.7.0-models.pbmm
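
For reference, a minimal Python sketch of fetching the model file. The helper names (`local_model_path`, `download_model`) are made up for this note, and the destination directory `deepspeech-0.7.0-models/` is only an assumption chosen to match the path passed to "--model" in the recognition step.

```python
import os
import urllib.request

# URL of the 0.7.0 acoustic model, as listed above.
MODEL_URL = (
    "https://github.com/mozilla/DeepSpeech/releases/download/"
    "v0.7.0/deepspeech-0.7.0-models.pbmm"
)
# Assumed local directory (not dictated by DeepSpeech itself).
MODEL_DIR = "deepspeech-0.7.0-models"


def local_model_path():
    """Return the relative path where the model file will be stored."""
    file_name = MODEL_URL.rsplit("/", 1)[-1]
    return MODEL_DIR + "/" + file_name


def download_model():
    """Download the model once and return its local path."""
    dest = local_model_path()
    os.makedirs(MODEL_DIR, exist_ok=True)
    if not os.path.exists(dest):
        # The file is large, so skip the download if it already exists.
        urllib.request.urlretrieve(MODEL_URL, dest)
    return dest
```

Calling `download_model()` once before running the CLI should be enough; the same thing can of course be done with wget or curl.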

The acoustic models were trained on American English and the pbmm model achieves an 5.97% word error rate on the LibriSpeech clean test corpus.

In other words, the word error rate (WER) on the LibriSpeech clean test corpus is just under 6% (5.97%).
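
As a reminder of what WER measures: it is usually defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It only illustrates the standard definition and is not the evaluation script used for the figure quoted above; `word_error_rate` is a name chosen for this note.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # ref word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

For a one-word clip such as "yes", the WER is either 0 (recognized) or at least 1 (not recognized), so a corpus-level figure like 5.97% only makes sense over many utterances.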

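Running recognition

Let's recognize an audio file containing the word "yes".
(For details on the audio file, see the post below.)
Audio file preprocessing (data loading) - ichou1のブログ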

% deepspeech \
--model 'deepspeech-0.7.0-models/deepspeech-0.7.0-models.pbmm' \
--audio 'speech_dataset/yes/3102f006_nohash_0.wav'

Output

Loading model from file deepspeech-0.7.0-models/deepspeech-0.7.0-models.pbmm
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.0-0-g3fbbca2
Loaded model in 0.0273s.
Running inference.
yes
Inference took 0.787s for 1.000s audio file.

Look at the second line from the bottom: the audio is correctly recognized as "yes".

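The same recognition can also be driven from Python, since the pip package installs a `deepspeech` module. The following is a minimal sketch, assuming the 0.7.0 Python API (`Model(model_path)`, `sampleRate()`, `stt()`) and a 16-bit mono WAV file; `read_wav_int16` and `transcribe` are helper names invented here.

```python
import wave


def read_wav_int16(path):
    """Return (sample_rate, raw PCM bytes) for a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        return wav.getframerate(), wav.readframes(wav.getnframes())


def transcribe(model_path, audio_path):
    """Run DeepSpeech inference on one WAV file and return the text."""
    # Imported lazily so the WAV helper works without these packages.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)
    sample_rate, pcm = read_wav_int16(audio_path)
    if sample_rate != model.sampleRate():
        # This sketch does no resampling; it only refuses mismatched input.
        raise ValueError(
            "audio is %d Hz but the model expects %d Hz"
            % (sample_rate, model.sampleRate())
        )
    # stt() takes the samples as a 16-bit integer array.
    return model.stt(np.frombuffer(pcm, dtype=np.int16))
```

With the paths used above, `transcribe('deepspeech-0.7.0-models/deepspeech-0.7.0-models.pbmm', 'speech_dataset/yes/3102f006_nohash_0.wav')` would be expected to return the same text as the CLI.
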
(Reference) Comparison with older versions

this version is not backwards compatible with version 0.6.1 or earlier versions.
So when updating one will have to update code and models.

That is, 0.7.0 is not backward compatible with version 0.6.1 or earlier, so both code and models have to be updated.

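Below are the options used when running version 0.6.1.
The trie file ("--trie" option) and the language model ("--lm" option) are specified explicitly. In 0.7.0 these two options are gone, and the usage message shows a single "--scorer" option instead.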

% deepspeech \
--model deepspeech-0.6.1-models/output_graph.pbmm \
--trie deepspeech-0.6.1-models/trie \
--lm deepspeech-0.6.1-models/lm.binary \
--audio 'speech_dataset/yes/3102f006_nohash_0.wav'
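
To make the option difference concrete, here is a small sketch that builds the argument list for either version. `build_command` is a hypothetical helper written for this note, not part of DeepSpeech. It only assembles the options that appear in the commands and usage message above.

```python
def build_command(model, audio, *, scorer=None, lm=None, trie=None):
    """Build a deepspeech CLI argument list.

    0.7.0 style: optional scorer (--scorer).
    0.6.1 style: trie (--trie) and language model (--lm).
    """
    if scorer and (lm or trie):
        raise ValueError("use either --scorer or --lm/--trie, not both")

    cmd = ["deepspeech", "--model", model]
    if scorer:
        cmd += ["--scorer", scorer]
    if trie:
        cmd += ["--trie", trie]
    if lm:
        cmd += ["--lm", lm]
    cmd += ["--audio", audio]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. For 0.6.1 that would be `build_command('deepspeech-0.6.1-models/output_graph.pbmm', 'speech_dataset/yes/3102f006_nohash_0.wav', trie='deepspeech-0.6.1-models/trie', lm='deepspeech-0.6.1-models/lm.binary')`, which reproduces the command above.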