Speech Recognition Notes (Kaldi), Part 18: Toolkit Scripts (2)

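The Kaldi for Dummies tutorial explains how to run recognition on audio data that you prepare yourself.

As the "for Dummies" in the name suggests, it is probably the right thing to try next after the "yes/no" sample (covered in the previous post).

The overall flow is roughly as follows.

Download Kaldi (clone from GitHub)
Data preparation (prepare the acoustic data and the language data)
Project finalization (copy the scoring script / install SRILM / create the config files)
Running scripts creation (create cmd.sh / path.sh / run.sh)
Getting results (run run.sh)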

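As for language models, Julius offered "N-gram" or "DFA" for continuous word recognition (see the previous post) and the "-w" option for isolated word recognition (see the previous post), whereas with Kaldi an N-gram appears to be the only choice.

There are several language model toolkits for building an N-gram.
Natural language processing tools
 Notes on language model building tools - Negative/Positive Thinking

Here I use SRILM, as the tutorial does.
At the time of writing, the latest version is "1.7.2" (updated "9 November 2016").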

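This post traces the flow of "run.sh" from the "Running scripts creation" section.

The flow inside the script is roughly as follows.

Acoustic data preparation (mapping utterances to speakers (a warning is printed when there is only one speaker), feature extraction)
Language model preparation (conversion to WFSTs: Grammar and Lexicon)
Creating and training the monophone model
Decoding with the monophone model
Alignment with the monophone model (this becomes the input for building the triphone model)
Creating and training the triphone model
Decoding with the triphone model

To run "run.sh", it is enough to have the following files in place.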

% tree --charset C    
.
|-- cmd.sh
|-- conf
|   |-- decode.config
|   `-- mfcc.conf
|-- data
|   |-- local
|   |   |-- corpus.txt
|   |   `-- dict
|   |       |-- lexicon.txt
|   |       |-- nonsilence_phones.txt
|   |       |-- optional_silence.txt
|   |       `-- silence_phones.txt
|   |-- test
|   |   |-- text
|   |   |-- utt2spk
|   |   `-- wav.scp
|   `-- train
|       |-- text
|       |-- utt2spk
|       `-- wav.scp
|-- local
|   `-- score.sh
|-- path.sh
|-- run.sh
|-- steps -> ${KALDI_ROOT}/egs/wsj/s5/steps
`-- utils -> ${KALDI_ROOT}/egs/wsj/s5/utils
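
The directory skeleton and the two symlinks in the listing above could be created along these lines. This is only a sketch: the fallback value for KALDI_ROOT and the scratch project directory are placeholders of mine, not values from the tutorial.

```shell
# Assumption: KALDI_ROOT points at your Kaldi checkout; the fallback is a placeholder.
KALDI_ROOT="${KALDI_ROOT:-$HOME/kaldi}"

# Assumption: a scratch directory stands in for the real project directory.
proj="$(mktemp -d)"

# Directories that hold the config files, the scoring script, and the data.
mkdir -p "$proj/conf" "$proj/local" "$proj/data/local/dict" \
         "$proj/data/train" "$proj/data/test"

# steps and utils are symlinks into the wsj/s5 recipe, as in the tree listing.
ln -s "$KALDI_ROOT/egs/wsj/s5/steps" "$proj/steps"
ln -s "$KALDI_ROOT/egs/wsj/s5/utils" "$proj/utils"

ls -l "$proj"
```

The symlinks are created even if KALDI_ROOT does not exist yet, so it is worth checking that they resolve before running anything.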

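"data/train" and "data/test" hold the acoustic data.
"data/train" is for training and "data/test" is for evaluation. This time I prepared three recordings of the utterance "もしもし" (moshimoshi), as three files
(two files for training, one file for testing).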
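
As a rough illustration, the three files under each data directory could look like the ones generated below. The IDs utterance_id_001 and utterance_id_002 appear in the alignment output later in this post; utterance_id_003, the speaker ID, and the wav paths are made up for the example.

```shell
# Assumption: a scratch directory stands in for the project directory.
work="$(mktemp -d)"
mkdir -p "$work/data/train" "$work/data/test"

# wav.scp: <utterance-id> <path-to-wav>  (paths are placeholders)
cat > "$work/data/train/wav.scp" <<'EOF'
utterance_id_001 /path/to/audio/moshimoshi_1.wav
utterance_id_002 /path/to/audio/moshimoshi_2.wav
EOF

# text: <utterance-id> <transcript>
cat > "$work/data/train/text" <<'EOF'
utterance_id_001 もしもし
utterance_id_002 もしもし
EOF

# utt2spk: <utterance-id> <speaker-id>  (speaker ID is hypothetical)
cat > "$work/data/train/utt2spk" <<'EOF'
utterance_id_001 speaker1
utterance_id_002 speaker1
EOF

# The test set has the same three files, with a single utterance.
echo "utterance_id_003 /path/to/audio/moshimoshi_3.wav" > "$work/data/test/wav.scp"
echo "utterance_id_003 もしもし"                          > "$work/data/test/text"
echo "utterance_id_003 speaker1"                          > "$work/data/test/utt2spk"
```

All three files in a directory are keyed by the same utterance IDs, which is what lets Kaldi join audio, transcript, and speaker.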

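"data/local" holds the language data.
In addition to "もしもし", I added two words that can be written with the same phonemes, "もも" (momo, peach) and "いも" (imo, potato), for three words in total.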

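For the three words above, the language data could be laid out as in the following sketch. The phone symbols (i, m, o, sh) and the kana spelling of the words are my assumptions, since the post does not show the actual files. The "!SIL", "<UNK>", sil and spn entries follow the layout used in the Kaldi for Dummies tutorial.

```shell
# Assumption: a scratch directory stands in for data/local.
lm_local="$(mktemp -d)"
dict="$lm_local/dict"
mkdir -p "$dict"

# corpus.txt: one sentence per line, used to estimate the N-gram.
cat > "$lm_local/corpus.txt" <<'EOF'
もしもし
もも
いも
EOF

# lexicon.txt: <word> <phone> <phone> ...  (phone symbols are hypothetical)
cat > "$dict/lexicon.txt" <<'EOF'
!SIL sil
<UNK> spn
いも i m o
もしもし m o sh i m o sh i
もも m o m o
EOF

printf '%s\n' i m o sh > "$dict/nonsilence_phones.txt"
printf '%s\n' sil spn  > "$dict/silence_phones.txt"
printf '%s\n' sil      > "$dict/optional_silence.txt"

# Sanity check: phones used in the lexicon should match the declared phone lists.
used="$(awk '{ for (i = 2; i <= NF; i++) print $i }' "$dict/lexicon.txt" | sort -u | tr '\n' ' ')"
declared="$(cat "$dict/nonsilence_phones.txt" "$dict/silence_phones.txt" | sort -u | tr '\n' ' ')"
[ "$used" = "$declared" ] && echo "phone sets match"
```
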
Let's go through the processing inside "run.sh" in order.

1. Acoustic data preparation

# ===== PREPARING ACOUSTIC DATA =====

# Making spk2utt files
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
utils/utt2spk_to_spk2utt.pl data/test/utt2spk > data/test/spk2utt

# ===== FEATURES EXTRACTION =====

# Making feats.scp files
steps/make_mfcc.sh data/train exp/make_mfcc/train $mfccdir
steps/make_mfcc.sh data/test exp/make_mfcc/test $mfccdir

# Making cmvn.scp files
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test $mfccdir
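
To make the spk2utt step concrete, here is a small awk sketch of the same inversion. It is not the Kaldi Perl script itself, and the utterance and speaker IDs are invented for the example.

```shell
# Assumption: a scratch directory with made-up IDs stands in for data/train.
spkdir="$(mktemp -d)"
printf '%s\n' 'utt_001 spk1' 'utt_002 spk1' 'utt_003 spk2' > "$spkdir/utt2spk"

# Invert "utterance speaker" lines into "speaker utterance utterance ..." lines,
# keeping speakers in the order in which they first appear.
awk '{
  if (!($2 in utts)) order[++n] = $2
  utts[$2] = utts[$2] " " $1
}
END {
  for (i = 1; i <= n; i++) print order[i] utts[order[i]]
}' "$spkdir/utt2spk" > "$spkdir/spk2utt"

cat "$spkdir/spk2utt"
# spk1 utt_001 utt_002
# spk2 utt_003
```

With a single speaker, as in this experiment, spk2utt ends up as one line listing every utterance.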

2. Language model preparation

# ===== PREPARING LANGUAGE DATA =====
utils/prepare_lang.sh \
    data/local/dict \
    "<UNK>" \
    data/local/lang \
    data/lang

# ===== MAKING lm.arpa =====
lm_order=1 # language model order (n-gram quantity)
ngram-count \
    -order $lm_order \
    -write-vocab \
    data/local/tmp/vocab-full.txt \
    -wbdiscount \
    -text data/local/corpus.txt \
    -lm data/local/tmp/lm.arpa

# ===== MAKING G.fst =====

arpa2fst \
    --disambig-symbol=#0 \
    --read-symbol-table=data/lang/words.txt \
    data/local/tmp/lm.arpa \
    data/lang/G.fst
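
With lm_order=1 the model is a unigram, so it is essentially built from word counts. The sketch below computes raw relative frequencies from a made-up corpus to show the idea. It ignores Witten-Bell discounting and sentence boundary tokens, so the numbers will not match what ngram-count writes to lm.arpa.

```shell
# Assumption: a made-up four-line corpus, one word per line.
uni_corpus="$(mktemp)"
cat > "$uni_corpus" <<'EOF'
もしもし
もも
いも
もしもし
EOF

# Count each word and divide by the total number of words.
# Output columns: <word> <count> <relative frequency>
unigrams="$(LC_ALL=C awk '{
  for (i = 1; i <= NF; i++) { count[$i]++; total++ }
}
END {
  for (w in count) printf "%s %d %.2f\n", w, count[w], count[w] / total
}' "$uni_corpus" | LC_ALL=C sort)"

echo "$unigrams"
```
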

[Figure: Lexicon (L.fst)]

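3. Creating and training the monophone model

# Usage: steps/train_mono.sh <data-dir> <lang-dir> <exp-dir>
steps/train_mono.sh \
    data/train \
    data/lang \
    exp/mono

The Kaldi commands called inside the script are as follows.
The script has a variable called "stage", which makes it possible to resume partway through.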

stage: -3

# Initialize monophone GMM
gmmbin/gmm-init-mono

stage: -2

# Creates training graphs(without transition-probabilities, by default)
bin/compile-train-graphs

stage: -1

Accumulate statistics from an equally spaced alignment

# Write an equally spaced alignment(for getting training started)
bin/align-equal-compiled

# Accumulate stats for GMM training
gmmbin/gmm-acc-stats-ali

stage: 0

# Do Maximum Likelihood re-estimation of GMM-based acoustic model
gmmbin/gmm-est

iteration (the default number of training iterations is 40)

# Modify GMM-based model to boost
gmmbin/gmm-boost-silence

# Align features given [GMM-based] models
gmmbin/gmm-align-compiled

# Accumulate stats for GMM training
gmmbin/gmm-acc-stats-ali

# Re-estimate the model (same command as in stage 0)
gmmbin/gmm-est

The output is "exp/mono/final.mdl".

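The "stage" mechanism mentioned above can be sketched as follows. Each block runs only when the requested stage is less than or equal to that block's number, so passing a later stage skips the earlier work. The function name and the echo lines are stand-ins of mine, not the real script.

```shell
# Hypothetical stand-in for a training script; echo lines replace the real commands.
train_sketch() {
  local stage="$1"
  if [ "$stage" -le -3 ]; then echo "stage -3: gmm-init-mono"; fi
  if [ "$stage" -le -2 ]; then echo "stage -2: compile-train-graphs"; fi
  if [ "$stage" -le -1 ]; then echo "stage -1: align-equal-compiled, gmm-acc-stats-ali"; fi
  if [ "$stage" -le 0 ];  then echo "stage 0: gmm-est"; fi
}

# Resume from stage -1: the two earlier stages are skipped.
train_sketch -1
# stage -1: align-equal-compiled, gmm-acc-stats-ali
# stage 0: gmm-est
```

In the real scripts the value is normally passed as an option, for example "--stage -1", assuming the outputs of the skipped stages are still present in the experiment directory.
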
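4. Decoding with the monophone model

From the language data ("data/lang/L.fst", "data/lang/G.fst", and others), generate the decoding graph "HCLG.fst", a graph whose input symbols correspond to HMM states and whose output symbols are words.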


# Usage: utils/mkgraph.sh [options] <lang-dir> <model-dir> <graphdir>
utils/mkgraph.sh \
    --mono \
    data/lang \
    exp/mono \
    exp/mono/graph

The "--mono" option appears to be deprecated.

Note: the --mono, --left-biphone and --quinphone options are now deprecated and will be ignored.

Decoding is run with the "gmmbin/gmm-latgen-faster" command.
The audio data used here is the test set (different from the data used for training).

# Usage: steps/decode.sh [options] <graph-dir> <data-dir> <decode-dir>
steps/decode.sh \
    --config conf/decode.config \
    exp/mono/graph \
    data/test \
    exp/mono/decode

5. Alignment with the monophone model

# Usage: steps/align_si.sh <data-dir> <lang-dir> <src-dir> <align-dir>
steps/align_si.sh \
    data/train \
    data/lang \
    exp/mono \
    exp/mono_ali

Alignment result (exp/mono_ali/ali.1.gz)

utterance_id_001 2 1 1 1 1 1 8 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 18 17 17 206 208 207 210 242 244 246 245 245 245 245 266 265 265 265 268 267 267 267 267 267 267 267 267 270 269 269 269 269 269 194 193 193 193 196 195 195 195 195 198 197 197 197 218 217 217 217 217 220 219 219 219 219 222 221 221 242 244 246 245 245 245 245 245 245 245 245 245 245 245 245 245 245 245 266 268 270 269 269 269 269 269 269 269 269 269 269 269 188 190 189 189 189 189 189 189 192 191 191 191 191 191 3 1 1 1 1 1 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 7 5 5 14 15 15 15 15 15 15 15 15 15 15 12 10 10 10 10 10 10 10 10 10 10 10 10 10 18 
utterance_id_002 2 8 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 18 17 17 17 17 17 17 17 17 206 208 207 207 210 209 209 209 209 209 209 209 209 242 241 241 241 241 241 244 243 243 243 243 243 246 245 245 245 245 245 266 265 265 265 265 268 267 267 267 267 267 267 270 269 269 194 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 196 195 198 218 220 222 242 241 241 241 241 241 241 244 246 266 268 270 188 190 192 3 9 10 10 10 10 10 10 6 5 5 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 6 5 5 5 9 10 10 10 10 10 10 10 10 10 6 5 5 9 10 10 10 10 10 6 5 12 10 10 10 10 18
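
Each number in these lines is a transition-id for one frame, so long runs of the same number mean the model stayed in the same state. A run-length summary makes that easier to read. The awk sketch below is my own helper, not a Kaldi tool, and it uses only the first ten IDs of utterance_id_001 as input.

```shell
# Input: the utterance ID followed by the first ten transition-ids of utterance_id_001.
ali_line="utterance_id_001 2 1 1 1 1 1 8 5 5 5"

# Collapse consecutive repeats into <transition-id>x<number of frames>.
ali_rle="$(echo "$ali_line" | awk '{
  printf "%s", $1
  prev = $2; run = 1
  for (i = 3; i <= NF; i++) {
    if ($i == prev) { run++ }
    else { printf " %sx%d", prev, run; prev = $i; run = 1 }
  }
  printf " %sx%d\n", prev, run
}')"

echo "$ali_rle"
# utterance_id_001 2x1 1x5 8x1 5x3
```

To run it on the real file, the same awk program could be fed from the decompressed text form of the alignment archive.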

6. Creating and training the triphone model

# Usage: steps/train_deltas.sh <num-leaves> <tot-gauss> <data-dir> <lang-dir> <alignment-dir> <exp-dir>
steps/train_deltas.sh \
    2000 \
    11000 \
    data/train \
    data/lang \
    exp/mono_ali \
    exp/tri1

The Kaldi commands called inside the script are as follows.

stage: -3

# Accumulate statistics for phonetic-context tree building.
bin/acc-tree-stats

# Sum statistics for phonetic-context tree building.
bin/sum-tree-stats

stage: -2

# Cluster phones (or sets of phones) into sets for various purposes
bin/cluster-phones

# Compile questions
bin/compile-questions

# Train decision tree
bin/build-tree

# Initialize GMM from decision tree and tree stats
gmmbin/gmm-init-model

# Does GMM mixing up (and Gaussian merging)
gmmbin/gmm-mixup

stage: -1

# Convert alignments from one decision-tree/model to another
bin/convert-ali

stage: 0

# Creates training graphs (without transition-probabilities, by default)
bin/compile-train-graphs

iteration (the default number of training iterations is 35)

# Align features given [GMM-based] models.
gmmbin/gmm-align-compiled

# Accumulate stats for GMM training.
gmmbin/gmm-acc-stats-ali

# Do Maximum Likelihood re-estimation of GMM-based acoustic model
gmmbin/gmm-est

The output is "exp/tri1/final.mdl".

7. Decoding with the triphone model

Same as for the monophone model.

# Usage: utils/mkgraph.sh [options] <lang-dir> <model-dir> <graphdir>
utils/mkgraph.sh \
    data/lang \
    exp/tri1 \
    exp/tri1/graph

# Usage: steps/decode.sh [options] <graph-dir> <data-dir> <decode-dir>
steps/decode.sh \
    --config conf/decode.config \
    exp/tri1/graph \
    data/test \
    exp/tri1/decode