Speech Recognition Notes (Kaldi), Part 19: Toolkit Scripts (3)

The previous post, on the "Kaldi for Dummies tutorial", got as far as the first triphone training pass:

TRI1 - simple triphone training (first triphone pass).

This post looks at the processing that comes after that.

"egs/rm/s5/RESULTS" records the WER of each experiment. Some of the experiments listed there are:

mono

Monophone, MFCC + delta + accel

tri1

MFCC + delta + accel

tri2a

MFCC + delta + accel (on top of better alignments)

tri2b

LDA + MLLT

tri3b

LDA + MLLT + SAT

tri3c

raw-fMLLR ( fMLLR on the raw MFCCs )

sgmm2_4[a-c]

SGMM2 is a new version of the code that has tying of the substates a bit like "state-clustered tied mixture" systems; and which has speaker-dependent mixture weights.

nnet4[a-e]

Deep neural net -- various types of hybrid system.

dnn4b

MFCC, LDA, fMLLR features, (Karel - 30.7.2015)

cnn4c

FBANK + pitch features, (Karel - 30.7.2015)
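RESULTS also holds a WER line for each of these experiments. As a rough sketch, assuming each result line starts with the token %WER followed by the value and ends with the path of the scoring file (check this against your own copy of the file), the figures can be listed as below. The helper name list_wer is hypothetical.

```shell
# Print "<scoring file path> <WER>" for every %WER line in a RESULTS-style file.
# Assumption: field 1 is "%WER", field 2 is the value, the last field is the path.
list_wer() {
  awk '$1 == "%WER" { print $NF, $2 }' "$1"
}

# Only run when the file is actually present (for example, from the Kaldi root).
if [ -f egs/rm/s5/RESULTS ]; then
  list_wer egs/rm/s5/RESULTS
fi
```

Piping the output through grep (for example on "exp/tri") narrows it to one family of experiments.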

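Taking "nnet4d" (the primary nnet2 recipe) as the target, let's trace the flow backward from it to the initial triphone model (tri1).
(I am working in an environment without a GPU, so everything here is checked under the no-GPU setting.)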

According to the official documentation, "rm/s5/local/run_nnet2.sh" is the script to start from:

The first place to look to get a top level overview of the neural net training is probably the scripts. In the standard example scripts in egs/rm/s5, egs/wsj/s5 and egs/swbd/s5b, the top-level script is run.sh. This script calls (sometimes commented out) a script called local/run_nnet2.sh. This is the top-level example script for Dan's setup.

Excerpt from rm/s5/local/run_nnet2.sh

# **THIS IS THE PRIMARY RECIPE (40-dim + fMLLR + p-norm neural net)**
local/nnet2/run_4d.sh --use-gpu false
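Kaldi scripts read options such as "--use-gpu false" through utils/parse_options.sh, which turns the option name into a shell variable (use_gpu). The following is a simplified stand-in for that mechanism, not the real script; parse_opts is a hypothetical name, and unlike the real parser it does not check that the variable was declared beforehand.

```shell
use_gpu=true   # default value, overridden by "--use-gpu <value>"

# Simplified stand-in for utils/parse_options.sh (assumption: every
# option takes exactly one value and comes before positional arguments).
parse_opts() {
  while [ $# -ge 2 ]; do
    case "$1" in
      --*)
        # "--use-gpu" -> "use_gpu"
        name=$(printf '%s\n' "${1#--}" | sed 's/-/_/g')
        eval "$name=\$2"
        shift 2 ;;
      *) break ;;
    esac
  done
}

parse_opts --use-gpu false
echo "use_gpu=$use_gpu"
```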

Excerpt from egs/rm/s5/local/nnet2/run_4d.sh

steps/nnet2/train_pnorm_fast.sh \
    data/train \
    data/lang \
    exp/tri3b_ali \
    exp/nnet4d

Producing the training output "exp/nnet4d" requires the alignment data "exp/tri3b_ali" as input.
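Before launching the training it can help to confirm that the alignment directory is really there. The sketch below assumes that a finished alignment directory contains ali.1.gz, final.mdl and tree; that file list is an assumption about the usual layout, and have_files is a hypothetical helper.

```shell
# Return 0 only if every named file exists under the given directory.
have_files() {
  dir=$1; shift
  for f in "$@"; do
    [ -f "$dir/$f" ] || { echo "missing: $dir/$f" >&2; return 1; }
  done
}

# Assumed contents of a completed alignment directory.
if have_files exp/tri3b_ali ali.1.gz final.mdl tree; then
  echo "exp/tri3b_ali looks ready for nnet2 training"
else
  echo "run steps/align_fmllr.sh first"
fi
```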

Excerpt from egs/rm/s5/run.sh

# Align all data with LDA+MLLT+SAT system (tri3b)
steps/align_fmllr.sh \
    --use-graphs true \
    data/train \
    data/lang \
    exp/tri3b \
    exp/tri3b_ali

Producing the alignment output "exp/tri3b_ali" requires "exp/tri3b" as input.

Excerpt from egs/rm/s5/run.sh

## Do LDA+MLLT+SAT
# usage: steps/train_sat.sh <#leaves> <#gauss> <data> <lang> <ali-dir> <exp-dir>
steps/train_sat.sh \
	1800 \
	9000 \
	data/train \
	data/lang \
	exp/tri2b_ali \
	exp/tri3b

Producing the training output "exp/tri3b" requires the alignment data "exp/tri2b_ali" as input.

Excerpt from egs/rm/s5/run.sh

# Align all data with LDA+MLLT system (tri2b)
steps/align_si.sh \
	--use-graphs true \
	data/train \
	data/lang \
	exp/tri2b \
	exp/tri2b_ali

Producing the alignment data "exp/tri2b_ali" requires "exp/tri2b" as input.

Excerpt from egs/rm/s5/run.sh

# train and decode tri2b [LDA+MLLT]
# usage: steps/train_lda_mllt.sh <#leaves> <#gauss> <data> <lang> <ali-dir> <exp-dir>
steps/train_lda_mllt.sh \
	1800 \
	9000 \
	data/train \
	data/lang \
	exp/tri1_ali \
	exp/tri2b

Producing the training output "exp/tri2b" requires the alignment data "exp/tri1_ali" as input.

Excerpt from egs/rm/s5/run.sh

# align tri1
steps/align_si.sh \
	--use-graphs true \
	data/train \
	data/lang \
	exp/tri1 \
	exp/tri1_ali

Producing the alignment data "exp/tri1_ali" requires the trained model "exp/tri1" as input.
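The backward walk above amounts to repeatedly looking up "which input does this output need?". A minimal sketch of that lookup, using only the directory pairs from the excerpts (the function names input_of and trace_back are hypothetical):

```shell
# Map an output directory to the input directory it is built from.
input_of() {
  case "$1" in
    exp/nnet4d)    echo exp/tri3b_ali ;;
    exp/tri3b_ali) echo exp/tri3b ;;
    exp/tri3b)     echo exp/tri2b_ali ;;
    exp/tri2b_ali) echo exp/tri2b ;;
    exp/tri2b)     echo exp/tri1_ali ;;
    exp/tri1_ali)  echo exp/tri1 ;;
    *)             return 1 ;;   # no known input: stop here
  esac
}

# Walk backward from a target until no further input is known.
trace_back() {
  dir=$1
  chain=$dir
  while prev=$(input_of "$dir"); do
    chain="$chain <- $prev"
    dir=$prev
  done
  echo "$chain"
}

trace_back exp/nnet4d
```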

Writing out the flow from "exp/tri1" to "exp/nnet4d" gives the following.

1. Alignment using the triphone model (MFCC + delta + accel)

Output: "exp/tri1_ali"

2. Building and training the triphone model (LDA + MLLT)

Output: "exp/tri2b"

3. Alignment using the triphone model (LDA + MLLT)

Output: "exp/tri2b_ali"

4. Building and training the triphone model (LDA + MLLT + SAT)

Output: "exp/tri3b"

5. Alignment using the triphone model (LDA + MLLT + SAT)

Output: "exp/tri3b_ali"

6. Building and training the neural network model

Output: "exp/nnet4d"
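The six steps can also be written down as a small table and printed as a dry-run plan that marks which outputs already exist. This is only a sketch: it runs nothing, the script and directory names are the ones from the excerpts, and the function name plan is hypothetical.

```shell
# Stage table in execution order: "<script> <output dir>".
stages='steps/align_si.sh exp/tri1_ali
steps/train_lda_mllt.sh exp/tri2b
steps/align_si.sh exp/tri2b_ali
steps/train_sat.sh exp/tri3b
steps/align_fmllr.sh exp/tri3b_ali
steps/nnet2/train_pnorm_fast.sh exp/nnet4d'

# Print each stage with a status; a stage counts as done when its
# output directory exists (a coarse check, used here for illustration).
plan() {
  n=0
  echo "$stages" | while read -r script out; do
    n=$((n + 1))
    if [ -d "$out" ]; then status=done; else status=todo; fi
    echo "$n. $script -> $out [$status]"
  done
}

plan
```

Run from egs/rm/s5, the first line marked todo shows where the recipe would need to resume.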

ichou1's blog

Mainly about speech recognition, occasionally about data analysis
