Speech Recognition Notes (Kaldi), Part 17 (Toolkit Scripts)

Kaldi uses Bash scripts to drive the commands it runs.
This post takes a look at those scripts.

The tree downloaded from GitHub has the following directory layout.

egs (the subject of this post)
src (source code)
misc (papers and the like? not checked)
tools (external tools such as OpenFST and ATLAS)
windows (for Windows)

For details, see the explanation on the official Kaldi site (the "Kaldi directories structure" section).

GitHub (official Kaldi repository)

The sample scripts for each corpus are stored under "egs".

egs – example scripts allowing you to quickly build ASR systems for over 30 popular speech corporas (documentation is attached for each project),

What should you do if you bring your own audio data?
The official Kaldi tutorial explains that you can reuse the contents of "egs/wsj/s5".

Excerpt from the "Project finalization -> Tools attachment" section:

From kaldi-trunk/egs/wsj/s5 copy two folders (with the whole content) - utils and steps - and put them in your kaldi-trunk/egs/digits directory.
You can also create links to these directories.
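
The linking option mentioned in the excerpt can be sketched as follows. This is a minimal sketch, not the tutorial's own script: `KALDI_ROOT` is a variable name introduced here for illustration, and it defaults to a scratch directory with empty stand-in `steps` and `utils` directories so the snippet can be run without a Kaldi checkout. The "digits" directory name comes from the excerpt.

```shell
# Sketch: link the shared wsj scripts into a new recipe directory.
# KALDI_ROOT is a hypothetical variable; it defaults to a scratch directory
# so the sketch runs without a real Kaldi tree.
KALDI_ROOT="${KALDI_ROOT:-$(mktemp -d)}"

# Stand-ins for the real script directories (harmless if they already exist).
mkdir -p "$KALDI_ROOT/egs/wsj/s5/steps" "$KALDI_ROOT/egs/wsj/s5/utils"

# "digits" is the project directory named in the tutorial excerpt.
recipe="$KALDI_ROOT/egs/digits"
mkdir -p "$recipe"
cd "$recipe" || exit 1

# Relative links keep working if the whole tree is moved elsewhere.
for d in steps utils; do
  [ -e "$d" ] || ln -s "../wsj/s5/$d" "$d"
done

ls -l steps utils
```

Because "digits" sits directly under "egs" in the excerpt, the link target is `../wsj/s5/...`; a recipe one level deeper (such as `egs/rm/s5`) would need `../../wsj/s5/...` instead.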

"wsj" appears to be a corpus built from Wall Street Journal news text.

Excerpt from egs/wsj/README.txt:

About the Wall Street Journal corpus:
This is a corpus of read sentences from the Wall Street Journal, recorded under clean conditions.
The vocabulary is quite large. About 80 hours of training data.
Available from the LDC as either: [ catalog numbers LDC93S6A (WSJ0) and LDC94S13A (WSJ1) ]
or: [ catalog numbers LDC93S6B (WSJ0) and LDC94S13B (WSJ1) ]
....

In the directories for other corpora (for example "egs/rm/s5/steps"), "steps" turns out to be a symbolic link to "egs/wsj/s5/steps".

/opt/kaldi/egs/rm/s5% ls -l steps
lrwxrwxrwx 1 ichou1 ichou1 18  2月  5 19:46 steps -> ../../wsj/s5/steps
/opt/kaldi/egs/rm/s5% file steps 
steps: symbolic link to `../../wsj/s5/steps' 
/opt/kaldi/egs/rm/s5%

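For people who have no corpus but want to try Kaldi, "egs/yesno" is provided.
It comes with the audio data (.wav), so you can try it right away.
(Each file contains eight utterances, each either "YES" or "NO", in a varying pattern. There are 31 files for training and 29 for testing.)

Excerpt from egs/yesno/README: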

The "yesno" corpus is a very small dataset of recordings of one individual saying yes or no multiple times per recording, in Hebrew.

Excerpt from egs/yesno/s5/waves_yesno/README:

The archive "waves_yesno.tar.gz" contains 60 .wav files, sampled at 8 kHz.
All were recorded by the same male speaker, in English (although the individual is not a native speaker).
In each file, the individual says 8 words;
each word is either "yes" or "no", so each file is a random sequence of 8 yes-es or noes.
There is no separate transcription provided;
the sequence is encoded in the filename, with 1 for yes and 0 for no, for instance:
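
The filename encoding described in the README (1 for yes, 0 for no) is enough to recover a reference transcript. The function below is a minimal sketch of that idea; `name_to_transcript` is a name made up for this post and is not claimed to be how the recipe's own data preparation script does it.

```shell
# name_to_transcript FILE
# Print the utterance id followed by the words encoded in the filename,
# mapping 1 -> YES and 0 -> NO.
name_to_transcript() {
  utt=$(basename "$1" .wav)
  words=$(printf '%s\n' "$utt" \
    | tr '_' '\n' \
    | sed -e 's/^1$/YES/' -e 's/^0$/NO/' \
    | tr '\n' ' ')
  # Strip the single trailing space left behind by the last tr.
  printf '%s %s\n' "$utt" "${words% }"
}

name_to_transcript waves_yesno/1_0_0_0_0_0_0_0.wav
```

The output has the same "utterance-id word word ..." shape as the recognition results shown later in this post, which makes the two easy to compare.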

How to run

cd egs/yesno/s5
./run.sh

What it does internally

Data preparation

--> runs "local/prepare_data.sh", "local/prepare_dict.sh", "utils/prepare_lang.sh", and "local/prepare_lm.sh"

Feature extraction

--> runs "steps/make_mfcc.sh", "steps/compute_cmvn_stats.sh", and "utils/fix_data_dir.sh"
("steps" and "utils" are links to "egs/wsj/s5/steps" and "egs/wsj/s5/utils")

Mono training (monophone training)

--> runs "steps/train_mono.sh"

Graph compilation

--> runs "utils/mkgraph.sh"

Decoding

--> runs "steps/decode.sh"

When the script finishes, the WER (word error rate) is printed to the console.
The decoding results can be checked in the log (egs/yesno/s5/exp/mono0a/decode_test_yesno/log/decode.1.log).
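
To make the WER figure concrete, here is a minimal sketch of the usual definition: the word-level edit distance (substitutions, insertions, deletions) between reference and hypothesis, divided by the number of reference words. The `wer` function is a stand-in written for this post, not Kaldi's own scoring code, and it assumes a non-empty reference.

```shell
# wer REF HYP
# Print the word error rate in percent, computed as the word-level
# Levenshtein distance divided by the number of reference words.
wer() {
  awk -v ref="$1" -v hyp="$2" 'BEGIN {
    n = split(ref, r, " ")
    m = split(hyp, h, " ")
    # d[i, j] = edit distance between the first i ref words and first j hyp words
    for (i = 0; i <= n; i++) d[i, 0] = i
    for (j = 0; j <= m; j++) d[0, j] = j
    for (i = 1; i <= n; i++) {
      for (j = 1; j <= m; j++) {
        cost = (r[i] == h[j]) ? 0 : 1
        best = d[i-1, j-1] + cost                           # match or substitution
        if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1      # deletion
        if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1      # insertion
        d[i, j] = best
      }
    }
    printf "%.2f\n", 100 * d[n, m] / n
  }'
}

# One substitution in a four-word reference.
wer "YES NO NO NO" "YES NO YES NO"
```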

Example: recognition result for "egs/yesno/s5/waves_yesno/1_0_0_0_0_0_0_0.wav"

1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO

Assuming the top directory is "/opt/kaldi", these are the commands for running the decoders directly (the results are written to standard output as text).

decode (without lattice)

/opt/kaldi/src/gmmbin/gmm-decode-faster \
--word-symbol-table=/opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/words.txt \
/opt/kaldi/egs/yesno/s5/exp/mono0a/40.mdl \
/opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/HCLG.fst \
"ark,s,cs:/opt/kaldi/src/featbin/apply-cmvn --utt2spk=ark:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/utt2spk scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/cmvn.scp scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/feats.scp ark:- | /opt/kaldi/src/featbin/add-deltas ark:- ark:- |" \
ark,t:-

The parameters being passed are described in the previous post.
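
The long command above is easier to read when the repeated paths are factored out. The following is a minimal sketch that builds the same argument list from variables; `KALDI`, `exp`, `data`, `feats`, and `decode_cmd` are names introduced here for illustration. It runs the decoder only if the binary exists and otherwise prints the arguments one per line as a dry run.

```shell
KALDI=/opt/kaldi
exp=$KALDI/egs/yesno/s5/exp/mono0a
data=$KALDI/egs/yesno/s5/data/test_yesno/split1/1

# Feature pipeline: apply CMVN, then add delta features; the trailing "|"
# tells Kaldi to read the features from the output of this command.
feats="ark,s,cs:$KALDI/src/featbin/apply-cmvn --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:$data/feats.scp ark:- | $KALDI/src/featbin/add-deltas ark:- ark:- |"

decode_cmd() {
  # Same arguments, in the same order, as the command shown above.
  set -- "$KALDI/src/gmmbin/gmm-decode-faster" \
    "--word-symbol-table=$exp/graph_tgpr/words.txt" \
    "$exp/40.mdl" \
    "$exp/graph_tgpr/HCLG.fst" \
    "$feats" \
    "ark,t:-"
  if [ -x "$1" ]; then
    "$@"                    # real run
  else
    printf '%s\n' "$@"      # dry run: show one argument per line
  fi
}

decode_cmd
```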

Result (without lattice)

1_0_0_0_0_0_0_0 3 2 2 2 2 2 2 2 
1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO 
LOG (gmm-decode-faster[5.3.106~1389-9e2d8]:main():gmm-decode-faster.cc:196) Log-like per frame for utterance 1_0_0_0_0_0_0_0 is -8.37946 over 668 frames.

"3" corresponds to "YES" in words.txt, and "2" to "NO".
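
The id-to-word mapping can be reproduced with a small awk filter. This is a minimal sketch that assumes words.txt holds one "symbol id" pair per line; the stand-in file below contains only the two entries confirmed above (the real file has more symbols), and `ids_to_words` is a name made up for this post.

```shell
# Stand-in for graph_tgpr/words.txt with only the two entries known from
# the output above; the real file contains additional symbols.
words_txt=$(mktemp)
printf 'NO 2\nYES 3\n' > "$words_txt"

# ids_to_words WORDS_TXT
# Read "utt-id id id ..." lines on stdin and replace each integer id with
# its symbol; ids missing from the table are passed through unchanged.
ids_to_words() {
  awk 'NR == FNR { sym[$2] = $1; next }
       {
         out = $1
         for (i = 2; i <= NF; i++) out = out " " (($i in sym) ? sym[$i] : $i)
         print out
       }' "$1" -
}

echo "1_0_0_0_0_0_0_0 3 2 2 2 2 2 2 2" | ids_to_words "$words_txt"
```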

decode (with lattice)

/opt/kaldi/src/gmmbin/gmm-latgen-faster \
--word-symbol-table=/opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/words.txt \
/opt/kaldi/egs/yesno/s5/exp/mono0a/final.mdl \
/opt/kaldi/egs/yesno/s5/exp/mono0a/graph_tgpr/HCLG.fst \
"ark,s,cs:/opt/kaldi/src/featbin/apply-cmvn --utt2spk=ark:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/utt2spk scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/cmvn.scp scp:/opt/kaldi/egs/yesno/s5/data/test_yesno/split1/1/feats.scp ark:- | /opt/kaldi/src/featbin/add-deltas ark:- ark:- |" \
ark,t:-

Result (with lattice)

1_0_0_0_0_0_0_0 YES NO NO NO NO NO NO NO 
1_0_0_0_0_0_0_0 
0	1	3	9.34174, 10746.4,	4_1_1_1_1_1_16_18_<snip>
1	2	2	3.00029,  3604.42,	15_15_15_15_15_15_<snip>
2	3	2	3.75534,   460.406,	29_29
3	4	2	6.37105,   626.19,
4	5	2	5.32006,   589.474,
5	6	2	5.67636,  4377.79,
6	7	2	5.32006,   596.049,
7	8	2	4.3186,   6239.1,
8	9	2	5.85963,  5268.64,	29_29_29_29
8			9.50533, 28208.8,	29_29_29_29_4_1_1_1_<snip>
9			7.30095, 22958.9,	26_28_30_4_16_15_15_<snip>

LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance 1_0_0_0_0_0_0_0 is -8.37946 over 668 frames.

The model (egs/yesno/s5/exp/mono0a/final.mdl) converted to text form

<TransitionModel> 
<Topology> 
<TopologyEntry> 
<ForPhones> 
2 3 
</ForPhones> 
<State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State> 
<State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State> 
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State> 
<State> 3 </State> 
</TopologyEntry> 
<TopologyEntry> 
<ForPhones> 
1 
</ForPhones> 
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State> 
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State> 
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State> 
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State> 
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State> 
<State> 5 </State> 
</TopologyEntry> 
</Topology> 
<Triples> 11 
1 0 0 
1 1 1 
1 2 2 
1 3 3 
1 4 4 
2 0 5 
2 1 6 
2 2 7 
3 0 8 
3 1 9 
3 2 10 
</Triples> 
<LogProbs> 
 [ 0 -0.3016863 -4.60517 -2.116771 -2.040137 -0.05096635 -4.60517 -3.516702 -4.60517 -4.60517 -0.09362812 -2.668062 -4.60517 -4.60517 -4.60517 -0.1123881 -2.449803 -0.04502614 -3.122941 -0.3431785 -1.236192 -0.1315082 -2.09372 -0.07189104 -2.668334 -0.1359556 -2.062634 -0.09793975 -2.371973 -0.04792399 -3.062005 ]
</LogProbs> 
</TransitionModel> 
<DIMENSION> 39 <NUMPDFS> 11
<DiagGMM> 
<GCONSTS>  [ -162.6711 -100.3258 -150.894 -774.145 <snip> ]
<WEIGHTS>  [ 0.02608728 0.03167231 0.03214631 0.03326807 0.01074118 <snip>]
<MEANS_INVVARS>  [
  -3.798081 -5.357131 0.8406813 0.918729 1.014658 "snip"
  0.5328674 1.181959 -0.6352269 -0.7017035 -0.06531551 "snip" ]
<INV_VARS>  [
  0.2399497 0.4042536 0.2387805 0.09193342 0.04029746 "snip"
  0.282881 0.1213772 0.07582887 0.03232023 0.03635461 "snip" ]
</DiagGMM> 
<DiagGMM> "snip" </DiagGMM>   ( repeated 10 more times )

"YES" (/jes/) is not split into phones in the style of "j-e+s"; the whole word is treated as a single unit.
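
The `<LogProbs>` row in the model dump is easier to interpret once converted back to probabilities. The sketch below assumes the values are natural logarithms and that the leading 0 is an unused slot at index 0, so the remaining entries line up with transition ids starting at 1. `logprobs_to_probs` is a name made up for this post, and only the first few values of the row are copied in.

```shell
# First entries of the <LogProbs> row above. The leading 0 is treated as
# the unused index-0 slot (assumption stated in the text).
logprobs="0 -0.3016863 -4.60517 -2.116771 -2.040137"

# logprobs_to_probs "ROW"
# Print "transition-id probability" for every entry after the first.
logprobs_to_probs() {
  printf '%s\n' "$1" \
    | awk '{ for (i = 2; i <= NF; i++) printf "%d %.4f\n", i - 1, exp($i) }'
}

logprobs_to_probs "$logprobs"
```

The four probabilities printed here add up to roughly 1, which would be consistent with them being the four outgoing transitions of state 0 in the topology entry for phone 1. Note also that -4.60517 is about log(0.01).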