Speech Recognition Notes (Kaldi), Part 21 (Phoneme Models)

Moving beyond isolated-word recognition, this post tries a somewhat more practical setup: recognizing whole sentences.

First, each training utterance is broken down into phonemes.

Original sentence

オススメの料理は何ですか (What dish do you recommend?)

Split the sentence into words (word segmentation, using MeCab)

オススメ の 料理 は 何 です か

Convert each word into a phoneme sequence (using yomi2voca.pl, which ships with Julius)

オススメ	o s u s u m e
の	n o
料理	ry o u r i
は	w a
何	n a N
です	d e s u
か	k a
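
As a minimal sketch of the last step, the word-to-phoneme table above can be treated as a small lexicon and applied to the segmented sentence. The `LEXICON` dictionary and the `to_phonemes` helper are hypothetical names used only for illustration; in the actual workflow the segmentation comes from MeCab and the pronunciations from yomi2voca.pl.

```python
# Word-to-phoneme table copied from the yomi2voca.pl output shown above.
LEXICON = {
    "オススメ": "o s u s u m e",
    "の": "n o",
    "料理": "ry o u r i",
    "は": "w a",
    "何": "n a N",
    "です": "d e s u",
    "か": "k a",
}


def to_phonemes(segmented: str, word_sep: str = "   ") -> str:
    """Map a whitespace-separated word sequence to its phoneme transcript."""
    words = segmented.split()
    missing = [w for w in words if w not in LEXICON]
    if missing:
        # Fail loudly instead of silently dropping unknown words.
        raise KeyError(f"no pronunciation for: {missing}")
    return word_sep.join(LEXICON[w] for w in words)


transcript = to_phonemes("オススメ の 料理 は 何 です か")
print(transcript)
```

Words are separated by three spaces here only to match the transcript layout used in the utterance list in this post.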

Excluding the silence phones (sp, sil), there are 40 phonemes.

Phoneme table

[Figure: phoneme table]
(Source)
http://winnie.kuis.kyoto-u.ac.jp/dictation/doc/phone_m.pdf

Of these, "dy" (ぢゃ, ぢゅ, ぢょ) was left out.
I wrote 20 sentences that together use all 39 remaining phonemes and recorded each sentence 3 times, giving 60 audio files in total.

Utterances (20 sentences)

オススメの料理は何ですか (What dish do you recommend?)

o s u s u m e   n o   ry o u r i   w a   n a N   d e s u   k a

百十番テーブルへどうぞ (Please go to table 110)

hy a k u   j u u   b a N   t e: b u r u   e   d o u z o

ラーメンと餃子のセットを１つお願いします (One ramen and gyoza set, please)

r a: m e N   t o   gy o u z a   n o   s e q t o   o   h i t o ts u   o n e g a i   sh i   m a s u

メニューお願いします (The menu, please)

m e ny u:   o n e g a i   sh i   m a s u

禁煙席お願いします (A non-smoking seat, please)

k i N e N   s e k i   o n e g a i   sh i   m a s u

お水４つお願いします (Four waters, please)

o   m i z u   y o q ts u   o n e g a i   sh i   m a s u

フォーク２つお願いします (Two forks, please)

f o: k u   f u t a ts u   o n e g a i   sh i   m a s u

コーヒーは食後にお願いします (Coffee after the meal, please)

k o: h i:   w a   sh o k u g o   n i   o n e g a i   sh i   m a s u

ソフトドリンクはありますか (Do you have soft drinks?)

s o f u t o d o r i N k u   w a   a r i   m a s u   k a

持ち帰りにできますか (Can I get this to go?)

m o ch i k a e r i   n i   d e k i   m a s u   k a

別々にできますか (Can we pay separately?)

b e ts u b e ts u   n i   d e k i   m a s u   k a

ごちそうさまでした (Thank you for the meal)

g o ch i s o u s a m a   d e sh i   t a

牛肉にしてください (I'll have the beef, please)

gy u u n i k u   n i   sh i   t e   k u d a s a i

キャベツはお代わり自由です (Cabbage refills are free)

ky a b e ts u   w a   o   k a w a r i   j i y u u   d e s u

サプライズはできますか (Can you arrange a surprise?)

s a p u r a i z u   w a   d e k i   m a s u   k a

シャンパンをください (Champagne, please)

sh a N p a N   o   k u d a s a i

かんぴょうをください (Kanpyo, please)

k a N py o u   o   k u d a s a i

食事はビュッフェスタイルです (The meal is buffet style)

sh o k u j i   w a   by u q f e   s u t a i r u   d e s u

みょうがを添えてください (Please add myoga on the side)

my o u g a   o   s o e   t e   k u d a s a i

コーヒーと紅茶どちらにしますか (Would you like coffee or tea?)

k o: h i:   t o   k o u ch a   d o ch i r a   n i   sh i   m a s u   k a
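
The claim that these 20 sentences cover all 39 phonemes can be checked mechanically. The following is a small sketch that counts the distinct phoneme symbols in the transcripts listed above; `TRANSCRIPTS` and `phoneme_counts` are names introduced here for illustration only.

```python
from collections import Counter

# Phoneme transcripts of the 20 sentences, copied from the list above.
TRANSCRIPTS = [
    "o s u s u m e   n o   ry o u r i   w a   n a N   d e s u   k a",
    "hy a k u   j u u   b a N   t e: b u r u   e   d o u z o",
    "r a: m e N   t o   gy o u z a   n o   s e q t o   o   h i t o ts u   o n e g a i   sh i   m a s u",
    "m e ny u:   o n e g a i   sh i   m a s u",
    "k i N e N   s e k i   o n e g a i   sh i   m a s u",
    "o   m i z u   y o q ts u   o n e g a i   sh i   m a s u",
    "f o: k u   f u t a ts u   o n e g a i   sh i   m a s u",
    "k o: h i:   w a   sh o k u g o   n i   o n e g a i   sh i   m a s u",
    "s o f u t o d o r i N k u   w a   a r i   m a s u   k a",
    "m o ch i k a e r i   n i   d e k i   m a s u   k a",
    "b e ts u b e ts u   n i   d e k i   m a s u   k a",
    "g o ch i s o u s a m a   d e sh i   t a",
    "gy u u n i k u   n i   sh i   t e   k u d a s a i",
    "ky a b e ts u   w a   o   k a w a r i   j i y u u   d e s u",
    "s a p u r a i z u   w a   d e k i   m a s u   k a",
    "sh a N p a N   o   k u d a s a i",
    "k a N py o u   o   k u d a s a i",
    "sh o k u j i   w a   by u q f e   s u t a i r u   d e s u",
    "my o u g a   o   s o e   t e   k u d a s a i",
    "k o: h i:   t o   k o u ch a   d o ch i r a   n i   sh i   m a s u   k a",
]


def phoneme_counts(transcripts):
    """Count how often each phoneme symbol occurs across all transcripts."""
    counts = Counter()
    for line in transcripts:
        counts.update(line.split())  # split() ignores the wider word gaps
    return counts


counts = phoneme_counts(TRANSCRIPTS)
print(len(counts), "distinct phonemes")
# Phonemes seen only once are the ones with the least training data.
print("seen once:", sorted(p for p, n in counts.items() if n == 1))
```

Listing the phonemes that occur only once is a quick way to see which ones depend on a single sentence (three recordings) for all of their training data.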

(Reference) data/lang/phones.txt

<eps> 0
sil 1
sil_B 2
sil_E 3
sil_I 4
sil_S 5
spn 6
spn_B 7
spn_E 8
spn_I 9
spn_S 10
N_B 11
N_E 12
N_I 13
N_S 14
a_B 15
a_E 16
a_I 17
a_S 18
<snip>
z_B 163
z_E 164
z_I 165
z_S 166
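
The 166 entries follow from the word-position suffixes (_B, _E, _I, _S for begin, end, internal, singleton): each of the 2 non-speech phones contributes a bare symbol plus 4 variants, and each of the 39 phonemes contributes 4 variants. The sketch below rebuilds such a table to illustrate the arithmetic; it is not Kaldi's own script, and the order of `NONSILENCE` is an assumption, since the middle of the listing above is snipped.

```python
SUFFIXES = ["_B", "_E", "_I", "_S"]  # begin, end, internal, singleton
SILENCE = ["sil", "spn"]
# The 39 phonemes used in this post. The ordering is an assumption; it was
# chosen so that the visible ids (N_B 11, a_B 15, z_B 163) come out the same.
NONSILENCE = (
    "N a a: b by ch d e e: f g gy h hy i i: j k ky m my n ny o o: "
    "p py q r ry s sh t ts u u: w y z"
).split()


def build_symbol_table(silence, nonsilence):
    """Return {symbol: id} with <eps> at 0, as in the listing above."""
    symbols = ["<eps>"]
    for phone in silence:
        symbols.append(phone)  # silence phones also keep a bare form
        symbols += [phone + s for s in SUFFIXES]
    for phone in nonsilence:
        symbols += [phone + s for s in SUFFIXES]
    return {sym: i for i, sym in enumerate(symbols)}


table = build_symbol_table(SILENCE, NONSILENCE)
num_phones = len(table) - 1  # <eps> is not a phone
print(num_phones)            # 2 * 5 + 39 * 4
```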

One recording of 「禁煙席お願いします」 was held out as test data and the remaining 59 recordings were used for training.
The decoding results for each model were as follows.

Monophone (mono)

1-gram

utterance_id_053 禁煙 お願い し ます 
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -8.01345 over 323 frames.

2-gram

utterance_id_053 禁煙 席 お願い し ます 
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -7.97456 over 323 frames.

gmm-info output

gmm-info exp/mono/final.mdl 
number of phones 166
number of pdfs 127
number of transition-ids 1116
number of transition-states 518
feature dimension 39
number of gaussians 1004

Number of pdfs in the model (NUMPDFS, the number of pdf-classes): 127 (5 HMM states × 2 phones + 3 HMM states × 39 phones)
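
The other monophone figures reported by gmm-info can be reproduced the same way. This sketch assumes the default topologies as I understand them (a 3-state left-to-right HMM for speech phones, a 5-state HMM for sil and spn whose states have 4, 4, 4, 4 and 2 outgoing transitions) and assumes that the four word-position variants of a phone share pdfs in the monophone model. These are assumptions, but they are consistent with the numbers above.

```python
N_SIL, N_SPEECH = 2, 39            # sil/spn, and the 39 phonemes
SIL_STATES, SPEECH_STATES = 5, 3   # emitting HMM states per phone

# pdfs are shared across the _B/_E/_I/_S variants of a phone.
num_pdfs = SIL_STATES * N_SIL + SPEECH_STATES * N_SPEECH

# Position-dependent phone symbols: bare + 4 variants for silence phones.
sil_variants = N_SIL * 5
speech_variants = N_SPEECH * 4
num_phones = sil_variants + speech_variants

# One transition-state per (phone variant, HMM state) in the monophone model.
num_transition_states = (sil_variants * SIL_STATES
                         + speech_variants * SPEECH_STATES)

# Outgoing transitions per phone under the assumed topologies.
speech_transitions = SPEECH_STATES * 2   # self-loop + forward per state
sil_transitions = 4 + 4 + 4 + 4 + 2      # assumed 5-state silence topology
num_transition_ids = (speech_variants * speech_transitions
                      + sil_variants * sil_transitions)

print(num_pdfs, num_phones, num_transition_states, num_transition_ids)
```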

Triphone (tri1)

1-gram

utterance_id_053 禁煙 お願い し ます 
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -7.9652 over 323 frames.

2-gram

utterance_id_053 禁煙 席 お願い し ます 
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -7.92976 over 323 frames.

gmm-info output

gmm-info exp/tri1/final.mdl 
number of phones 166
number of pdfs 152
number of transition-ids 1740
number of transition-states 830
feature dimension 39
number of gaussians 977

Triphone (tri2b, LDA+MLLT)

1-gram

utterance_id_053 禁煙 お願い し ます 
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -4.92207 over 323 frames.

2-gram

utterance_id_053 禁煙 お願い し ます 
LOG (gmm-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -4.89113 over 323 frames.

gmm-info output

gmm-info exp/tri2b/final.mdl 
number of phones 166
number of pdfs 168
number of transition-ids 2246
number of transition-states 1083
feature dimension 40
number of gaussians 970

DNN(nnet4c)

1-gram

utterance_id_053 禁煙 お願い し ます 
LOG (nnet-latgen-faster[5.3.106~1389-9e2d8]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:286) Log-like per frame for utterance utterance_id_053 is -0.543315 over 323 frames.

nnet-am-info output

nnet-am-info exp/nnet4c/final.mdl 
num-components 9
num-updatable-components 3
left-context 4
right-context 4
input-dim 40
output-dim 192
parameter-dim 446703
component 0 : SpliceComponent, input-dim=40, output-dim=360, context=-4 -3 -2 -1 0 1 2 3 4 
component 1 : FixedAffineComponent, input-dim=360, output-dim=360, <snip>
component 2 : AffineComponentPreconditionedOnline, input-dim=360, output-dim=375, <snip>
component 3 : TanhComponent, input-dim=375, output-dim=375
component 4 : AffineComponentPreconditionedOnline, input-dim=375, output-dim=375, <snip>
component 5 : TanhComponent, input-dim=375, output-dim=375
component 6 : AffineComponentPreconditionedOnline, input-dim=375, output-dim=453, <snip>
component 7 : SoftmaxComponent, input-dim=453, output-dim=453
component 8 : SumGroupComponent, input-dim=453, output-dim=192
prior dimension: 192, prior sum: 1, prior min: 1e-20
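
As a sanity check on the listing above, the reported parameter-dim can be recomputed from the layer sizes. This sketch assumes that each updatable affine component stores a weight matrix plus a bias vector and that the fixed affine component (component 1) is not counted because it is not updatable.

```python
# (input_dim, output_dim) of the three updatable affine components
# (components 2, 4 and 6 in the listing above).
AFFINE_LAYERS = [(360, 375), (375, 375), (375, 453)]


def affine_params(in_dim: int, out_dim: int) -> int:
    """Weights (in_dim * out_dim) plus one bias per output unit."""
    return in_dim * out_dim + out_dim


parameter_dim = sum(affine_params(i, o) for i, o in AFFINE_LAYERS)

# The splice component stacks frames -4..+4 of the 40-dim features.
context = list(range(-4, 5))
spliced_dim = 40 * len(context)

print(parameter_dim, spliced_dim)
```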

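In several cases the "席" after "禁煙" is dropped: with the 1-gram language model in every acoustic model, and with tri2b even with the 2-gram. I expect the result to change with the amount of training data and the weight parameters.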
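
To compare the runs side by side, the per-frame log-likelihood can be pulled out of the decoder log lines. The sketch below parses the tail of each "Log-like per frame" message quoted above (the leading program and source-file prefix is cut off for brevity); `RUNS` and `parse_loglike` are illustrative names. Note that tri2b and nnet4c work on different features or score types than mono and tri1, so the values are probably not directly comparable across those models.

```python
import re

# (model, language model, tail of the log message) from the runs above.
RUNS = [
    ("mono", "1-gram", "Log-like per frame for utterance utterance_id_053 is -8.01345 over 323 frames."),
    ("mono", "2-gram", "Log-like per frame for utterance utterance_id_053 is -7.97456 over 323 frames."),
    ("tri1", "1-gram", "Log-like per frame for utterance utterance_id_053 is -7.9652 over 323 frames."),
    ("tri1", "2-gram", "Log-like per frame for utterance utterance_id_053 is -7.92976 over 323 frames."),
    ("tri2b", "1-gram", "Log-like per frame for utterance utterance_id_053 is -4.92207 over 323 frames."),
    ("tri2b", "2-gram", "Log-like per frame for utterance utterance_id_053 is -4.89113 over 323 frames."),
    ("nnet4c", "1-gram", "Log-like per frame for utterance utterance_id_053 is -0.543315 over 323 frames."),
]

PATTERN = re.compile(
    r"Log-like per frame for utterance (\S+) is (-?\d+(?:\.\d+)?) over (\d+) frames"
)


def parse_loglike(line: str):
    """Return (utterance_id, loglike_per_frame, num_frames) or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2)), int(match.group(3))


for model, lm, line in RUNS:
    utt, loglike, frames = parse_loglike(line)
    print(f"{model:8s} {lm:7s} {utt} {loglike:10.5f} {frames}")
```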

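Also, the word 「料理」 was transcribed here as "ry o u r i", but it would be better if the pronunciation "ry o: r i" (りょーり) were recognized as well.
Details like this show how deep the domain of spoken language is.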
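
One way to allow both pronunciations is to list the word more than once in the lexicon, one pronunciation per line, which as far as I know Kaldi's lexicon.txt accepts. The sketch below builds such entries; `add_pron` and `long_vowel_variant` are hypothetical helpers, and rewriting "o u" as "o:" is only a rough heuristic that will be wrong for some words, so the generated variants would need a manual check.

```python
from collections import defaultdict

lexicon = defaultdict(list)  # word -> list of pronunciations


def add_pron(word: str, pron: str) -> None:
    """Register a pronunciation, skipping exact duplicates."""
    if pron not in lexicon[word]:
        lexicon[word].append(pron)


def long_vowel_variant(pron: str) -> str:
    """Heuristic: merge each 'o u' pair into the long vowel 'o:'."""
    phones = pron.split()
    out = []
    i = 0
    while i < len(phones):
        if phones[i] == "o" and i + 1 < len(phones) and phones[i + 1] == "u":
            out.append("o:")
            i += 2
        else:
            out.append(phones[i])
            i += 1
    return " ".join(out)


base = "ry o u r i"
add_pron("料理", base)
add_pron("料理", long_vowel_variant(base))

# One "word phone phone ..." line per pronunciation.
lines = [f"{word} {pron}" for word in sorted(lexicon) for pron in lexicon[word]]
print("\n".join(lines))
```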

ichou1's blog

Mostly about speech recognition, with occasional posts on data analysis
