Speech Recognition Notes (Kaldi) Part 13 (Feature Transforms: Dan's DNN (nnet2))

Let's look inside Dan's implementation (nnet2).

According to the official documentation, the following transforms are applied to the input features (MFCC).
These appear to correspond to what Dan's paper calls the baseline / Type I features.

splice
LDA (linear discriminant analysis)
MLLT (maximum likelihood linear transform) / global STC (semi-tied covariance)
fMLLR (feature-space maximum likelihood linear regression)

The paper compares the word error rate (WER) of four feature types, and the one that performed best is the following.

The Type-IV features consist of our baseline 40-dimensional speaker adapted features that have been spliced again, followed by de-correlation and dimensionality reduction using another LDA.

In other words, splice is applied once more to the baseline features, followed by a (second) LDA transform.

The second LDA transform corresponds to the "FixedAffineComponent" part of the neural network model.
This part is not updated during training (fixed in advance and not trainable).

Model (excerpt)

<Nnet> <NumComponents> 7 <Components> 
<SpliceComponent> 
    <InputDim> 40 <Context> [ -4 -3 -2 -1 0 1 2 3 4 ] <ConstComponentDim> 0
</SpliceComponent> 
<FixedAffineComponent> 
    <LinearParams>
    [ 0.1481841 0.1649369 (snip)
      (snip)
      -0.0002072983 0.0001211765 (snip) ]
    <BiasParams>
    [ 7.769857 5.612672 (snip) ]
</FixedAffineComponent> 
(snip)
</Components> </Nnet>
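
As a rough check, the output dimension of the splice step can be read off the excerpt above. The following is only a sketch: the helper splice_output_dim and its regular expressions are hypothetical and written for this note, not part of Kaldi, and the formula (the non-constant dimensions are repeated once per context offset, the constant dimensions are appended once) is my understanding of how SpliceComponent behaves.

```python
import re

# Text copied from the SpliceComponent part of the model excerpt.
MODEL_TEXT = """
<SpliceComponent>
    <InputDim> 40 <Context> [ -4 -3 -2 -1 0 1 2 3 4 ] <ConstComponentDim> 0
</SpliceComponent>
"""


def splice_output_dim(model_text: str) -> int:
    """Hypothetical helper: compute the spliced dimension from model text."""
    input_dim = int(re.search(r"<InputDim>\s+(\d+)", model_text).group(1))
    context_str = re.search(r"<Context>\s+\[([^\]]*)\]", model_text).group(1)
    context = [int(v) for v in context_str.split()]
    const_dim = int(re.search(r"<ConstComponentDim>\s+(\d+)", model_text).group(1))
    # Assumption: non-constant dims are repeated for each context offset,
    # constant dims are appended only once.
    return (input_dim - const_dim) * len(context) + const_dim


print(splice_output_dim(MODEL_TEXT))  # prints 360
```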

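Suppose the baseline features are 40-dimensional. After the second splice they become 360-dimensional (40 x 9 frames).
If the second LDA transform is run without dimensionality reduction, the "LinearParams" parameter is 360 rows x 360 columns and the "BiasParams" parameter is 1 row x 360 columns.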
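
The dimension arithmetic can be traced with a small pure-Python sketch. It is not Kaldi code: the functions splice and fixed_affine are hypothetical names, the edge handling (repeating the first or last frame when an offset falls outside the utterance) is an assumption of this sketch, and an identity matrix stands in for the real LDA matrix, which is not reproduced here.

```python
from typing import List

Frame = List[float]


def splice(frames: List[Frame], context: List[int]) -> List[Frame]:
    """Concatenate each frame with its neighbours at the given offsets.

    Assumption of this sketch: offsets outside the utterance are clamped
    to the first / last frame.
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        row: Frame = []
        for offset in context:
            src = min(max(t + offset, 0), last)
            row.extend(frames[src])
        spliced.append(row)
    return spliced


def fixed_affine(x: Frame, linear: List[Frame], bias: Frame) -> Frame:
    """Compute y = linear * x + bias, with linear stored as out_dim rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(linear, bias)]


context = list(range(-4, 5))                    # [-4 ... 4], 9 offsets
frames = [[float(t)] * 40 for t in range(20)]   # 20 dummy 40-dim frames
spliced = splice(frames, context)
in_dim = len(spliced[0])

# Placeholder parameters: identity matrix and zero bias instead of real LDA.
linear = [[1.0 if i == j else 0.0 for j in range(in_dim)] for i in range(in_dim)]
bias = [0.0] * in_dim
out = fixed_affine(spliced[0], linear, bias)

print(len(frames[0]), "->", in_dim, "->", len(out))  # prints 40 -> 360 -> 360
```

With no dimensionality reduction the matrix is square, so the 360-dimensional spliced vector stays 360-dimensional after the affine transform.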

The data for the LDA transform is generated by the "src/nnet2bin/nnet-get-feature-transform" command, which is called inside "steps/nnet2/get_lda.sh".

Get feature-projection transform using stats obtained with acc-lda.
See comments in the code of nnet2/get-feature-transform.h for more information.
Usage:  nnet-get-feature-transform [options] <matrix-out> <lda-acc-1> <lda-acc-2> ...

According to the comments there, LDA is applied here not to reduce the dimensionality but to decorrelate the data.

The wrapper script "steps/nnet2/get_lda.sh" also performs no dimensionality reduction by default.

lda_dim=  # This defaults to no dimension reduction.

As the paper also notes, whitening appears to be more beneficial than dimensionality reduction for the input to DNN training.
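
To make "decorrelation without dimensionality reduction" concrete, here is a toy 2-dimensional whitening sketch. It uses plain Cholesky-based whitening, which is a stand-in chosen for illustration and not the LDA-like transform that nnet-get-feature-transform estimates. The function names and the synthetic data are made up for this example.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def covariance_2d(points: List[Point]) -> Tuple[float, float, float]:
    """Return (var_x, cov_xy, var_y) using 1/n normalization."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return cxx, cxy, cyy


def whiten_2d(points: List[Point]) -> List[Point]:
    """Whiten 2-D data: subtract the mean, then apply inverse Cholesky factor."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx, cxy, cyy = covariance_2d(points)
    # Cholesky factor of the covariance: [[a, 0], [b, c]]
    a = math.sqrt(cxx)
    b = cxy / a
    c = math.sqrt(cyy - b * b)
    out = []
    for x, y in points:
        u = (x - mx) / a
        v = ((y - my) - b * u) / c
        out.append((u, v))
    return out


random.seed(0)
data = []
for _ in range(1000):
    x = random.gauss(0.0, 1.0)
    y = 0.8 * x + 0.3 * random.gauss(0.0, 1.0)  # strongly correlated with x
    data.append((x, y))

before = covariance_2d(data)
after = covariance_2d(whiten_2d(data))
print("before:", [round(v, 3) for v in before])
print("after: ", [round(v, 3) for v in after])
```

The dimension stays at 2 before and after. Only the covariance changes, from a correlated one to (approximately) the identity.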

ichou1's blog

Mainly about speech recognition, occasionally about data analysis
