Speech Recognition Notes (Kaldi) Part 1: Feature Extraction

Trying out the speech recognition toolkit Kaldi.

This time: feature extraction.
The audio data is the same file I used when trying HTK's HCopy.

The official Kaldi site has the following statement, which suggests the results will not be exactly identical to HTK's.

With the option --htk-compat=true, and setting parameters correctly, it is possible to get very close to HTK features.

First, save the wav data in Kaldi's archive (ark) format.

featbin/extract-segments scp:mosi1.scp mosi1_segment ark:mosi1.ark

Next, extract the MFCC features.

featbin/compute-mfcc-feats --config=mfcc.conf ark:mosi1.ark ark:mosi1.mfcc

Default options (MFCC)

(feat/feature-mfcc.h, struct MfccOptions)

struct MfccOptions {
    FrameExtractionOptions frame_opts;
    MelBanksOptions mel_opts;
    // (omitted)
    MfccOptions()
        : mel_opts(23),
          // defaults the #mel-banks to 23 for the MFCC computations.
          // this seems to be common for 16khz-sampled data,
          // but for 8khz-sampled data, 15 may be better.
          num_ceps(13),
          use_energy(true),
          energy_floor(0.0),
          raw_energy(true),
          cepstral_lifter(22.0),
          htk_compat(false) {}

    void Register(OptionsItf *opts) {
    // (omitted)
    }
};

Default options (frame processing)

(feat/feature-window.h, struct FrameExtractionOptions)

struct FrameExtractionOptions {
    // (omitted)
    FrameExtractionOptions()
        : samp_freq(16000),
          frame_shift_ms(10.0),
          frame_length_ms(25.0),
          dither(1.0),
          preemph_coeff(0.97),
          remove_dc_offset(true),
          window_type("povey"),
          round_to_power_of_two(true),
          blackman_coeff(0.42),
          snip_edges(true),
          allow_downsample(false) { }

    void Register(OptionsItf *opts) {
    // (omitted)
    }
    // (omitted)
};
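
As a rough check of what these frame defaults mean for 16 kHz audio, the following Python sketch converts the frame length and shift to samples, rounds the window up to a power of two (as round_to_power_of_two suggests), and counts frames under the assumption that snip_edges=true keeps only frames that fit entirely inside the signal. The function names are hypothetical and the frame-count formula is my reading of the behavior, not a copy of the Kaldi source.

```python
def frame_params(samp_freq=16000, frame_length_ms=25.0, frame_shift_ms=10.0):
    """Return (window size, frame shift, padded FFT size), all in samples."""
    window = int(round(samp_freq * frame_length_ms / 1000.0))
    shift = int(round(samp_freq * frame_shift_ms / 1000.0))
    # Round the window size up to the next power of two for the FFT.
    padded = 1
    while padded < window:
        padded *= 2
    return window, shift, padded


def num_frames_snip_edges(num_samples, window, shift):
    """Count frames that fit completely inside the signal (assumed snip_edges=true behavior)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


window, shift, padded = frame_params()
print(window, shift, padded)                         # 400 160 512
print(num_frames_snip_edges(16000, window, shift))   # 98 frames for 1 second of audio
```

With these defaults a one-second recording yields 98 frames rather than 100, because the last partial windows are dropped.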

Default options (filter bank)

(feat/mel-computations.h, struct MelBanksOptions)

struct MelBanksOptions {
    // (omitted)
    explicit MelBanksOptions(int num_bins = 25)
        : num_bins(num_bins),
          low_freq(20),
          high_freq(0),
          vtln_low(100),
          vtln_high(-500),
          debug_mel(false),
          htk_mode(false) {}

    void Register(OptionsItf *opts) {
    // (omitted)
    }
};

Config passed via the --config parameter (mfcc.conf)

--low-freq=0             # for 16kHz sampled speech.
--high-freq=8000         # for 16kHz sampled speech.
--window-type=hamming    # use a Hamming window
--use-energy=false       
--num-mel-bins=24
--htk-compat=true        # try to make it compatible with HTK
--dither=0
--remove_dc_offset=false
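
To get a feel for what --low-freq=0, --high-freq=8000 and --num-mel-bins=24 imply, here is a small Python sketch that places 24 filter centers evenly on the mel axis. It assumes the common mel formula 1127 * ln(1 + f / 700) and triangular filters whose centers are spaced uniformly in mel between the low and high cutoffs; the function names are hypothetical and this is an illustration, not the Kaldi implementation.

```python
import math


def hz_to_mel(f):
    """Convert Hz to mel (assumed formula: 1127 * ln(1 + f / 700))."""
    return 1127.0 * math.log(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)


def mel_bin_centers(num_bins=24, low_freq=0.0, high_freq=8000.0):
    """Center frequency (Hz) of each filter, spaced evenly on the mel axis."""
    mel_low = hz_to_mel(low_freq)
    mel_high = hz_to_mel(high_freq)
    # num_bins triangles need num_bins + 1 equal steps between the two cutoffs.
    delta = (mel_high - mel_low) / (num_bins + 1)
    return [mel_to_hz(mel_low + (i + 1) * delta) for i in range(num_bins)]


centers = mel_bin_centers()
for i, c in enumerate(centers, start=1):
    print(f"bin {i:2d}: center = {c:7.1f} Hz")
```

The centers are dense at low frequencies and sparse at high frequencies, which is the point of the mel scale.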

Kaldi applies dithering by default, but with dithering turned off, the results appear to match HTK's up through the FFT.
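
The reason dithering prevents an exact match can be sketched in a few lines of Python. This assumes dithering means adding Gaussian noise scaled by the dither value to every sample, which is my understanding of the option rather than something verified against the source here; apply_dither and the sample values are hypothetical.

```python
import random


def apply_dither(samples, dither, rng):
    """Add Gaussian noise scaled by `dither` to each sample (assumed behavior)."""
    if dither == 0.0:
        return list(samples)  # dithering off: samples pass through unchanged
    return [s + rng.gauss(0.0, 1.0) * dither for s in samples]


samples = [100.0, -250.0, 30.0, 0.0]  # hypothetical 16-bit sample values

print(apply_dither(samples, 0.0, random.Random(0)))  # identical to the input
print(apply_dither(samples, 1.0, random.Random(1)))  # slightly perturbed
print(apply_dither(samples, 1.0, random.Random(2)))  # perturbed differently
```

With a nonzero dither value the input to the FFT is no longer deterministic, so a bit-for-bit comparison against HTK only makes sense with --dither=0.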

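In the filter bank step that follows, HTK uses the amplitude (the square root of the power spectrum) whereas Kaldi uses the power spectrum, so the values diverge. I have not checked for differences in binning or in the weighting of low frequencies, but judging from the filter bank outputs, there probably are some.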
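
A toy Python example shows why the amplitude versus power choice is not just a constant scale factor after the log. The filter weights and spectrum values below are hypothetical, and filterbank_energy is an illustrative helper, not a function from either toolkit.

```python
import math


def filterbank_energy(weights, magnitudes, use_power):
    """Weighted sum over FFT bins of either |X| (amplitude) or |X|^2 (power)."""
    total = 0.0
    for w, m in zip(weights, magnitudes):
        total += w * (m * m if use_power else m)
    return total


weights = [0.25, 0.5, 1.0, 0.5, 0.25]    # hypothetical triangular filter
magnitudes = [1.0, 2.0, 4.0, 2.0, 1.0]   # hypothetical |FFT| values

amp = filterbank_energy(weights, magnitudes, use_power=False)
pow_ = filterbank_energy(weights, magnitudes, use_power=True)

print(amp, pow_)                       # 6.5 20.5
print(math.log(pow_), 2 * math.log(amp))
```

Because the summation happens before the log, log(power sum) is not simply 2 * log(amplitude sum), so the two toolkits' log filter bank outputs differ in shape, not only in scale.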

After filter bank processing (24 dimensions)

[Figure: Kaldi filter bank output]

(Reference) After HCopy's filter bank processing

[Figure: HCopy filter bank output]

After that, both toolkits move to the cepstral domain and apply a weighting (liftering), the same as HTK.
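
The cepstral weighting can be sketched as follows, assuming the widely used HTK-style lifter 1 + (L / 2) * sin(pi * n / L) with L = 22 (the cepstral_lifter default shown earlier) and n counted from C0. The helper names are hypothetical.

```python
import math


def lifter_coeffs(num_ceps=13, lifter=22.0):
    """Lifter weight for cepstral index n = 0 .. num_ceps - 1."""
    return [1.0 + 0.5 * lifter * math.sin(math.pi * n / lifter)
            for n in range(num_ceps)]


def apply_lifter(cepstra, lifter=22.0):
    """Scale each cepstral coefficient by its lifter weight (C0 first)."""
    coeffs = lifter_coeffs(len(cepstra), lifter)
    return [c * k for c, k in zip(cepstra, coeffs)]


coeffs = lifter_coeffs()
for n, k in enumerate(coeffs):
    print(f"C{n}: weight = {k:.3f}")
```

C0 is left unscaled (weight 1), while higher-order coefficients are boosted, peaking at a weight of 12 for n = 11.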

Convert the feature file to text.

copy-feats ark:mosi1.mfcc ark,t:mosi1_mfcc.txt

mosi1_mfcc.txt (first frame)

6.969446 -4.22486 2.13142 -5.186962 -8.526231 14.15422 6.308793 -9.442774 1.588198 -27.89196 -2.058122 -11.62059 89.52384
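
For comparing against HTK output it is handy to load the text archive into Python. The sketch below assumes the text layout is an utterance id followed by "[", one frame per line, and "]" closing the last frame; that layout is an assumption on my part, the parser name is hypothetical, and the sample values are made up for illustration.

```python
def parse_text_ark(lines):
    """Parse a text feature archive into {utterance_id: [[float, ...], ...]}."""
    feats = {}
    key = None
    rows = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if key is None:
            # Header line: "<utterance-id> ["
            key = tokens[0]
            tokens = tokens[1:]
            if tokens and tokens[0] == "[":
                tokens = tokens[1:]
        closing = bool(tokens) and tokens[-1] == "]"
        if closing:
            tokens = tokens[:-1]
        if tokens:
            rows.append([float(t) for t in tokens])
        if closing:
            feats[key] = rows
            key, rows = None, []
    return feats


# Hypothetical two-frame, three-dimensional example (not real MFCC values).
sample = """utt1  [
  1.0 2.0 3.0
  4.0 5.0 6.0 ]
"""
feats = parse_text_ark(sample.splitlines())
print(feats["utt1"][0])  # first frame
```

For the real file, passing an open file handle for mosi1_mfcc.txt to parse_text_ark should work the same way, provided the layout assumption holds.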

Comparing with the HTK result, the Kaldi features look easier to tell apart, at least visually in this plot.

[Figure: comparison of HTK and Kaldi MFCC features]

The option --htk-compat=true puts the output coefficients in the same order as HTK.

If true, put energy or C0 last and use a factor of sqrt(2) on C0.

HTK order

C1, C2, C3, ..., C12, C0

Kaldi order

C0, C1, C2, C3, ..., C12
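
If features were extracted without --htk-compat=true, the reordering can be done afterwards. The Python sketch below moves C0 from the front to the end and, optionally, applies the sqrt(2) factor mentioned in the quoted help text; the function names and the scale_c0 flag are hypothetical, and whether the scaling is wanted depends on how the features were computed.

```python
import math


def kaldi_to_htk_order(frame, scale_c0=False):
    """Move C0 from the front (Kaldi order) to the end (HTK order)."""
    c0 = frame[0] * (math.sqrt(2.0) if scale_c0 else 1.0)
    return list(frame[1:]) + [c0]


def htk_to_kaldi_order(frame):
    """Move the last element (C0 in HTK order) back to the front."""
    return [frame[-1]] + list(frame[:-1])


# Hypothetical frame whose values equal their cepstral index, to make the order visible.
kaldi_frame = [float(i) for i in range(13)]
htk_frame = kaldi_to_htk_order(kaldi_frame)
print(htk_frame)  # 1.0 through 12.0, then 0.0
```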
