Speech Recognition Notes (DeepSpeech), Part 2

Model structure

Tracing DeepSpeech's "create_model" function shows that the model consists of six layers:

Layer1: Dense ( + clipped RELU activation + dropout )
Layer2: Dense ( + clipped RELU activation + dropout )
Layer3: Dense ( + clipped RELU activation + dropout )
Layer4: LSTM
Layer5: Dense ( + clipped RELU activation + dropout )
Layer6: Dense
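
As a rough illustration of the "clipped ReLU" used in layers 1, 2, 3 and 5: it is an ordinary ReLU whose output is additionally capped at an upper bound. The sketch below is plain Python, not DeepSpeech code. The function name "clipped_relu" and the cap of 20.0 are assumptions chosen for illustration.

```python
def clipped_relu(x: float, clip: float = 20.0) -> float:
    """ReLU capped at `clip`, i.e. min(max(x, 0), clip).

    The default cap of 20.0 is an assumed value for this sketch.
    """
    return min(max(x, 0.0), clip)


# Negative inputs become 0, large inputs are capped, the rest pass through.
for value in (-3.0, 5.0, 42.0):
    print(value, "->", clipped_relu(value))
```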

The input shape is as follows.

# n_input : Number of MFCC features (Default 26)
# n_context : The number of frames in the context (Default 9)
[batch_size, n_steps, n_input + 2*n_input*n_context]
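
The last dimension can be worked out from the defaults in the comments above. A minimal sketch (the helper name "input_width" is my own, not a DeepSpeech function): each time step carries the current frame plus n_context frames on each side, so the width is n_input * (2 * n_context + 1).

```python
def input_width(n_input: int = 26, n_context: int = 9) -> int:
    """Features per time step: current frame plus n_context frames on each side."""
    return n_input + 2 * n_input * n_context


width = input_width()
# Same value written as "frames in the window" times "features per frame".
assert width == 26 * (2 * 9 + 1)
print(width)  # 494
```

With the defaults this gives 494, which matches the first dimension of layer_1/weights, (494, 2048), in the checkpoint listing further down.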

The LSTM in layer 4 uses the "tensorflow.contrib" module.

LSTMBlockFusedCell (on CPU; implemented in tensorflow/contrib/rnn/python/ops/lstm_ops.py)
CudnnLSTM (on GPU; implemented in tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py)

Running on TF 2.x

If the environment is TensorFlow 2.x, the "tensorflow.contrib" module is not available.

There is an issue about "LSTMBlockFusedCell" (still open at the time of writing):
Will LSTMBlockFusedCell be supported in tensorflow 2.0? · Issue #26642 · tensorflow/tensorflow · GitHub

According to that issue, "tf.raw_ops.BlockLSTM" looks like a possible replacement.
tf.raw_ops.BlockLSTM | TensorFlow Core v2.3.0

So I built a model with the LSTM swapped for "tf.raw_ops.BlockLSTM", loaded the parameters from the checkpoint, and ran inference.
The result: it did not work.

Where is the problem?

I concluded that the problem lies in the parameters stored in the checkpoint.
https://github.com/mozilla/DeepSpeech/releases/download/v0.7.0/deepspeech-0.7.0-checkpoint.tar.gz

For how to inspect them, see the following post.
TensorFlow notes (inspecting the contents of a checkpoint) - ichou1's blog

(Reference) Files included in "deepspeech-0.7.0-checkpoint.tar.gz"

-rw-r--r-- 1 ichou1 ichou1       329  4月 23 20:53 alphabet.txt
-rw-r--r-- 1 ichou1 ichou1 701047148  4月 23 19:33 best_dev-732522.data-00000-of-00001
-rw-r--r-- 1 ichou1 ichou1      1475  4月 23 19:33 best_dev-732522.index
-rw-r--r-- 1 ichou1 ichou1   8687981  4月 23 19:33 best_dev-732522.meta
-rw-r--r-- 1 ichou1 ichou1        87  4月 23 19:33 best_dev_checkpoint
-rw-r--r-- 1 ichou1 ichou1        87  4月 23 20:51 checkpoint

Layer1

-----------
[param name]:  layer_1/bias
[param shape]:  (2048,)
-----------
[param name]:  layer_1/bias/Adam
[param shape]:  (2048,)
-----------
[param name]:  layer_1/bias/Adam_1
[param shape]:  (2048,)
-----------
[param name]:  layer_1/weights
[param shape]:  (494, 2048)
-----------
[param name]:  layer_1/weights/Adam
[param shape]:  (494, 2048)
-----------
[param name]:  layer_1/weights/Adam_1
[param shape]:  (494, 2048)
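
The "/Adam" and "/Adam_1" entries are optimizer slots that share the shape of the variable they belong to. To look only at the model parameters, they can be filtered out by name. The sketch below is plain Python over (name, shape) pairs copied from the listing above; the helper names are my own, and the suffix rule is an assumption based on the names seen in this checkpoint.

```python
from typing import List, Tuple

Variable = Tuple[str, Tuple[int, ...]]

# Entries copied from the Layer1 listing.
LAYER1_VARS: List[Variable] = [
    ("layer_1/bias", (2048,)),
    ("layer_1/bias/Adam", (2048,)),
    ("layer_1/bias/Adam_1", (2048,)),
    ("layer_1/weights", (494, 2048)),
    ("layer_1/weights/Adam", (494, 2048)),
    ("layer_1/weights/Adam_1", (494, 2048)),
]


def is_optimizer_slot(name: str) -> bool:
    """True for names that look like Adam slot variables (assumed suffix rule)."""
    return name.endswith("/Adam") or name.endswith("/Adam_1")


def model_variables(variables: List[Variable]) -> List[Variable]:
    """Keep only the entries that are not optimizer slots."""
    return [(name, shape) for name, shape in variables if not is_optimizer_slot(name)]


for name, shape in model_variables(LAYER1_VARS):
    print(name, shape)
```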

Layer2, 3

Optimizer slots are omitted.

-----------
[param name]:  layer_2/bias
[param shape]:  (2048,)
-----------
[param name]:  layer_2/weights
[param shape]:  (2048, 2048)
-----------
[param name]:  layer_3/bias
[param shape]:  (2048,)
-----------
[param name]:  layer_3/weights
[param shape]:  (2048, 2048)

Layer5, 6

Optimizer slots are omitted.

-----------
[key name]:  layer_5/bias
[weight]:  (2048,)
-----------
[key name]:  layer_5/weights
[weight]:  (2048, 2048)
-----------
[key name]:  layer_6/bias
[weight]:  (29,)
-----------
[key name]:  layer_6/weights
[weight]:  (2048, 29)
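
As a sanity check on the dense layers, the weight shapes listed so far should chain together: each layer's input width has to equal the previous layer's output width. The sketch below uses only the shapes from the listings. The width of the LSTM that sits between layer 3 and layer 5 is an assumption (2048), and the function name is my own.

```python
from typing import Dict, Tuple

# (input width, output width) of each dense layer, copied from the listings.
DENSE_WEIGHTS: Dict[str, Tuple[int, int]] = {
    "layer_1": (494, 2048),
    "layer_2": (2048, 2048),
    "layer_3": (2048, 2048),
    "layer_5": (2048, 2048),
    "layer_6": (2048, 29),
}


def chain_is_consistent(weights: Dict[str, Tuple[int, int]], lstm_units: int) -> bool:
    """Check that every layer's input width matches what the previous layer emits."""
    width = weights["layer_1"][0]  # width of the network input
    for name in ("layer_1", "layer_2", "layer_3", "layer_5", "layer_6"):
        if name == "layer_5":
            # The LSTM (layer 4) sits here and emits `lstm_units` features.
            width = lstm_units
        n_in, n_out = weights[name]
        if n_in != width:
            return False
        width = n_out
    return True


print(chain_is_consistent(DENSE_WEIGHTS, lstm_units=2048))
```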

Layer4 (LSTM)

Optimizer slots are included here.

The following appear to belong to the "CudnnLSTM" layer used on GPU.

-----------
[param name]:  cudnn_lstm/opaque_kernel
[param shape]:  (33570816,)
-----------
[param name]:  cudnn_lstm/opaque_kernel/Adam
[param shape]:  (33570816,)
-----------
[param name]:  cudnn_lstm/opaque_kernel/Adam_1
[param shape]:  (33570816,)

The following appear to belong to the "LSTMBlockFusedCell" layer used on CPU, but they have no
optimizer slots, so I suspect they were not updated by training.

-----------
[param name]:  cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
[param shape]:  (8192,)
-----------
[param name]:  cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
[param shape]:  (4096, 8192)
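
The sizes in the two LSTM listings can be cross-checked with a little arithmetic. This sketch assumes a single-layer LSTM with 2048 inputs and 2048 units and four gates. It also assumes that the cuDNN layout stores two bias vectors per gate (one for the input side, one for the recurrent side), while the canonical kernel/bias layout stores one. Under those assumptions the numbers reproduce both listings.

```python
from typing import Tuple

N_INPUT = 2048  # assumed width of the LSTM input (output of layer 3)
N_UNITS = 2048  # assumed number of LSTM units
N_GATES = 4     # input, forget, cell candidate, output


def canonical_shapes(n_input: int, n_units: int) -> Tuple[Tuple[int, int], Tuple[int]]:
    """Shapes of a fused kernel and bias: [input; hidden] rows, one column block per gate."""
    kernel = (n_input + n_units, N_GATES * n_units)
    bias = (N_GATES * n_units,)
    return kernel, bias


def cudnn_opaque_size(n_input: int, n_units: int) -> int:
    """Element count of a flat buffer holding all weights plus two bias sets (assumed layout)."""
    weights = N_GATES * n_units * (n_input + n_units)
    biases = 2 * N_GATES * n_units
    return weights + biases


print(canonical_shapes(N_INPUT, N_UNITS))   # ((4096, 8192), (8192,))
print(cudnn_opaque_size(N_INPUT, N_UNITS))  # 33570816
```

The flat buffer is larger than kernel plus bias by exactly 8192 elements, which is the size of the second bias set under the assumption above.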

Checkpoints are discussed in the release notes for version 0.6.0.
https://github.com/mozilla/DeepSpeech/releases
https://github.com/mozilla/DeepSpeech/releases?after=v0.6.1-alpha.0

Checkpoints - With TF 1.14, we have added CuDNN RNN support to our training graph, which improves training performance significantly.
(omitted)
The required training graph changes breaks loading older checkpoints, due to differences in the computation performed by CudnnLSTM.
Note that mixing CuDNN and non-CuDNN checkpoints requires some care: (rest omitted)

To reuse the trained parameters for fine-tuning (transfer learning), it looks like the model has to be built with the (GPU) "CudnnLSTM" layer.