Speech synthesis notes (Tacotron2, part 2)
Let's look at what kind of processing the model performs internally.
Figure: architecture diagram from the paper (Encoder/Decoder outlines added by me).
Let's print a model summary using torchsummaryX.
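As background for reading the listings below: the "layer configuration" sections are simply each module's repr, which plain PyTorch already provides via print(model); torchsummaryX's summary() additionally traces a forward pass to report output shapes. A minimal sketch with a toy module (the module here is hypothetical, chosen to mirror one encoder block):

```python
import torch.nn as nn

# print(model) yields the nested repr format shown in the listings below.
toy = nn.Sequential(
    nn.Conv1d(512, 512, kernel_size=5, stride=1, padding=2),
    nn.BatchNorm1d(512),
)
print(toy)
# Sequential(
#   (0): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
#   (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
# )
```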
Layer configuration
Embedding
(embedding): Embedding(148, 512)
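The embedding maps each of the 148 input symbol IDs to a 512-dimensional vector. A sketch with a hypothetical batch of token IDs:

```python
import torch
import torch.nn as nn

# 148 input symbols (characters/phonemes), each mapped to a 512-dim vector.
embedding = nn.Embedding(148, 512)

# Hypothetical batch: batch_size=2, max_len=10.
tokens = torch.randint(0, 148, (2, 10))
out = embedding(tokens)
print(out.shape)  # torch.Size([2, 10, 512]) -> [batch_size, max_len, 512]
```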
Encoder
「BatchNorm1d」レイヤに関しては、1行におさまらないので"affine=True, track_running_stats=True"を省略。
(encoder): Encoder(
  (convolutions): ModuleList(
    (0): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
      )
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, ...)
    )
    (1): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
      )
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, ...)
    )
    (2): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
      )
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, ...)
    )
  )
  (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
)
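The encoder structure above can be sketched with plain PyTorch modules: three Conv1d+BatchNorm1d blocks followed by a bidirectional LSTM whose 256 units per direction concatenate back to 512. This is a simplified sketch (the actual implementation also applies dropout after each block, and packs padded sequences before the LSTM; both are omitted here):

```python
import torch
import torch.nn as nn

# Three conv blocks, each keeping the channel width at 512.
convs = nn.ModuleList([
    nn.Sequential(
        nn.Conv1d(512, 512, kernel_size=5, stride=1, padding=2),
        nn.BatchNorm1d(512),
        nn.ReLU(),
    )
    for _ in range(3)
])
# Bidirectional LSTM: 256 units per direction -> 512-dim output.
lstm = nn.LSTM(512, 256, batch_first=True, bidirectional=True)

x = torch.randn(2, 512, 10)  # embedded text, transposed to [batch, channels, max_len]
for conv in convs:
    x = conv(x)
out, _ = lstm(x.transpose(1, 2))  # LSTM expects [batch, max_len, channels]
print(out.shape)  # torch.Size([2, 10, 512])
```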
Decoder
(decoder): Decoder(
  (prenet): Prenet(
    (layers): ModuleList(
      (0): LinearNorm(
        (linear_layer): Linear(in_features=80, out_features=256, bias=False)
      )
      (1): LinearNorm(
        (linear_layer): Linear(in_features=256, out_features=256, bias=False)
      )
    )
  )
  (attention_rnn): LSTMCell(768, 1024)
  (attention_layer): Attention(
    (query_layer): LinearNorm(
      (linear_layer): Linear(in_features=1024, out_features=128, bias=False)
    )
    (memory_layer): LinearNorm(
      (linear_layer): Linear(in_features=512, out_features=128, bias=False)
    )
    (v): LinearNorm(
      (linear_layer): Linear(in_features=128, out_features=1, bias=False)
    )
    (location_layer): LocationLayer(
      (location_conv): ConvNorm(
        (conv): Conv1d(2, 32, kernel_size=(31,), stride=(1,), padding=(15,), bias=False)
      )
      (location_dense): LinearNorm(
        (linear_layer): Linear(in_features=32, out_features=128, bias=False)
      )
    )
  )
  (decoder_rnn): LSTMCell(1536, 1024, bias=1)
  (linear_projection): LinearNorm(
    (linear_layer): Linear(in_features=1536, out_features=80, bias=True)
  )
  (gate_layer): LinearNorm(
    (linear_layer): Linear(in_features=1536, out_features=1, bias=True)
  )
)
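The decoder's input widths can be read directly off the dump above: attention_rnn consumes the prenet output (256) concatenated with the attention context (512), giving 768; decoder_rnn and both output projections consume the 1024-dim RNN state concatenated with the context, giving 1536. A single decoder step, sketched with hypothetical tensors (the attention computation that actually produces the context vector is omitted, and real state is carried between steps):

```python
import torch
import torch.nn as nn

B = 2  # hypothetical batch size
prenet_out = torch.randn(B, 256)  # prenet output for the previous mel frame
context = torch.randn(B, 512)     # attention context over encoder outputs

attention_rnn = nn.LSTMCell(768, 1024)    # 768 = 256 (prenet) + 512 (context)
decoder_rnn = nn.LSTMCell(1536, 1024)     # 1536 = 1024 (attention_rnn) + 512 (context)
linear_projection = nn.Linear(1536, 80)   # -> one 80-dim mel frame
gate_layer = nn.Linear(1536, 1)           # -> stop-token logit

h_att, _ = attention_rnn(torch.cat([prenet_out, context], dim=-1))
h_dec, _ = decoder_rnn(torch.cat([h_att, context], dim=-1))
dec_and_ctx = torch.cat([h_dec, context], dim=-1)
mel_frame = linear_projection(dec_and_ctx)
stop = gate_layer(dec_and_ctx)
print(mel_frame.shape, stop.shape)  # torch.Size([2, 80]) torch.Size([2, 1])
```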
Postnet
(postnet): Postnet(
  (convolutions): ModuleList(
    (0): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(80, 512, kernel_size=(5,), stride=(1,), padding=(2,))
      )
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, ...)
    )
    (1): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
      )
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, ...)
    )
    (2): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
      )
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, ...)
    )
    (3): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
      )
      (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, ...)
    )
    (4): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(512, 80, kernel_size=(5,), stride=(1,), padding=(2,))
      )
      (1): BatchNorm1d(80, eps=1e-05, momentum=0.1, ...)
    )
  )
)
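The postnet is a residual refinement stack over the decoder's mel output, with channel widths 80 → 512 → 512 → 512 → 512 → 80. A sketch of that structure (the paper applies tanh after every block except the last; dropout is omitted here):

```python
import torch
import torch.nn as nn

# Five Conv1d+BatchNorm1d blocks; tanh on all but the final block.
channels = [80, 512, 512, 512, 512, 80]
layers = []
for i in range(5):
    layers.append(nn.Conv1d(channels[i], channels[i + 1], kernel_size=5, padding=2))
    layers.append(nn.BatchNorm1d(channels[i + 1]))
    if i < 4:
        layers.append(nn.Tanh())
postnet = nn.Sequential(*layers)

mel = torch.randn(2, 80, 869)  # [batch, mel_channels, frames]
residual = postnet(mel)        # same shape; added to the decoder's mel output
print(residual.shape)  # torch.Size([2, 80, 869])
```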
Output shape of each layer
Embedding
Layer                                    Output Shape
--------------------------------------   --------------------------
0_embedding                              [batch_size, max_len, 512]
Encoder
Layer                                    Output Shape
--------------------------------------   --------------------------
1_encoder.convolutions.0.0.Conv1d_conv   [batch_size, 512, max_len]
2_encoder.convolutions.0.BatchNorm1d_1   [batch_size, 512, max_len]
3_encoder.convolutions.1.0.Conv1d_conv   [batch_size, 512, max_len]
4_encoder.convolutions.1.BatchNorm1d_1   [batch_size, 512, max_len]
5_encoder.convolutions.2.0.Conv1d_conv   [batch_size, 512, max_len]
6_encoder.convolutions.2.BatchNorm1d_1   [batch_size, 512, max_len]
7_encoder.LSTM_lstm                      [batch_size, 512]
Decoder
In the following, assume batch_size is 64 and max_len is 164.
Layer                                                                         Output Shape
----------------------------------------------------------------------------  --------------
8_decoder.prenet.layers.0.Linear_linear_layer                                 [870, 64, 256]
9_decoder.prenet.layers.1.Linear_linear_layer                                 [870, 64, 256]
10_decoder.attention_layer.memory_layer.Linear_linear_layer                   [64, 1, 128]
Layer                                                                         Output Shape
----------------------------------------------------------------------------  --------------
11_decoder.LSTMCell_attention_rnn                                             [64, 1024]
12_decoder.attention_layer.query_layer.Linear_linear_layer                    [64, 1, 128]
13_decoder.attention_layer.location_layer.location_conv.Conv1d_conv           [64, 32, 1]
14_decoder.attention_layer.location_layer.location_dense.Linear_linear_layer  [64, 1, 128]
15_decoder.attention_layer.v.Linear_linear_layer                              [64, 1, 1]
16_decoder.LSTMCell_decoder_rnn                                               [64, 1024]
17_decoder.linear_projection.Linear_linear_layer                              [64, 80]
18_decoder.gate_layer.Linear_linear_layer                                     [64, 1]
Taking the above as one block (one decoder step), the pattern repeats for 869 blocks in total; the final block is:
Layer                                                                         Output Shape
----------------------------------------------------------------------------  --------------
6955_decoder.LSTMCell_attention_rnn                                           [64, 1024]
6956_decoder.attention_layer.query_layer.Linear_linear_layer                  [64, 1, 128]
6957_decoder.attention_layer.location_layer.location_conv.Conv1d_conv         [64, 32, 1]
6958_decoder.attention_layer.location_layer.location_dense.Linear_linear_...  [64, 1, 128]
6959_decoder.attention_layer.v.Linear_linear_layer                            [64, 1, 1]
6960_decoder.LSTMCell_decoder_rnn                                             [64, 1024]
6961_decoder.linear_projection.Linear_linear_layer                            [64, 80]
6962_decoder.gate_layer.Linear_linear_layer                                   [64, 1]
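As a sanity check on the numbering: each decoder step contributes 8 summary entries (11-18 for the first step), so 869 steps starting at entry 11 end exactly at entry 6962, the final gate_layer line above.

```python
# Each decoder step spans 8 summary entries; the first step covers entries 11-18.
start, entries_per_step, steps = 11, 8, 869
last_entry = start - 1 + entries_per_step * steps
print(last_entry)  # 6962
```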
Postnet
Layer                                      Output Shape
-----------------------------------------  ----------------------
6963_postnet.convolutions.0.0.Conv1d_conv  [batch_size, 512, 869]
6964_postnet.convolutions.0.BatchNorm1d_1  [batch_size, 512, 869]
6965_postnet.convolutions.1.0.Conv1d_conv  [batch_size, 512, 869]
6966_postnet.convolutions.1.BatchNorm1d_1  [batch_size, 512, 869]
6967_postnet.convolutions.2.0.Conv1d_conv  [batch_size, 512, 869]
6968_postnet.convolutions.2.BatchNorm1d_1  [batch_size, 512, 869]
6969_postnet.convolutions.3.0.Conv1d_conv  [batch_size, 512, 869]
6970_postnet.convolutions.3.BatchNorm1d_1  [batch_size, 512, 869]
6971_postnet.convolutions.4.0.Conv1d_conv  [batch_size, 80, 869]
6972_postnet.convolutions.4.BatchNorm1d_1  [batch_size, 80, 869]