
Keras notes (BERT, part 4)

Continuing from the previous post, let's take a look at the FeedForward layer that makes up the Transformer.

An excerpt from the paper "Attention Is All You Need":

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
This consists of two linear transformations with a ReLU activation in between.

The layer computes the following linear ("Linear") transformations:



\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2


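As a minimal sketch of this formula (assuming plain NumPy; the shapes d_model=768, d_ff=3072 match the BERT-base configuration shown in the summary below):

import numpy as np

# Minimal sketch of FFN(x) = max(0, xW1 + b1) W2 + b2.
seq_len, d_model, d_ff = 512, 768, 3072

x  = np.random.randn(seq_len, d_model)      # token vectors for one sequence
W1 = np.random.randn(d_model, d_ff) * 0.02
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.02
b2 = np.zeros(d_model)

hidden  = np.maximum(0.0, x @ W1 + b1)      # ReLU(xW1 + b1), applied to each position separately
ffn_out = hidden @ W2 + b2                  # second linear transformation

print(ffn_out.shape)                        # (512, 768) -- same shape as the input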
The weights inside the model are as follows (full source):

keras_position_wise_feed_forward/feed_forward.py
class FeedForward(keras.layers.Layer):

    def __init__(self,
                 units,
                 activation='relu',
                 use_bias=True,
                 kernel_initializer='glorot_normal',
                 bias_initializer='zeros',
                 ...

    def build(self, input_shape):
        feature_dim = int(input_shape[-1])
        # First linear transformation: feature_dim -> units
        self.W1 = self.add_weight(shape=(feature_dim, self.units), ...)
        self.b1 = self.add_weight(shape=(self.units,), ...)
        # Second linear transformation: units -> feature_dim
        self.W2 = self.add_weight(shape=(self.units, feature_dim), ...)
        self.b2 = self.add_weight(shape=(feature_dim,), ...)
model.summary()
________________________________________________________________
Layer (type)                         Output Shape        Param #
================================================================
Encoder-FeedForward (FeedForward)    (None, 512, 768)    4722432
================================================================
# W1 : (768, 3072)
# b1 : (3072, )
# W2 : (3072, 768)
# b2 : (768, )
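
As a usage sketch, the parameter count above can be reproduced as follows (assuming the keras-position-wise-feed-forward package that the excerpt comes from; the layer name is just for illustration):

import keras
from keras_position_wise_feed_forward import FeedForward

inputs = keras.layers.Input(shape=(512, 768))   # (seq_len, d_model)
# `name` assumes the layer forwards standard Layer kwargs (elided above as "...")
outputs = FeedForward(units=3072, name='Encoder-FeedForward')(inputs)
model = keras.models.Model(inputs, outputs)
model.summary()
# Param #: 768*3072 + 3072 + 3072*768 + 768 = 4,722,432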



While the linear transformations are the same across different positions, they use different parameters from layer to layer.

Each set of these weights is different in each of the "transformer_num" layers.
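
A rough sketch, continuing from the snippet above (the attention sub-layer is omitted): stacking the blocks in a loop creates a separate FeedForward instance, and therefore a separate set of W1/b1/W2/b2, for each layer.

transformer_num = 12   # e.g. BERT-base has 12 encoder layers

inputs = keras.layers.Input(shape=(512, 768))
x = inputs
for i in range(transformer_num):
    # ... attention sub-layer omitted ...
    x = FeedForward(units=3072, name='Encoder-%d-FeedForward' % (i + 1))(x)
# Each iteration creates a new FeedForward instance,
# so W1/b1/W2/b2 are not shared across layers.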



The paper uses the term "linear transformations", and I was unsure how that differs from "affine", so I'm leaving a note here.
What is the difference between linear and affine function - Mathematics Stack Exchange

Roughly speaking, "Linear" keeps the origin fixed (no bias), while "Affine" allows the origin to move (with bias).
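
To make the distinction concrete, a linear map always sends the origin to the origin, while an affine map shifts it by the bias term:

f_{\mathrm{linear}}(x) = xW \quad\Rightarrow\quad f_{\mathrm{linear}}(0) = 0

f_{\mathrm{affine}}(x) = xW + b \quad\Rightarrow\quad f_{\mathrm{affine}}(0) = b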