Keras notes (BERT, part 4)
Continuing from the previous post. This time, let's look at the FeedForward layer that makes up the Transformer.
An excerpt from the paper "Attention Is All You Need":
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
This consists of two linear transformations with a ReLU activation in between.
It computes the following pair of linear transformations (from the paper):

FFN(x) = max(0, xW1 + b1)W2 + b2
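As a sanity check, here is a minimal NumPy sketch of that computation (my own illustration, not library code; the toy shapes are arbitrary):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: the same two transforms applied to every position."""
    h = np.maximum(0.0, x @ W1 + b1)  # first transform + ReLU
    return h @ W2 + b2                # second transform

# Toy shapes (BERT-base actually uses feature_dim=768, units=3072)
seq_len, feature_dim, units = 3, 4, 8
rng = np.random.default_rng(0)
x  = rng.normal(size=(seq_len, feature_dim))
W1 = rng.normal(size=(feature_dim, units)); b1 = np.zeros(units)
W2 = rng.normal(size=(units, feature_dim)); b2 = np.zeros(feature_dim)
print(ffn(x, W1, b1, W2, b2).shape)  # (3, 4) -- output shape matches input
```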
The weights in the model are as follows (full source):
keras_position_wise_feed_forward/feed_forward.py
```python
class FeedForward(keras.layers.Layer):

    def __init__(self,
                 units,
                 activation='relu',
                 use_bias=True,
                 kernel_initializer='glorot_normal',
                 bias_initializer='zeros',
                 ...

    def build(self, input_shape):
        feature_dim = int(input_shape[-1])
        self.W1 = self.add_weight(shape=(feature_dim, self.units), ...)
        self.b1 = self.add_weight(shape=(self.units,), ...)
        self.W2 = self.add_weight(shape=(self.units, feature_dim), ...)
        self.b2 = self.add_weight(shape=(feature_dim,), ...)
```
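The call() method is elided above. Judging from these weights, the forward pass should look roughly like this (a hedged sketch of what the library has to do; the actual call() also handles dropout and may differ in detail):

```python
import keras.backend as K

def call(self, x):
    # First linear transformation + ReLU
    h = K.dot(x, self.W1)                 # (batch, seq, units)
    if self.use_bias:
        h = K.bias_add(h, self.b1)
    if self.activation is not None:
        h = self.activation(h)
    # Second linear transformation back to the input width
    y = K.dot(h, self.W2)                 # (batch, seq, feature_dim)
    if self.use_bias:
        y = K.bias_add(y, self.b2)
    return y
```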
model.summary
```
_________________________________________________________________
Layer (type)                       Output Shape         Param #
=================================================================
Encoder-FeedForward (FeedForward)  (None, 512, 768)     4722432
=================================================================

# W1 : (768, 3072)
# b1 : (3072,)
# W2 : (3072, 768)
# b2 : (768,)
```
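The Param # can be reproduced from these four shapes:

```python
feature_dim, units = 768, 3072
params = (feature_dim * units + units            # W1 + b1
          + units * feature_dim + feature_dim)   # W2 + b2
print(params)  # 4722432 -- matches Param # above
```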
While the linear transformations are the same across different positions, they use different parameters from layer to layer.
Each of these weights is different in each of the "transformer_num" layers.
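In other words, building the encoder stack means instantiating a separate FeedForward per layer, something like the sketch below (layer names and the transformer_num value here are hypothetical, not taken from the actual BERT-building code):

```python
from keras_position_wise_feed_forward import FeedForward

transformer_num = 12  # e.g. BERT-base
feed_forwards = [
    FeedForward(units=3072, name='Encoder-%d-FeedForward' % (i + 1))
    for i in range(transformer_num)
]
# Each instance builds its own W1/b1/W2/b2, so no weights are shared
# between layers.
```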
The paper uses the term "linear transformations," and I wasn't sure how that differs from "affine," so I'm leaving a note here.
What is the difference between linear and affine function - Mathematics Stack Exchange
Roughly speaking, "linear" keeps the origin fixed (no bias), while "affine" allows the origin to shift (with bias).
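A two-line check of that distinction: a linear map always sends the origin to the origin, while an affine map shifts it by the bias:

```python
import numpy as np

W = np.array([[2.0, 0.0],
              [0.0, 3.0]])
b = np.array([1.0, -1.0])

linear = lambda x: W @ x       # no bias
affine = lambda x: W @ x + b   # with bias

origin = np.zeros(2)
print(linear(origin))  # [0. 0.]   -- the origin stays put
print(affine(origin))  # [ 1. -1.] -- the origin moves to b
```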