Keras notes (BERT, part 4)
Continuing from the previous post. This time, let's look at the FeedForward layer that makes up the Transformer.
An excerpt from the paper "Attention Is All You Need":
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
This consists of two linear transformations with a ReLU activation in between.
It computes the following pair of linear transformations (from the paper):

FFN(x) = max(0, xW1 + b1)W2 + b2
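As a sanity check, here is a minimal NumPy sketch of that computation (my own illustration, not library code; the toy shapes are arbitrary):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: the same two transforms applied to every position."""
    h = np.maximum(0.0, x @ W1 + b1)  # first transform + ReLU
    return h @ W2 + b2                # second transform

# Toy shapes (BERT-base actually uses feature_dim=768, units=3072)
seq_len, feature_dim, units = 3, 4, 8
rng = np.random.default_rng(0)
x  = rng.normal(size=(seq_len, feature_dim))
W1 = rng.normal(size=(feature_dim, units)); b1 = np.zeros(units)
W2 = rng.normal(size=(units, feature_dim)); b2 = np.zeros(feature_dim)
print(ffn(x, W1, b1, W2, b2).shape)  # (3, 4) -- output shape matches input
```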
The weights in the model are as follows (full source):
keras_position_wise_feed_forward/feed_forward.py
```python
class FeedForward(keras.layers.Layer):

    def __init__(self,
                 units,
                 activation='relu',
                 use_bias=True,
                 kernel_initializer='glorot_normal',
                 bias_initializer='zeros',
                 ...

    def build(self, input_shape):
        feature_dim = int(input_shape[-1])
        self.W1 = self.add_weight(shape=(feature_dim, self.units), ...)
        self.b1 = self.add_weight(shape=(self.units,), ...)
        self.W2 = self.add_weight(shape=(self.units, feature_dim), ...)
        self.b2 = self.add_weight(shape=(feature_dim,), ...)
```
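The call() method is elided above. Judging from these weights, the forward pass should look roughly like this (a hedged sketch of what the library has to do; the actual call() also handles dropout and may differ in detail):

```python
import keras.backend as K

def call(self, x):
    # First linear transformation + ReLU
    h = K.dot(x, self.W1)                 # (batch, seq, units)
    if self.use_bias:
        h = K.bias_add(h, self.b1)
    if self.activation is not None:
        h = self.activation(h)
    # Second linear transformation back to the input width
    y = K.dot(h, self.W2)                 # (batch, seq, feature_dim)
    if self.use_bias:
        y = K.bias_add(y, self.b2)
    return y
```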
model.summary
```
_________________________________________________________________
Layer (type)                       Output Shape         Param #
=================================================================
Encoder-FeedForward (FeedForward)  (None, 512, 768)     4722432
=================================================================

# W1 : (768, 3072)
# b1 : (3072,)
# W2 : (3072, 768)
# b2 : (768,)
```
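The Param # can be reproduced from these four shapes:

```python
feature_dim, units = 768, 3072
params = (feature_dim * units + units            # W1 + b1
          + units * feature_dim + feature_dim)   # W2 + b2
print(params)  # 4722432 -- matches Param # above
```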
While the linear transformations are the same across different positions, they use different parameters from layer to layer.
Each of these weights is different in each of the "transformer_num" layers.
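In other words, building the encoder stack means instantiating a separate FeedForward per layer, something like the sketch below (layer names and the transformer_num value here are hypothetical, not taken from the actual BERT-building code):

```python
from keras_position_wise_feed_forward import FeedForward

transformer_num = 12  # e.g. BERT-base
feed_forwards = [
    FeedForward(units=3072, name='Encoder-%d-FeedForward' % (i + 1))
    for i in range(transformer_num)
]
# Each instance builds its own W1/b1/W2/b2, so no weights are shared
# between layers.
```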
The paper uses the term "linear transformations," and I wasn't sure how that differs from "affine," so I'm leaving a note here.
What is the difference between linear and affine function - Mathematics Stack Exchange
Roughly speaking, "linear" keeps the origin fixed (no bias), while "affine" allows the origin to shift (with bias).
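A two-line check of that distinction: a linear map always sends the origin to the origin, while an affine map shifts it by the bias:

```python
import numpy as np

W = np.array([[2.0, 0.0],
              [0.0, 3.0]])
b = np.array([1.0, -1.0])

linear = lambda x: W @ x       # no bias
affine = lambda x: W @ x + b   # with bias

origin = np.zeros(2)
print(linear(origin))  # [0. 0.]   -- the origin stays put
print(affine(origin))  # [ 1. -1.] -- the origin moves to b
```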