Layer Normalization was proposed by Jimmy Ba et al. in 2016. Here is the paper:
https://arxiv.org/abs/1607.06450
In this tutorial, we will introduce it for machine learning beginners.
Layer Normalization in Neural Networks
Layer Normalization is commonly used right now, for example, in multi-head attention networks.
Layer Normalization is applied in each layer of the network.
What is Layer Normalization?
Layer Normalization can be viewed as a function LN(·) applied to each input vector x_i:

y_i = LN(x_i)
In neural networks, the output of the l-th layer can be computed as:

a^l = W^l h^l
h^{l+1} = f(a^l + b^l)

where W^l is the weight matrix of the l-th layer, b^l is the bias vector, h^l is the input to the l-th layer, a^l is the vector of summed inputs (pre-activations), and f is the activation function.
In order to normalize the l-th layer, we can normalize a^l as follows:

μ^l = (1 / H) * Σ_{i=1..H} a_i^l
σ^l = sqrt( (1 / H) * Σ_{i=1..H} (a_i^l - μ^l)^2 )
h^{l+1} = f( g^l ⊙ (a^l - μ^l) / sqrt((σ^l)^2 + ε) + b^l )

where H denotes the number of hidden units in the layer, a_i^l is the i-th component of a^l, ε is a small constant for numerical stability (for example 0 or 1e-12), g^l is a gain parameter, and ⊙ denotes element-wise multiplication between two vectors.
You should notice: g^l can be omitted if you do not want to rescale the normalized output.
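To make the formula concrete, here is a minimal NumPy sketch of this computation; the function name layer_norm and its default arguments are illustrative choices, not part of the original paper.

```python
import numpy as np

# A minimal sketch of layer normalization for one layer, assuming the
# pre-activations a^l are stored in a 1-D array of length H.
def layer_norm(a, g=None, b=None, eps=1e-12):
    mu = a.mean()                                   # mean over the H hidden units
    sigma = np.sqrt(((a - mu) ** 2).mean() + eps)   # std over the H hidden units
    a_hat = (a - mu) / sigma                        # normalized pre-activations
    if g is not None:
        a_hat = g * a_hat                           # optional gain g^l (element-wise)
    if b is not None:
        a_hat = a_hat + b                           # optional bias b^l
    return a_hat

# Example: a layer with H = 4 hidden units
a = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(a))  # roughly zero mean and unit variance across the 4 units
```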
Layer Normalization in RNN
In an RNN, the summed inputs at the t-th time step are a^t = W_hh h^{t-1} + W_xh x^t, and they can be normalized as:

μ^t = (1 / H) * Σ_{i=1..H} a_i^t
σ^t = sqrt( (1 / H) * Σ_{i=1..H} (a_i^t - μ^t)^2 )
h^t = f( g ⊙ (a^t - μ^t) / σ^t + b )

where the mean μ^t and standard deviation σ^t are computed over the H hidden units at that time step, just as in the feed-forward case.
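As a sketch, the same normalization can be dropped into one step of a vanilla tanh RNN; the helper names below (rnn_step_ln, W_xh, W_hh, g, b) are assumptions made for illustration.

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-12):
    mu = a.mean()
    sigma = np.sqrt(((a - mu) ** 2).mean() + eps)
    return g * (a - mu) / sigma + b

# One RNN time step with layer normalization applied to the pre-activations a^t.
def rnn_step_ln(x_t, h_prev, W_xh, W_hh, g, b):
    a_t = W_hh @ h_prev + W_xh @ x_t        # summed inputs at time step t
    return np.tanh(layer_norm(a_t, g, b))   # normalize over the H units, then activate

# Example: H = 3 hidden units, D = 2 input features
rng = np.random.default_rng(0)
H, D = 3, 2
h_t = rnn_step_ln(rng.normal(size=D), np.zeros(H),
                  rng.normal(size=(H, D)), rng.normal(size=(H, H)),
                  np.ones(H), np.zeros(H))
print(h_t)
```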
How to implement layer normalization in TensorFlow?
In TensorFlow 1.x, we can use tf.contrib.layers.layer_norm() to implement it. In TensorFlow 2.x, tf.contrib has been removed, and the built-in equivalent is tf.keras.layers.LayerNormalization.
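Here is a minimal usage sketch with TensorFlow 2.x's built-in tf.keras.layers.LayerNormalization; the example input values are arbitrary.

```python
import tensorflow as tf

# A batch of 2 examples, each with 4 features. The last axis is normalized
# independently for each example, giving roughly zero mean and unit variance.
x = tf.constant([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0]])

ln = tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-12)
y = ln(x)
print(y.numpy())
```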