In this tutorial, we will introduce post-norm and pre-norm residual units, which are often used to improve transformer models in deep learning. You can find more detail in the paper Learning Deep Transformer Models for Machine Translation.
Post-Norm
Post-Norm is defined as:

x_{l+1} = LN(x_l + F(x_l))

Here x_l is the input of the l-th residual unit, F() is the sub-layer function (for example, self-attention or a feed-forward network), and x_{l+1} is the output.
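As a minimal PyTorch sketch of this definition (the class name PostNormResidual and the sublayer argument are just for illustration, not from the paper):

```python
import torch
import torch.nn as nn

class PostNormResidual(nn.Module):
    """Post-norm residual unit: x_{l+1} = LN(x_l + F(x_l))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer           # F(), e.g. self-attention or feed-forward
        self.norm = nn.LayerNorm(d_model)  # LN()

    def forward(self, x):
        # Residual connection first, layer normalization last.
        return self.norm(x + self.sublayer(x))


# Example: a simple feed-forward sub-layer with d_model = 512
block = PostNormResidual(512, nn.Linear(512, 512))
out = block(torch.randn(2, 10, 512))  # (batch, sequence, d_model)
```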
Pre-Norm
Pre-Norm is defined as:

x_{l+1} = x_l + F(LN(x_l))
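A matching sketch (again only an illustration) shows that pre-norm only changes where LN() is applied:

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Pre-norm residual unit: x_{l+1} = x_l + F(LN(x_l))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # LN()
        self.sublayer = sublayer           # F()

    def forward(self, x):
        # Layer normalization first, then the sub-layer, then the residual addition.
        return x + self.sublayer(self.norm(x))
```

Because the residual path in pre-norm is an identity mapping that is not normalized, gradients can flow directly to lower layers, which is one reason pre-norm tends to be easier to train in very deep Transformers.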
Here LN() is the layer normalization function. To learn how to implement layer normalization, you can read:
Layer Normalization Explained for Beginners – Deep Learning Tutorial
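For reference, layer normalization is computed as LN(x) = g * (x - μ) / σ + b, where μ and σ are the mean and standard deviation of x over its feature dimension, g and b are learned gain and bias parameters, and * denotes element-wise multiplication.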
Which one is better?
Both of these methods are good choices for implementing a Transformer. As reported in Learning Deep Transformer Models for Machine Translation, the two residual units show comparable BLEU performance for a system based on a 6-layer encoder.
However, in the paper Transformers without Tears: Improving the Normalization of Self-Attention, we can find that pre-norm performs better.
For example, in the paper Conformer: Convolution-augmented Transformer for Speech Recognition, pre-norm is also used: the authors use pre-norm residual units with dropout, which helps training and regularizing deeper models.
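As a rough sketch of this idea (not the Conformer implementation itself; the dropout rate and sub-layer are placeholders), dropout is applied to the sub-layer output before the residual addition:

```python
import torch
import torch.nn as nn

class PreNormDropoutResidual(nn.Module):
    """Pre-norm residual unit with dropout: x_{l+1} = x_l + Dropout(F(LN(x_l)))."""
    def __init__(self, d_model, sublayer, p_drop=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x):
        # Normalize first, apply the sub-layer, drop out, then add the residual.
        return x + self.dropout(self.sublayer(self.norm(x)))
```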