Post-Norm and Pre-Norm Residual Units Explained – Deep Learning Tutorial

March 24, 2022

In this tutorial, we will introduce post-norm and pre-norm residual units, which are often used to improve the Transformer in deep learning. You can find more detail in the paper Learning Deep Transformer Models for Machine Translation.

Post-Norm

Post-Norm is defined as:

x_{l+1} = LN(x_l + F(x_l; θ_l))

Pre-Norm

Pre-Norm is defined as:

x_{l+1} = x_l + F(LN(x_l); θ_l)

Here, LN() is the layer normalization function, F() is the sub-layer function (for example, self-attention or a feed-forward network), and θ_l denotes the parameters of layer l. To learn how to implement layer normalization, you can read:

Layer Normalization Explained for Beginners – Deep Learning Tutorial
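To make the difference concrete, here is a minimal PyTorch sketch of the two residual units. The class names PostNormBlock and PreNormBlock and the generic sublayer argument are our own illustration, not code from the papers.

import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Post-Norm: x_{l+1} = LN(x_l + F(x_l))"""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer           # F(), e.g. self-attention or feed-forward
        self.norm = nn.LayerNorm(d_model)  # LN()

    def forward(self, x):
        # residual addition first, then layer normalization
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Pre-Norm: x_{l+1} = x_l + F(LN(x_l))"""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # layer normalization first, then the sub-layer, then the residual addition
        return x + self.sublayer(self.norm(x))

if __name__ == "__main__":
    d_model = 512
    ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
    x = torch.randn(2, 10, d_model)              # (batch, sequence length, d_model)
    print(PostNormBlock(d_model, ffn)(x).shape)  # torch.Size([2, 10, 512])
    print(PreNormBlock(d_model, ffn)(x).shape)   # torch.Size([2, 10, 512])

Both blocks keep the input and output shapes identical; only the position of the layer normalization relative to the residual addition changes.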

Which one is better?

Both of these methods are good choices for implementing a Transformer. The paper Learning Deep Transformer Models for Machine Translation reports: "In our experiments, they show comparable performance in BLEU for a system based on a 6-layer encoder."

In the paper Transformers without Tears: Improving the Normalization of Self-Attention, we can find that pre-norm performs better.

For example:

[Figure: Pre-Norm and Post-Norm, which is better]

In the paper Conformer: Convolution-augmented Transformer for Speech Recognition, pre-norm is also used.

The paper states that it uses pre-norm residual units with dropout, which helps with training and regularizing deeper models.
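As a sketch of that idea (our own illustration, assuming a generic sub-layer, not code from the Conformer paper), dropout is applied to the sub-layer output inside a pre-norm residual unit before the residual addition:

import torch.nn as nn

class PreNormDropoutBlock(nn.Module):
    """Pre-Norm residual unit with dropout: x_{l+1} = x_l + Dropout(F(LN(x_l)))"""
    def __init__(self, d_model, sublayer, p=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(p)  # regularizes deeper models

    def forward(self, x):
        # normalize, apply the sub-layer, drop out, then add the residual
        return x + self.dropout(self.sublayer(self.norm(x)))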
