Gated self-attention is an improvement on the standard self-attention mechanism. In this tutorial, we will explain it for deep learning beginners.
Gated self-attention
Gated self-attention contains two parts: a gate and self-attention.
The gate is computed by a sigmoid function, for example:
\(g_t = \text{sigmoid}(W[h_t, s_t])\)
where \([h_t, s_t]\) is the concatenation of \(h_t\) and \(s_t\).
Then, we can fuse \(h_t\) and \(s_t\) using this gate:
\(u_t = g_t \cdot h_t + (1-g_t) \cdot s_t\)
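Here is a minimal PyTorch sketch of this gated fusion. The class name GatedFusion, the dimension 128 and the batch size 8 are my own illustrative choices, not part of any fixed API.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse two feature vectors h_t and s_t with a learned sigmoid gate."""
    def __init__(self, d_model):
        super().__init__()
        # W maps the concatenation [h_t, s_t] (size 2*d_model) to a gate of size d_model
        self.W = nn.Linear(2 * d_model, d_model)

    def forward(self, h_t, s_t):
        # g_t = sigmoid(W[h_t, s_t])
        g_t = torch.sigmoid(self.W(torch.cat([h_t, s_t], dim=-1)))
        # u_t = g_t * h_t + (1 - g_t) * s_t
        return g_t * h_t + (1 - g_t) * s_t

# usage: a batch of 8 vectors with d_model = 128 (assumed shapes)
h_t = torch.randn(8, 128)
s_t = torch.randn(8, 128)
u_t = GatedFusion(128)(h_t, s_t)
print(u_t.shape)  # torch.Size([8, 128])
```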
In this fusion formula, \(h_t\) or \(s_t\) (or both) can be the output of a self-attention layer.
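For example, \(h_t\) could come from a standard self-attention layer. Below is a small sketch using PyTorch's nn.MultiheadAttention; the sequence length, batch size and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

# A sequence of 10 tokens, batch of 8, dimension 128 (assumed shapes)
x = torch.randn(8, 10, 128)

# Self-attention: queries, keys and values all come from the same sequence x
attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
h, _ = attn(x, x, x)  # h: (8, 10, 128), one vector h_t per position t
```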
Moreover, instead of using a gate, you can also concatenate \(h_t\) and \(s_t\) to get \(u_t\).
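Here is a minimal sketch of the concatenation alternative. Projecting back to the original dimension with a linear layer is my own optional choice here, not a required step.

```python
import torch
import torch.nn as nn

# Two example features of dimension 128 (assumed shapes)
h_t = torch.randn(8, 128)
s_t = torch.randn(8, 128)

# Concatenation: u_t has dimension 2 * 128
u_t = torch.cat([h_t, s_t], dim=-1)   # shape: (8, 256)

# Optionally project back to the original dimension
proj = nn.Linear(2 * 128, 128)
u_t = proj(u_t)                       # shape: (8, 128)
```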
You can read more about these two fusion strategies here: Feature Fusion: Pointwise Addition Or Concatenate Vectors? – Deep Learning Tutorial
Meanwhile, if there are more than two features, you can use self-attention to compute a weight for each feature.
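Here is a minimal sketch of this idea for N features. The class name FeatureAttentionFusion and the simple linear scorer are my own illustrative choices; other scoring functions are possible.

```python
import torch
import torch.nn as nn

class FeatureAttentionFusion(nn.Module):
    """Weight N feature vectors with softmax attention scores and sum them."""
    def __init__(self, d_model):
        super().__init__()
        # A simple learned scorer: one scalar score per feature vector (assumed design)
        self.score = nn.Linear(d_model, 1)

    def forward(self, features):
        # features: (batch, N, d_model)
        weights = torch.softmax(self.score(features), dim=1)  # (batch, N, 1)
        return (weights * features).sum(dim=1)                # (batch, d_model)

# usage: 4 features of dimension 128 for a batch of 8 (assumed shapes)
features = torch.stack([torch.randn(8, 128) for _ in range(4)], dim=1)
fused = FeatureAttentionFusion(128)(features)
print(fused.shape)  # torch.Size([8, 128])
```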
When to use gated self-attention?
If you plan to fuse two features, you can use a gate function to apply different weights to them.
Here is an example:
Understand Gated End-to-End Memory Networks – Deep Learning Tutorial