Scaled Dot-Product Attention was proposed in the paper: Attention Is All You Need
Scaled Dot-Product Attention is defined as:
\[Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\]
How to understand Scaled Dot-Product Attention?
Scaled Dot-Product Attention contains three parts:
1. Scaled
It means that the dot product is scaled. In the equation above, \(QK^T\) is divided (scaled) by \(\sqrt{d_k}\).
Why should we scale the dot product of two vectors?
Because the dot product of two vectors may be very large, for example:
\[QK^T=1000\]
Then, when we compute the softmax, \(e^{1000}\) will cause an overflow problem.
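Here is a minimal NumPy sketch of this problem (the score 1000 and \(d_k = 64\) are just illustrative values):

```python
import numpy as np

# Without scaling: exp(1000) overflows to inf, so the softmax becomes nan.
scores = np.array([1000.0, 999.0])
print(np.exp(scores))                         # [inf inf] (overflow warning)
print(np.exp(scores) / np.exp(scores).sum())  # [nan nan]

# With scaling by sqrt(d_k) (here d_k = 64), the values stay in a safe range.
d_k = 64
scaled = scores / np.sqrt(d_k)
print(np.exp(scaled) / np.exp(scaled).sum())  # valid probabilities
```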
2. Dot-Product
It means the computation of \(QK^T\).
3. Attention
It means the computation of \(softmax(\frac{QK^T}{\sqrt{d_k}})\), which gives the attention weights that are then applied to \(V\).
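Putting the three parts together, here is a minimal NumPy sketch of the equation above (the function name, the example shapes, and the max-subtraction trick inside the softmax are illustrative choices, not taken from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V with NumPy."""
    d_k = K.shape[-1]                     # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)       # Dot-Product, then Scaled
    # Softmax over the key dimension (subtract the max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                    # Attention: weighted sum of values

# Example: 3 queries, 5 keys/values, d_k = d_v = 4
Q = np.random.randn(3, 4)
K = np.random.randn(5, 4)
V = np.random.randn(5, 4)
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```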
The Idea Behind Scaled Dot-Product Attention
Similar to the attention above, we can also define our own scaled dot-product attention.
You only need to scale a dot product by the square root of its dimension.
For example:
\(softmax(\frac{QK^T}{\sqrt{d_m}})\).
where \(Q\in R^{m\times n}\), \(K \in R^{1\times n}\), and \(d_m = n\) is a scalar.
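A small sketch of this custom variant, assuming NumPy and that the softmax runs over the \(m\) scores (that axis choice, and the sizes used, are assumptions for illustration):

```python
import numpy as np

m, n = 6, 8                                    # illustrative sizes
Q = np.random.randn(m, n)                      # Q in R^{m x n}
K = np.random.randn(1, n)                      # K in R^{1 x n}

scores = (Q @ K.T).squeeze(-1) / np.sqrt(n)    # scale by sqrt(d_m), d_m = n
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()              # softmax over the m scores
print(weights.shape, weights.sum())            # (6,) ~1.0
```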