An Introduction to Scaled Dot-Product Attention in Deep Learning – Deep Learning Tutorial

October 11, 2020

Scaled Dot-Product Attention was proposed in the paper: Attention Is All You Need.

Scaled Dot-Product Attention is defined as:

\[Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\]
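Here is a minimal NumPy sketch of this equation. The function name and the toy shapes are my own choices for illustration, not from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V with NumPy."""
    d_k = K.shape[-1]                                # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)                  # scaled dot products, shape (num_queries, num_keys)
    scores -= scores.max(axis=-1, keepdims=True)     # subtract the row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the value vectors

# Toy example: 4 queries, 6 keys/values, d_k = 8, d_v = 5
Q = np.random.randn(4, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 5)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 5)
```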

How to understand Scaled Dot-Product Attention?

Scaled Dot-Product Attention consists of three parts:

1. Scaled

It means the dot product is scaled. In the equation above, \(QK^T\) is divided (scaled) by \(\sqrt{d_k}\).

Why should we scale the dot product of two vectors?

Because the dot product of two vectors can be very large, for example:

\[QK^T=1000\]

Then, when the softmax computes \(e^{1000}\), it may cause an overflow problem. Scaling by \(\sqrt{d_k}\) keeps the scores in a more moderate range; the paper also notes that large scores push the softmax into regions where its gradients are extremely small.
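A quick NumPy check of this overflow, with \(d_k = 512\) picked purely for illustration:

```python
import numpy as np

score = 1000.0                        # an unscaled dot product QK^T
print(np.exp(score))                  # inf, plus a RuntimeWarning about overflow

d_k = 512                             # an assumed key dimension, for illustration only
print(np.exp(score / np.sqrt(d_k)))   # about 1.6e19: large, but finite
```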

2. Dot-Product

It means the computation of \(QK^T\), i.e. the dot product between every query vector and every key vector.

3. Attention

It means the computation of \(softmax(\frac{QK^T}{\sqrt{d_k}})\), which turns the scaled scores into attention weights; these weights are then multiplied by \(V\) to produce the output.
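Putting parts 2 and 3 together, here is a small NumPy sketch of the attention weights themselves; the shapes are chosen only for illustration:

```python
import numpy as np

Q = np.random.randn(3, 4)                        # 3 queries of dimension d_k = 4
K = np.random.randn(5, 4)                        # 5 keys of the same dimension

scores = Q @ K.T / np.sqrt(K.shape[-1])          # parts 1 and 2: scaled dot products, shape (3, 5)
scores -= scores.max(axis=-1, keepdims=True)     # another common guard against overflow
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # part 3: softmax row by row

print(weights.shape)                             # (3, 5)
print(weights.sum(axis=-1))                      # each row sums to 1.0
```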

The Idea Behind Scaled Dot-Product Attention

Similar to the attention above, we can also define our own scaled dot-product attention.

You only need to scale the dot product by \(\sqrt{d_k}\).

For example:

\[softmax(\frac{QK^T}{\sqrt{d_m}})\]

where \(Q\in R^{m\times n}\), \(K \in R^{1\times n}\), and \(d_m = n\) is the dimension of each vector.
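A small NumPy sketch of this variant; the sizes m = 4 and n = 8 are my own choices, and my reading of the example is that the softmax is taken over the m scores:

```python
import numpy as np

m, n = 4, 8
Q = np.random.randn(m, n)                 # Q in R^{m x n}: m vectors of dimension n
K = np.random.randn(1, n)                 # K in R^{1 x n}: a single vector of dimension n
d_m = n                                   # the scalar n used for scaling

scores = (Q @ K.T) / np.sqrt(d_m)         # shape (m, 1)
weights = np.exp(scores - scores.max())   # softmax over the m scores
weights /= weights.sum()

print(weights.shape)                      # (m, 1): a distribution over the m rows of Q
print(weights.sum())                      # ~1.0
```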
