Scaled Dot-Product Attention was proposed in the paper: Attention Is All You Need
Scaled Dot-Product Attention is defined as:
\[Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\]
How to understand Scaled Dot-Product Attention?
Scaled Dot-Product Attention contains three parts:
1. Scaled
It means that the dot product is scaled. In the equation above, \(QK^T\) is divided (scaled) by \(\sqrt{d_k}\).
Why should we scale the dot product of two vectors?
Because the dot product of two vectors may be very large, for example:
\[QK^T=1000\]
Then, when we compute the softmax, \(e^{1000}\) will cause an overflow problem.
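Here is a minimal NumPy sketch of this problem (the score 1000 and \(d_k = 64\) are just illustrative values):

```python
import numpy as np

# Without scaling: exp(1000) overflows to inf, so the softmax becomes nan.
scores = np.array([1000.0, 999.0])
print(np.exp(scores))                         # [inf inf] (overflow warning)
print(np.exp(scores) / np.exp(scores).sum())  # [nan nan]

# With scaling by sqrt(d_k) (here d_k = 64), the values stay in a safe range.
d_k = 64
scaled = scores / np.sqrt(d_k)
print(np.exp(scaled) / np.exp(scaled).sum())  # valid probabilities
```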
2. Dot-Product
It means the computation of \(QK^T\).
3. Attention
It means the computation of \(softmax(\frac{QK^T}{\sqrt{d_k}})\), which gives the attention weights that are then applied to \(V\).
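Putting the three parts together, here is a minimal NumPy sketch of the equation above (the function name, the example shapes, and the max-subtraction trick inside the softmax are illustrative choices, not taken from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V with NumPy."""
    d_k = K.shape[-1]                     # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)       # Dot-Product, then Scaled
    # Softmax over the key dimension (subtract the max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                    # Attention: weighted sum of values

# Example: 3 queries, 5 keys/values, d_k = d_v = 4
Q = np.random.randn(3, 4)
K = np.random.randn(5, 4)
V = np.random.randn(5, 4)
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```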
The Idea Behind Scaled Dot-Product Attention
Similar to the attention above, we can also define our own scaled dot-product attention.
You only need to scale a dot product by the square root of its dimension.
For example:
\(softmax(\frac{QK^T}{\sqrt{d_m}})\).
where \(Q\in R^{m\times n}\), \(K \in R^{1\times n}\), and \(d_m = n\) is a scalar.
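A small sketch of this custom variant, assuming NumPy and that the softmax runs over the \(m\) scores (that axis choice, and the sizes used, are assumptions for illustration):

```python
import numpy as np

m, n = 6, 8                                    # illustrative sizes
Q = np.random.randn(m, n)                      # Q in R^{m x n}
K = np.random.randn(1, n)                      # K in R^{1 x n}

scores = (Q @ K.T).squeeze(-1) / np.sqrt(n)    # scale by sqrt(d_m), d_m = n
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()              # softmax over the m scores
print(weights.shape, weights.sum())            # (6,) ~1.0
```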