The attention mechanism is an important technique for improving the performance of deep learning models. It has two basic forms:
\(s_i = \sum_{j=1}^nf(a_{ij}w_{ij}) \) (1)
or
\(s_i = \sum_{j=1}^na_{ij}f(w_{ij})\) (2)
where \(a_{ij}\) is the attention weight of word \(w_{ij}\) and \(f\) is the function being applied.
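As a minimal sketch of the two forms (assuming, purely for illustration, scalar word representations \(w_{ij}\), softmax attention weights \(a_{ij}\), and \(f(x) = x^2\) as an example convex function; none of these choices are fixed by the equations above):

```python
import numpy as np

# Hypothetical toy inputs for a single position i: scalar word representations
# w_ij and attention weights a_ij produced by a softmax, so the weights lie in
# [0, 1] and sum to 1.
w = np.array([0.5, -1.2, 2.0, 0.3])
logits = np.array([1.0, 0.2, 2.5, 0.7])
a = np.exp(logits) / np.exp(logits).sum()

def f(x):
    # An example convex function with f(0) = 0; any convex f could be substituted.
    return x ** 2

# Equation (1): weight the inputs first, then apply f and sum.
s_eq1 = np.sum(f(a * w))

# Equation (2): apply f first, then take the attention-weighted sum.
s_eq2 = np.sum(a * f(w))

print(f"Equation (1): s_i = {s_eq1:.4f}")
print(f"Equation (2): s_i = {s_eq2:.4f}")
```

Running this prints a smaller value for Equation (1) than for Equation (2); as discussed below, this is no accident for a convex \(f\) with \(f(0) \le 0\).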
Which form is better? Equation (1) or (2)?
We can find the answer in Jensen’s Inequality.
Jensen’s Inequality
For a convex function \(f\), Jensen’s Inequality states that \(f\!\left(\sum_{j=1}^{n} a_j x_j\right) \le \sum_{j=1}^{n} a_j f(x_j)\) for any weights \(a_j \ge 0\) with \(\sum_{j=1}^{n} a_j = 1\).
Here is the full text:
http://www.cse.yorku.ca/~kosta/CompVis_Notes/jensen.pdf
(Figure: an illustrative example of Jensen’s Inequality for a convex function.)
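As a quick numerical sanity check of the inequality (a sketch only; the convex function \(f(x) = e^x\), the random points, and the weights are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary points and non-negative weights normalized to sum to 1.
x = rng.normal(size=10)
a = rng.random(10)
a /= a.sum()

f = np.exp  # a convex function

lhs = f(np.sum(a * x))   # f applied to the weighted average
rhs = np.sum(a * f(x))   # weighted average of the f values

# Jensen's Inequality for a convex f guarantees lhs <= rhs.
print(f"f(sum_j a_j x_j) = {lhs:.4f}")
print(f"sum_j a_j f(x_j) = {rhs:.4f}")
assert lhs <= rhs
```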
In deep learning, the loss function is chosen to be convex so that SGD can reach its minimum value. If \(f\) in the equations above is convex with \(f(0) \le 0\) and the attention weights satisfy \(0 \le a_{ij} \le 1\) (as softmax weights do), then Jensen’s Inequality gives \(f(a_{ij} w_{ij}) = f\big(a_{ij} w_{ij} + (1 - a_{ij}) \cdot 0\big) \le a_{ij} f(w_{ij}) + (1 - a_{ij}) f(0) \le a_{ij} f(w_{ij})\), so each term of Equation (1) is no larger than the corresponding term of Equation (2). Therefore we can use Equation (1).
To determine whether a function is convex, we can check whether its second derivative is non-negative.
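For instance, the check can be carried out by computing the second derivative symbolically and evaluating it on a grid (a minimal sketch using SymPy and NumPy; the candidate functions and the interval \([-5, 5]\) are arbitrary illustrative choices):

```python
import numpy as np
import sympy as sp

x = sp.symbols("x", real=True)

# Candidate functions to test for convexity (illustrative choices only).
candidates = {
    "x**2": x ** 2,
    "exp(x)": sp.exp(x),
    "sin(x)": sp.sin(x),
}

grid = np.linspace(-5.0, 5.0, 1001)

for name, expr in candidates.items():
    second = sp.diff(expr, x, 2)             # symbolic second derivative
    second_fn = sp.lambdify(x, second, "numpy")
    values = np.asarray(second_fn(grid))     # evaluate f'' on the grid
    # f is convex on a region where f''(x) >= 0 everywhere.
    convex = bool(np.all(values >= 0))
    print(f"{name}: f''(x) = {second}, convex on [-5, 5]: {convex}")
```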