The attention mechanism is an important technique for improving the performance of deep learning models. It has two basic forms:
\(s_i = \sum_{j=1}^nf(a_{ij}w_{ij}) \) (1)
or
\(s_i = \sum_{j=1}^na_{ij}f(w_{ij})\) (2)
where \(a_{ij}\) is the attention weight of word \(w_{ij}\) and \(f\) is the function being applied.
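As a minimal sketch of the two forms (assuming, purely for illustration, scalar word representations \(w_{ij}\), softmax attention weights \(a_{ij}\), and \(f(x) = x^2\) as an example convex function; none of these choices are fixed by the equations above):

```python
import numpy as np

# Hypothetical toy inputs for a single position i: scalar word representations
# w_ij and attention weights a_ij produced by a softmax, so the weights lie in
# [0, 1] and sum to 1.
w = np.array([0.5, -1.2, 2.0, 0.3])
logits = np.array([1.0, 0.2, 2.5, 0.7])
a = np.exp(logits) / np.exp(logits).sum()

def f(x):
    # An example convex function with f(0) = 0; any convex f could be substituted.
    return x ** 2

# Equation (1): weight the inputs first, then apply f and sum.
s_eq1 = np.sum(f(a * w))

# Equation (2): apply f first, then take the attention-weighted sum.
s_eq2 = np.sum(a * f(w))

print(f"Equation (1): s_i = {s_eq1:.4f}")
print(f"Equation (2): s_i = {s_eq2:.4f}")
```

Running this prints a smaller value for Equation (1) than for Equation (2); as discussed below, this is no accident for a convex \(f\) with \(f(0) \le 0\).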
Which form is better? Equation (1) or (2)?
We can find the answer in Jensen’s Inequality.
Jensen’s Inequality
For a convex function \(f\), Jensen’s Inequality states that \(f\!\left(\sum_{j=1}^{n} a_j x_j\right) \le \sum_{j=1}^{n} a_j f(x_j)\) for any weights \(a_j \ge 0\) with \(\sum_{j=1}^{n} a_j = 1\).
Here is the full text:
http://www.cse.yorku.ca/~kosta/CompVis_Notes/jensen.pdf
(Figure: an illustrative example of Jensen’s Inequality for a convex function.)
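As a quick numerical sanity check of the inequality (a sketch only; the convex function \(f(x) = e^x\), the random points, and the weights are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary points and non-negative weights normalized to sum to 1.
x = rng.normal(size=10)
a = rng.random(10)
a /= a.sum()

f = np.exp  # a convex function

lhs = f(np.sum(a * x))   # f applied to the weighted average
rhs = np.sum(a * f(x))   # weighted average of the f values

# Jensen's Inequality for a convex f guarantees lhs <= rhs.
print(f"f(sum_j a_j x_j) = {lhs:.4f}")
print(f"sum_j a_j f(x_j) = {rhs:.4f}")
assert lhs <= rhs
```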
In deep learning, the loss function is chosen to be convex so that SGD can reach its minimum value. If \(f\) in the equations above is convex with \(f(0) \le 0\) and the attention weights satisfy \(0 \le a_{ij} \le 1\) (as softmax weights do), then Jensen’s Inequality gives \(f(a_{ij} w_{ij}) = f\big(a_{ij} w_{ij} + (1 - a_{ij}) \cdot 0\big) \le a_{ij} f(w_{ij}) + (1 - a_{ij}) f(0) \le a_{ij} f(w_{ij})\), so each term of Equation (1) is no larger than the corresponding term of Equation (2). Therefore we can use Equation (1).
To determine whether a function is convex, we can check whether its second derivative is non-negative.
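For instance, the check can be carried out by computing the second derivative symbolically and evaluating it on a grid (a minimal sketch using SymPy and NumPy; the candidate functions and the interval \([-5, 5]\) are arbitrary illustrative choices):

```python
import numpy as np
import sympy as sp

x = sp.symbols("x", real=True)

# Candidate functions to test for convexity (illustrative choices only).
candidates = {
    "x**2": x ** 2,
    "exp(x)": sp.exp(x),
    "sin(x)": sp.sin(x),
}

grid = np.linspace(-5.0, 5.0, 1001)

for name, expr in candidates.items():
    second = sp.diff(expr, x, 2)             # symbolic second derivative
    second_fn = sp.lambdify(x, second, "numpy")
    values = np.asarray(second_fn(grid))     # evaluate f'' on the grid
    # f is convex on a region where f''(x) >= 0 everywhere.
    convex = bool(np.all(values >= 0))
    print(f"{name}: f''(x) = {second}, convex on [-5, 5]: {convex}")
```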