4. How to compute output o?
In the memory network, the output o is computed as a weighted sum of the output vectors:

o = Σi pi ci

where pi is the attention weight of memory i and ci is the vector of each word in sentence X.
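A minimal sketch of this weighted sum, in the same TensorFlow style as the code below (the tensor names p and memory_C are assumptions, not from the original implementation):
# p: attention weights pi, shape [batch, mem_size]; memory_C: output vectors ci, shape [batch, mem_size, edim]
o = tf.reduce_sum(tf.expand_dims(p, -1) * memory_C, axis=1)  # o = sum_i pi * ci, shape [batch, edim]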
5. How to compute ci?
Just as we computed mi, we can compute ci with a matrix C (of size V×d).
Notice:
(1) Matrix C is like matrix A: it is a variable trained with the model, and it represents the vector of each word in the vocabulary.
self.C = tf.Variable(tf.random_normal([self.nwords, self.edim], stddev=self.init_std)) # Embedding C for sentences
(2) If you use pretrained word vectors, you can map ci as
ci = Cxi
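For example, with self.C defined above, the ci vectors can be gathered from the word ids of X (x_ids is a hypothetical [batch, mem_size] tensor of word indices):
memory_C = tf.nn.embedding_lookup(self.C, x_ids)  # ci = C xi for every word of X, shape [batch, mem_size, edim]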
6. How to compute the model output and make a classification?
It is computed as:

a = softmax(W(o + u))

where W is the final prediction weight matrix.
(1) Why use o + u?
Because at test time the network takes X and Q as inputs:
o contains features of X weighted by Q,
u contains features of Q.
So using o + u can enhance the result; however, we can also use only o, since it already contains information from both X and Q.
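A sketch of this classification step, assuming a trainable prediction matrix self.W of shape [edim, nwords] (the name self.W is an assumption):
self.W = tf.Variable(tf.random_normal([self.edim, self.nwords], stddev=self.init_std))  # prediction matrix W
logits = tf.matmul(o + u, self.W)  # W(o + u), shape [batch, nwords]
a_hat = tf.nn.softmax(logits)      # predicted answer distribution
In practice the softmax is usually folded into tf.nn.softmax_cross_entropy_with_logits during training.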
7. Relation between matrix A and C?
In the network, matrix A represents the vector of each word together with the attention weight on Q,
while C is only the vector of each word and does not contain any other information or features.
In the paper: https://arxiv.org/abs/1503.08895
A ≠ C
The key point is pi: in the paper, matrix A contains both the word vector and the word attention-weight information based on Q.
But if pi is instead computed as

pi = softmax(uT M mi)

where M is a separate attention matrix, then we can set A = C in this model.
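A sketch of this variant: with a separate trainable attention matrix M, the same memory vectors can serve as both mi and ci (the names memory and u follow the sketches above and are assumptions):
M = tf.Variable(tf.random_normal([self.edim, self.edim], stddev=self.init_std))  # attention matrix M
u_M = tf.matmul(u, M)                                            # uT M, shape [batch, edim]
scores = tf.reduce_sum(memory * tf.expand_dims(u_M, 1), axis=2)  # uT M mi for every memory, shape [batch, mem_size]
p = tf.nn.softmax(scores)                                        # pi = softmax(uT M mi)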
8. How to use multiple layers?
You must use new matrices A and C for each layer (in the paper, B is only used once, to embed the query for the first layer). The query of the next layer is computed as:

uk+1 = uk + ok
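A sketch of stacking several hops without weight tying, assuming word-index tensors x_ids ([batch, mem_size]) and q_ids ([batch, q_len]) and hyperparameters nwords, edim, init_std, nhop (all names are assumptions):
B = tf.Variable(tf.random_normal([nwords, edim], stddev=init_std))         # query embedding B
u = tf.reduce_sum(tf.nn.embedding_lookup(B, q_ids), axis=1)                # u1 = B q (bag of words)
for k in range(nhop):
    A_k = tf.Variable(tf.random_normal([nwords, edim], stddev=init_std))   # input embedding A of hop k
    C_k = tf.Variable(tf.random_normal([nwords, edim], stddev=init_std))   # output embedding C of hop k
    m = tf.nn.embedding_lookup(A_k, x_ids)                                 # mi, shape [batch, mem_size, edim]
    c = tf.nn.embedding_lookup(C_k, x_ids)                                 # ci, shape [batch, mem_size, edim]
    p = tf.nn.softmax(tf.reduce_sum(m * tf.expand_dims(u, 1), axis=2))     # pi = softmax(uT mi)
    o = tf.reduce_sum(tf.expand_dims(p, -1) * c, axis=1)                   # ok = sum_i pi ci
    u = u + o                                                              # uk+1 = uk + ok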