GE2E loss was proposed in the paper <<Generalized End-to-End Loss for Speaker Verification>>. In this tutorial, we will introduce how it works for beginners.
How to compute the similarity matrix?
The similarity matrix is defined as:

S_ji,k = w * cos(e_ji, c_j^(-i)) + b,  if k = j
S_ji,k = w * cos(e_ji, c_k) + b,       otherwise

where e_ji is the embedding of utterance i of speaker j, c_k is the centroid of speaker k, and c_j^(-i) is the centroid of speaker j computed with utterance i excluded (eq. (8) in the paper). In the paper, w and b are initialized with (w, b) = (10, -5), and w is constrained to be positive so that a larger cosine similarity always gives a larger score.
Here is some example TensorFlow code (note that `normalize` is not defined in the original snippet; one possible definition, L2 normalization per row, is added below):

```python
# L2-normalize each row; one possible definition of `normalize`
normalize = lambda x: tf.nn.l2_normalize(x, dim=1)

embedded_split = tf.reshape(embedded, shape=[N, M, P])

# [N, P] normalized center vectors, eq. (1)
center = normalize(tf.reduce_mean(embedded_split, axis=1))

# [N*M, P] center vectors that exclude the current utterance, eq. (8)
center_except = normalize(tf.reshape(
    tf.reduce_sum(embedded_split, axis=1, keep_dims=True) - embedded_split,
    shape=[N * M, P]))

# make similarity matrix, eq. (9)
S = tf.concat(
    [tf.concat(
        [tf.reduce_sum(center_except[i * M:(i + 1) * M, :] * embedded_split[j, :, :],
                       axis=1, keep_dims=True)
         if i == j else
         tf.reduce_sum(center[i:(i + 1), :] * embedded_split[j, :, :],
                       axis=1, keep_dims=True)
         for i in range(N)], axis=1)
     for j in range(N)], axis=0)

S = tf.abs(w) * S + b  # rescaling; tf.abs keeps w positive
```
Here `embedded` contains the utterance embeddings of N speakers with M utterances each (shape [N*M, P]). `S` is the final [N*M, N] similarity matrix.
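To see what the TensorFlow code above computes, here is a pure-Python sketch of eqs. (1), (8) and (9). It is only an illustration: the function names and the toy embeddings below are our own, not from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(embeddings, N, M, w=10.0, b=-5.0):
    """Build the (N*M) x N similarity matrix of eq. (9).

    `embeddings` is a list of N*M vectors grouped by speaker:
    rows j*M .. j*M+M-1 belong to speaker j.
    """
    S = []
    for j in range(N):            # speaker of the utterance
        for i in range(M):        # utterance index within speaker j
            e = embeddings[j * M + i]
            row = []
            for k in range(N):    # candidate centroid
                group = embeddings[k * M:(k + 1) * M]
                if k == j:
                    # eq. (8): exclude the current utterance from its own centroid
                    group = group[:i] + group[i + 1:]
                centroid = [sum(c) / len(group) for c in zip(*group)]
                row.append(w * cosine(e, centroid) + b)  # eq. (9)
            S.append(row)
    return S
```

With two toy speakers (N=2, M=2), each utterance scores highest against its own speaker's centroid, as expected.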
GE2E Softmax
It is computed as follows:

L(e_ji) = -S_ji,j + log( sum_{k=1}^{N} exp(S_ji,k) )
Here is the TensorFlow code:

```python
# colored entries in Fig. 1
S_correct = tf.concat([S[i * M:(i + 1) * M, i:(i + 1)] for i in range(N)], axis=0)

# eq. (6): -S_ji,j + log(sum_k exp(S_ji,k)); 1e-6 avoids log(0)
total = -tf.reduce_sum(
    S_correct - tf.log(tf.reduce_sum(tf.exp(S), axis=1, keep_dims=True) + 1e-6))
```
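As a sanity check, here is a pure-Python sketch of the same softmax loss (without the 1e-6 stabilizer); the function name is our own:

```python
import math

def ge2e_softmax_loss(S, N, M):
    """GE2E softmax loss over an (N*M) x N similarity matrix.

    Row j*M + i holds the similarities of utterance i of speaker j.
    """
    total = 0.0
    for j in range(N):
        for i in range(M):
            row = S[j * M + i]
            # -S_ji,j + log(sum_k exp(S_ji,k))
            total += -row[j] + math.log(sum(math.exp(s) for s in row))
    return total
```

A similarity matrix whose correct entries are large gives a loss near zero, while a matrix with the entries swapped gives a much larger loss.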
GE2E Contrast
It is computed as follows:

L(e_ji) = 1 - sigmoid(S_ji,j) + max_{1 <= k <= N, k != j} sigmoid(S_ji,k)

where sigmoid(x) = 1 / (1 + exp(-x)).
Here is the TensorFlow code:

```python
# colored entries in Fig. 1
S_correct = tf.concat([S[i * M:(i + 1) * M, i:(i + 1)] for i in range(N)], axis=0)

S_sig = tf.sigmoid(S)
# zero out each utterance's own speaker column so the max runs over k != j
S_sig = tf.concat(
    [tf.concat([0 * S_sig[i * M:(i + 1) * M, j:(j + 1)]
                if i == j else
                S_sig[i * M:(i + 1) * M, j:(j + 1)]
                for j in range(N)], axis=1)
     for i in range(N)], axis=0)

# eq. (7): 1 - sigmoid(S_ji,j) + max_{k != j} sigmoid(S_ji,k)
total = tf.reduce_sum(
    1 - tf.sigmoid(S_correct) + tf.reduce_max(S_sig, axis=1, keep_dims=True))
```
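The contrast loss can likewise be sketched in pure Python (function names are our own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ge2e_contrast_loss(S, N, M):
    """GE2E contrast loss over an (N*M) x N similarity matrix.

    Row j*M + i holds the similarities of utterance i of speaker j.
    """
    total = 0.0
    for j in range(N):
        for i in range(M):
            row = S[j * M + i]
            # hardest negative: the most similar wrong-speaker centroid
            hardest = max(sigmoid(row[k]) for k in range(N) if k != j)
            # eq. (7): 1 - sigmoid(S_ji,j) + max_{k != j} sigmoid(S_ji,k)
            total += 1.0 - sigmoid(row[j]) + hardest
    return total
```

Unlike the softmax loss, only the single hardest negative centroid contributes to each utterance's term.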
GE2E Softmax vs. GE2E Contrast

| GE2E Softmax | GE2E Contrast |
| --- | --- |
| pushes the similarity values of the colored areas to be large, and the values of the gray areas to be small | pushes the blue embedding vector close to its own speaker's centroid (blue triangle), and far from the other centroids (red and purple triangles), especially the closest one (red triangle) |
| slightly better for TI-SV | better for TD-SV |
TI-SV: text-independent speaker verification
TD-SV: text-dependent speaker verification
How to train with GE2E loss
From the paper, we can find some basic training settings:
- N = 64 speakers and M = 10 utterances per speaker in each batch.
- The learning rate starts at 0.01 and is halved every 30M steps.
- The L2-norm of the gradient is clipped at 3.
- The gradient scale for the projection node in the LSTM is set to 0.5.
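The step-decay schedule above can be sketched as a small helper. The function name is our own; the paper only specifies the 0.01 starting value and the 30M-step halving interval:

```python
def ge2e_learning_rate(step, base_lr=0.01, decay_every=30_000_000):
    """Halve the base learning rate once every `decay_every` steps."""
    return base_lr * (0.5 ** (step // decay_every))
```

For example, the rate stays at 0.01 until step 30M, drops to 0.005, and reaches 0.00125 at step 90M.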