An Introduction to GE2E Loss for Beginners – Deep Learning Tutorial

March 14, 2022

GE2E loss was proposed in the paper "Generalized End-to-End Loss for Speaker Verification". In this tutorial, we will introduce it for beginners.

How to compute the similarity matrix?

The similarity matrix is defined as (eq.(9) in the paper):

    S_ji,k = w * cos(e_ji, c_j^(-i)) + b    if k = j
    S_ji,k = w * cos(e_ji, c_k) + b         otherwise

where e_ji is the embedding of utterance i of speaker j, c_k = (1/M) * sum_m e_km is the centroid of speaker k (eq.(1)), and c_j^(-i) = (1/(M-1)) * sum_{m != i} e_jm is the centroid of speaker j computed without utterance i (eq.(8)).

In the paper, w and b are initialized to (w, b) = (10, -5), and w is constrained to be positive.

Here is an example code:

import tensorflow as tf   # TensorFlow 1.x API (keep_dims, tf.log)

def normalize(x):
    # L2-normalize along the last axis; the small epsilon avoids division by zero
    return x / tf.sqrt(tf.reduce_sum(x ** 2, axis=-1, keep_dims=True) + 1e-6)

w = tf.Variable(10.0, name="w")   # (w, b) initialized to (10, -5)
b = tf.Variable(-5.0, name="b")

# embedded: [N*M, P] L2-normalized utterance embeddings, grouped by speaker
embedded_split = tf.reshape(embedded, shape=[N, M, P])
center = normalize(tf.reduce_mean(embedded_split, axis=1))              # [N,P] normalized center vectors eq.(1)
# exclusive centers eq.(8); the 1/(M-1) factor is absorbed by normalize()
center_except = normalize(tf.reshape(tf.reduce_sum(embedded_split, axis=1, keep_dims=True)
                                             - embedded_split, shape=[N*M,P]))  # [NM,P] center vectors
# similarity matrix eq.(9): exclusive center when an utterance is compared with its own speaker (i == j)
S = tf.concat(
    [tf.concat([tf.reduce_sum(center_except[i*M:(i+1)*M,:]*embedded_split[j,:,:], axis=1, keep_dims=True) if i==j
                else tf.reduce_sum(center[i:(i+1),:]*embedded_split[j,:,:], axis=1, keep_dims=True) for i in range(N)],
                axis=1) for j in range(N)], axis=0)

S = tf.abs(w)*S+b   # rescaling; tf.abs keeps the slope w positive

Here embedded is the [N*M, P] matrix of speaker embeddings (N speakers, M utterances each, embedding dimension P); it should be L2-normalized so that the dot products above are cosine similarities. S is the final [N*M, N] similarity matrix.
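To make the shapes concrete, here is a small NumPy sketch of eq.(1), eq.(8) and eq.(9). This is a re-implementation for illustration only; the toy sizes N = 3, M = 4, P = 5 and the random embeddings are our own choices, while w = 10 and b = -5 are the paper's initial values:

```python
import numpy as np

N, M, P = 3, 4, 5                      # speakers, utterances per speaker, embedding dim
rng = np.random.default_rng(0)
embedded = rng.normal(size=(N * M, P))
embedded /= np.linalg.norm(embedded, axis=1, keepdims=True)   # L2-normalize

e = embedded.reshape(N, M, P)
center = e.mean(axis=1)                                       # eq.(1): [N, P]
center /= np.linalg.norm(center, axis=1, keepdims=True)
center_except = (e.sum(axis=1, keepdims=True) - e).reshape(N * M, P)   # leave-one-out sums
center_except /= np.linalg.norm(center_except, axis=1, keepdims=True)  # eq.(8)

w, b = 10.0, -5.0
S = np.empty((N * M, N))
for j in range(N):            # speaker of the utterance
    for i_ in range(M):       # utterance index within that speaker
        row = j * M + i_
        for k in range(N):    # candidate centroid
            c = center_except[row] if k == j else center[k]
            S[row, k] = w * np.dot(embedded[row], c) + b      # eq.(9)

print(S.shape)  # (12, 3)
```

Because every cosine lies in [-1, 1], all entries of S fall in [-w + b, w + b].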

GE2E Softmax

It is computed as follows (eq.(6) in the paper):

    L(e_ji) = -S_ji,j + log( sum_{k=1..N} exp(S_ji,k) )

Here is the TensorFlow code:

S_correct = tf.concat([S[i*M:(i+1)*M, i:(i+1)] for i in range(N)], axis=0)  # colored entries in Fig.1: S_ji,j
# eq.(6) summed over all N*M utterances; the 1e-6 is for numerical safety
total = -tf.reduce_sum(S_correct-tf.log(tf.reduce_sum(tf.exp(S), axis=1, keep_dims=True) + 1e-6))
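To double-check the formula, the same loss can be computed per utterance in plain NumPy. This is an illustrative sketch; ge2e_softmax_loss and the toy matrix are our own names, and S must be the [N*M, N] similarity matrix with rows grouped by speaker:

```python
import numpy as np

def ge2e_softmax_loss(S, N, M):
    """Eq.(6): L(e_ji) = -S_ji,j + log(sum_k exp(S_ji,k)), summed over all utterances."""
    total = 0.0
    for j in range(N):              # speaker
        for i in range(M):          # utterance of that speaker
            row = S[j * M + i]
            total += -row[j] + np.log(np.sum(np.exp(row)))
    return total

# toy similarity matrix: N = 2 speakers, M = 2 utterances each
S_toy = np.array([[2.0, -1.0],
                  [1.5, -0.5],
                  [-1.0, 2.0],
                  [-0.5, 1.5]])
loss = ge2e_softmax_loss(S_toy, N=2, M=2)
print(loss > 0)   # the log-sum-exp of a row always exceeds its correct entry
```

Each per-utterance term is non-negative and shrinks as the correct similarity dominates its row, which is exactly what training pushes for.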

GE2E Contrast

It is computed as follows (eq.(7) in the paper):

    L(e_ji) = 1 - sigmoid(S_ji,j) + max_{1 <= k <= N, k != j} sigmoid(S_ji,k)

where sigmoid(x) = 1 / (1 + e^(-x)).

Here is the TensorFlow code:

S_correct = tf.concat([S[i*M:(i+1)*M, i:(i+1)] for i in range(N)], axis=0)  # colored entries in Fig.1: S_ji,j
S_sig = tf.sigmoid(S)
# zero out each utterance's own-speaker column so reduce_max only sees k != j
S_sig = tf.concat([tf.concat([0*S_sig[i*M:(i+1)*M, j:(j+1)] if i==j
                              else S_sig[i*M:(i+1)*M, j:(j+1)] for j in range(N)], axis=1)
                             for i in range(N)], axis=0)
total = tf.reduce_sum(1-tf.sigmoid(S_correct)+tf.reduce_max(S_sig, axis=1, keep_dims=True))
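Again, a plain NumPy sketch of the contrast loss may help; ge2e_contrast_loss and the toy matrix below are our own names for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ge2e_contrast_loss(S, N, M):
    """Eq.(7): L(e_ji) = 1 - sigmoid(S_ji,j) + max over k != j of sigmoid(S_ji,k)."""
    total = 0.0
    for j in range(N):
        for i in range(M):
            row = sigmoid(S[j * M + i])
            hardest_other = np.delete(row, j).max()   # closest wrong speaker
            total += 1.0 - row[j] + hardest_other
    return total

S_toy = np.array([[2.0, -1.0],
                  [1.5, -0.5],
                  [-1.0, 2.0],
                  [-0.5, 1.5]])   # N = 2 speakers, M = 2 utterances each
loss = ge2e_contrast_loss(S_toy, N=2, M=2)
print(0 < loss < 8)   # each of the 4 per-utterance terms lies in (0, 2)
```

Unlike the softmax version, only the single hardest wrong speaker contributes to each term, which is why the paper describes it as pushing embeddings away from the closest centroid in particular.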

GE2E Softmax Vs Contrast

  • GE2E Softmax: pushes the similarity values of the colored areas to be large and the values of the gray areas to be small; slightly better for TI-SV.
  • GE2E Contrast: pushes the blue embedding vector to be close to its own speaker's centroid (blue triangle) and far from the other centroids (red and purple triangles), especially the closest one (red triangle); better for TD-SV.


TI-SV: text-independent speaker verification

TD-SV: text-dependent speaker verification

How to train GE2E

From the paper, we can find some basic training settings:

  • Each batch contains N = 64 speakers and M = 10 utterances per speaker.
  • The learning rate is 0.01 and is decreased by half every 30M steps.
  • The L2-norm of the gradient is clipped at 3.
  • The gradient scale for the projection node in the LSTM is set to 0.5.
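The schedule and clipping above can be sketched in a few lines of plain Python; learning_rate and clip_l2 are our own helper names, not from the paper:

```python
def learning_rate(step, base_lr=0.01, halve_every=30_000_000):
    """Start at 0.01 and halve every 30M steps."""
    return base_lr * 0.5 ** (step // halve_every)

def clip_l2(grad, max_norm=3.0):
    """Rescale a gradient vector so its L2-norm is at most max_norm."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm <= max_norm:
        return grad
    return [g * max_norm / norm for g in grad]

print(learning_rate(0))             # 0.01
print(learning_rate(30_000_000))    # 0.005
print(clip_l2([3.0, 4.0]))          # [1.8, 2.4]: norm 5 scaled down to 3
```

In a real TensorFlow training loop the same effect would come from a decaying learning-rate schedule plus tf.clip_by_norm on the gradients.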
