GE2E loss was proposed in the paper <<Generalized End-to-End Loss for Speaker Verification>>. In this tutorial, we will introduce how it works for beginners.
How to compute the similarity matrix?
The similarity matrix is defined as:

S_ji,k = w * cos(e_ji, c_j^(-i)) + b,  if k = j
S_ji,k = w * cos(e_ji, c_k) + b,       otherwise

where e_ji is the embedding of utterance i of speaker j, c_k is the centroid of speaker k, and c_j^(-i) is the centroid of speaker j computed with utterance i excluded (eq. (8) in the paper). In the paper, w and b are initialized with (w, b) = (10, -5), and w is constrained to be positive so that a larger cosine similarity always gives a larger score.
Here is some example TensorFlow code (note that `normalize` is not defined in the original snippet; one possible definition, L2 normalization per row, is added below):

```python
# L2-normalize each row; one possible definition of `normalize`
normalize = lambda x: tf.nn.l2_normalize(x, dim=1)

embedded_split = tf.reshape(embedded, shape=[N, M, P])

# [N, P] normalized center vectors, eq. (1)
center = normalize(tf.reduce_mean(embedded_split, axis=1))

# [N*M, P] center vectors that exclude the current utterance, eq. (8)
center_except = normalize(tf.reshape(
    tf.reduce_sum(embedded_split, axis=1, keep_dims=True) - embedded_split,
    shape=[N * M, P]))

# make similarity matrix, eq. (9)
S = tf.concat(
    [tf.concat(
        [tf.reduce_sum(center_except[i * M:(i + 1) * M, :] * embedded_split[j, :, :],
                       axis=1, keep_dims=True)
         if i == j else
         tf.reduce_sum(center[i:(i + 1), :] * embedded_split[j, :, :],
                       axis=1, keep_dims=True)
         for i in range(N)], axis=1)
     for j in range(N)], axis=0)

S = tf.abs(w) * S + b  # rescaling; tf.abs keeps w positive
```
Here `embedded` contains the utterance embeddings of N speakers with M utterances each (shape [N*M, P]). `S` is the final [N*M, N] similarity matrix.
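To see what the TensorFlow code above computes, here is a pure-Python sketch of eqs. (1), (8) and (9). It is only an illustration: the function names and the toy embeddings below are our own, not from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(embeddings, N, M, w=10.0, b=-5.0):
    """Build the (N*M) x N similarity matrix of eq. (9).

    `embeddings` is a list of N*M vectors grouped by speaker:
    rows j*M .. j*M+M-1 belong to speaker j.
    """
    S = []
    for j in range(N):            # speaker of the utterance
        for i in range(M):        # utterance index within speaker j
            e = embeddings[j * M + i]
            row = []
            for k in range(N):    # candidate centroid
                group = embeddings[k * M:(k + 1) * M]
                if k == j:
                    # eq. (8): exclude the current utterance from its own centroid
                    group = group[:i] + group[i + 1:]
                centroid = [sum(c) / len(group) for c in zip(*group)]
                row.append(w * cosine(e, centroid) + b)  # eq. (9)
            S.append(row)
    return S
```

With two toy speakers (N=2, M=2), each utterance scores highest against its own speaker's centroid, as expected.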
GE2E Softmax
It is computed as follows:

L(e_ji) = -S_ji,j + log( sum_{k=1}^{N} exp(S_ji,k) )
Here is the TensorFlow code:

```python
# colored entries in Fig. 1
S_correct = tf.concat([S[i * M:(i + 1) * M, i:(i + 1)] for i in range(N)], axis=0)

# eq. (6): -S_ji,j + log(sum_k exp(S_ji,k)); 1e-6 avoids log(0)
total = -tf.reduce_sum(
    S_correct - tf.log(tf.reduce_sum(tf.exp(S), axis=1, keep_dims=True) + 1e-6))
```
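As a sanity check, here is a pure-Python sketch of the same softmax loss (without the 1e-6 stabilizer); the function name is our own:

```python
import math

def ge2e_softmax_loss(S, N, M):
    """GE2E softmax loss over an (N*M) x N similarity matrix.

    Row j*M + i holds the similarities of utterance i of speaker j.
    """
    total = 0.0
    for j in range(N):
        for i in range(M):
            row = S[j * M + i]
            # -S_ji,j + log(sum_k exp(S_ji,k))
            total += -row[j] + math.log(sum(math.exp(s) for s in row))
    return total
```

A similarity matrix whose correct entries are large gives a loss near zero, while a matrix with the entries swapped gives a much larger loss.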
GE2E Contrast
It is computed as follows:

L(e_ji) = 1 - sigmoid(S_ji,j) + max_{1 <= k <= N, k != j} sigmoid(S_ji,k)

where sigmoid(x) = 1 / (1 + exp(-x)).
Here is the TensorFlow code:

```python
# colored entries in Fig. 1
S_correct = tf.concat([S[i * M:(i + 1) * M, i:(i + 1)] for i in range(N)], axis=0)

S_sig = tf.sigmoid(S)
# zero out each utterance's own speaker column so the max runs over k != j
S_sig = tf.concat(
    [tf.concat([0 * S_sig[i * M:(i + 1) * M, j:(j + 1)]
                if i == j else
                S_sig[i * M:(i + 1) * M, j:(j + 1)]
                for j in range(N)], axis=1)
     for i in range(N)], axis=0)

# eq. (7): 1 - sigmoid(S_ji,j) + max_{k != j} sigmoid(S_ji,k)
total = tf.reduce_sum(
    1 - tf.sigmoid(S_correct) + tf.reduce_max(S_sig, axis=1, keep_dims=True))
```
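The contrast loss can likewise be sketched in pure Python (function names are our own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ge2e_contrast_loss(S, N, M):
    """GE2E contrast loss over an (N*M) x N similarity matrix.

    Row j*M + i holds the similarities of utterance i of speaker j.
    """
    total = 0.0
    for j in range(N):
        for i in range(M):
            row = S[j * M + i]
            # hardest negative: the most similar wrong-speaker centroid
            hardest = max(sigmoid(row[k]) for k in range(N) if k != j)
            # eq. (7): 1 - sigmoid(S_ji,j) + max_{k != j} sigmoid(S_ji,k)
            total += 1.0 - sigmoid(row[j]) + hardest
    return total
```

Unlike the softmax loss, only the single hardest negative centroid contributes to each utterance's term.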
GE2E Softmax vs. GE2E Contrast

| GE2E Softmax | GE2E Contrast |
| --- | --- |
| pushes the similarity values of the colored areas to be large, and the values of the gray areas to be small | pushes the blue embedding vector close to its own speaker's centroid (blue triangle), and far from the other centroids (red and purple triangles), especially the closest one (red triangle) |
| slightly better for TI-SV | better for TD-SV |
TI-SV: text-independent speaker verification
TD-SV: text-dependent speaker verification
How to train with GE2E loss
From the paper, we can find some basic training settings:
- N = 64 speakers and M = 10 utterances per speaker in each batch.
- The learning rate starts at 0.01 and is halved every 30M steps.
- The L2-norm of the gradient is clipped at 3.
- The gradient scale for the projection node in the LSTM is set to 0.5.
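The step-decay schedule above can be sketched as a small helper. The function name is our own; the paper only specifies the 0.01 starting value and the 30M-step halving interval:

```python
def ge2e_learning_rate(step, base_lr=0.01, decay_every=30_000_000):
    """Halve the base learning rate once every `decay_every` steps."""
    return base_lr * (0.5 ** (step // decay_every))
```

For example, the rate stays at 0.01 until step 30M, drops to 0.005, and reaches 0.00125 at step 90M.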