Convolution Bank (ConvBank) is proposed in the paper: Tacotron: Towards End-to-End Speech Synthesis. In this tutorial, we will introduce how to implement it using TensorFlow.
Convolution Bank (ConvBank)
Convolution Bank is a 1-D convolutional network. It contains K sets of 1-D convolutional filters, where the k-th set contains \(C_k\) filters of width k (i.e. k = 1, 2, ..., K). K = 8 in the paper.
Convolution Bank filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to K-grams). The convolution outputs are stacked together and further max pooled along time to increase local invariance. Note that a stride of 1 is used to preserve the original time resolution. Batch normalization (Ioffe & Szegedy, 2015) is used for all convolutional layers.
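To make the unigram-to-K-gram intuition concrete, here is a minimal NumPy sketch (not the Tacotron implementation) of a bank of 1-D convolutions with widths 1 to K, stride 1, and 'same' padding, concatenated along the channel axis. The averaging filter is an illustrative stand-in for the \(C_k\) learned filters:

```python
import numpy as np

def conv1d_same(x, width):
    """Naive 1-D 'same' convolution along time with stride 1.

    x: [T, C] input; width: filter width k.
    Uses a single averaging filter per width just to show the shapes;
    a real layer would learn C_k filters per width.
    """
    T, C = x.shape
    # 'same' padding for stride 1: pad width-1 frames in total.
    pad_left = (width - 1) // 2
    pad_right = width - 1 - pad_left
    xp = np.pad(x, ((pad_left, pad_right), (0, 0)))
    # Each output frame t sees frames t..t+width-1, i.e. a k-gram window.
    return np.stack([xp[t:t + width].mean(axis=0) for t in range(T)])

T, C, K = 5, 3, 4
x = np.random.randn(T, C)
outputs = [conv1d_same(x, k) for k in range(1, K + 1)]  # widths k = 1..K
bank = np.concatenate(outputs, axis=-1)                 # stack on channel axis
print(bank.shape)  # (5, 12): time resolution preserved, channels stacked
```

With stride 1 and 'same' padding, every filter width yields the same time length T, so the K outputs can be concatenated on the channel axis.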
How to implement ConvBank in TensorFlow?
We will use an example to show you how to do it.
Step 1: create a 1-D convolutional layer with batch normalization
Here is the code:
import tensorflow as tf

def conv1d(inputs, kernel_size, channels, activation, is_training, scope):
    with tf.variable_scope(scope):
        conv1d_output = tf.layers.conv1d(
            inputs,
            filters=channels,
            kernel_size=kernel_size,
            activation=activation,
            padding='same')
        return tf.layers.batch_normalization(conv1d_output, training=is_training)
Here are some useful resources:
Understand TensorFlow tf.layers.conv1d() with Examples – TensorFlow Tutorial
A Step Guide to Implement Batch Normalization in TensorFlow – TensorFlow Tutorial
Step 2: create convolution bank with max pooling
# inputs: [N, T, C], rank 3
def convbank(inputs, K = 8, is_training = True, scope = 'conv_bank'):
    with tf.variable_scope(scope):
        # Convolution bank: concatenate on the last axis to stack channels
        # from all convolutions. conv_outputs: [N, T, K*128]
        conv_outputs = tf.concat(
            [conv1d(inputs, k, 128, tf.nn.relu, is_training, 'conv1d_%d' % k)
             for k in range(1, K + 1)],
            axis=-1
        )
        # Max pooling: [N, T, K*128]
        maxpool_output = tf.layers.max_pooling1d(
            conv_outputs,
            pool_size=2,
            strides=1,
            padding='same')
        return maxpool_output
To understand tf.layers.max_pooling1d(), you can view:
Understand tf.layers.max_pooling1d(): Max Pooling Layer for 1D Inputs – TensorFlow Tutorial
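As a sanity check on the pooling step, here is a NumPy sketch (illustrative, not the TensorFlow API) of max pooling with pool_size = 2, strides = 1, and 'same' padding, showing that the time dimension is preserved:

```python
import numpy as np

def max_pool1d_same(x, pool_size=2):
    """Max pooling along time with stride 1 and 'same' padding (NumPy sketch).

    x: [T, C] input; returns [T, C], same time length as the input.
    """
    T = x.shape[0]
    # 'same' with stride 1 pads pool_size - 1 frames (extra on the right).
    pad_total = pool_size - 1
    pad_left = pad_total // 2
    xp = np.pad(x, ((pad_left, pad_total - pad_left), (0, 0)),
                constant_values=-np.inf)  # -inf so padding never wins the max
    return np.stack([xp[t:t + pool_size].max(axis=0) for t in range(T)])

x = np.arange(12, dtype=float).reshape(6, 2)  # [T=6, C=2]
y = max_pool1d_same(x)
print(y.shape)  # (6, 2): time and channel dimensions unchanged
```

Because stride is 1 and padding is 'same', pooling only smooths along time; it does not downsample, which is why the output keeps shape [N, T, K*128].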
From this code, we can see that the input is rank 3. The maxpool_output is also rank 3, and the dimension of its last axis is K*128.
We can use the code below to test convbank.
w = tf.Variable(tf.glorot_uniform_initializer()([4, 50, 200]), name="w")
convbank_maxpool_out = convbank(w, K = 8, is_training = True, scope = 'conv_bank')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x = sess.run(convbank_maxpool_out)
    print(x.shape)
    print(x)
Running this code, we will get:
(4, 50, 1024)
[[[ 1.0207694e+00  2.9554680e-01  2.2016469e-01 ... -1.3390291e-01  1.3971468e+00 -4.3614873e-01]
  [ 1.0207694e+00  2.9967129e-03  8.5925090e-01 ... -4.4449669e-01 -4.6057454e-01  2.5842670e-01]
  [ 2.5612593e+00  4.2234072e-01  1.5048288e+00 ...  3.2633919e-01  3.8881072e-01  2.5842670e-01]
  ...
  [-4.0765771e-01 -4.0997037e-01  3.8271800e-01 ...  1.2461096e-02  1.2668070e-01  3.6472526e-01]
  [-4.0765771e-01 -4.0997037e-01  3.8271800e-01 ...  1.2461096e-02 -2.2410196e-01  3.6472526e-01]
  [-4.0765771e-01 -4.0997037e-01 -4.4883779e-01 ... -4.4449669e-01 -2.2410196e-01  3.5127762e-01]]
 ...
 [[-4.0765771e-01  1.2246618e-01  1.5277933e+00 ...  1.3210013e+00  3.5866559e-02  1.7703870e-01]
  [-4.0765771e-01  1.2246618e-01 -2.5711954e-03 ...  1.3210013e+00  3.5866559e-02  1.7703870e-01]
  [-4.0765771e-01 -4.0997037e-01 -2.5711954e-03 ... -4.4449669e-01 -4.6057454e-01 -4.3614873e-01]]]
The input w is [4, 50, 200], and the final output is [4, 50, 1024], where 1024 = 8 * 128 (K = 8, with 128 filters per set).
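The shape arithmetic can be checked directly. This tiny sketch just recomputes the expected output shape from the hyper-parameters used above:

```python
# Expected ConvBank output shape from the hyper-parameters used above.
N, T, C = 4, 50, 200   # input: batch, time, channels
K, channels = 8, 128   # K filter widths, 128 filters per width
# Stride-1 'same' convolutions and pooling preserve N and T;
# concatenating K sets of 128 channels gives K * 128 output channels.
out_shape = (N, T, K * channels)
print(out_shape)  # (4, 50, 1024)
```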