CBHG is an important module in Tacotron. In this tutorial, we will implement it with TensorFlow.
CBHG Model
CBHG consists of three parts: a 1-D convolution bank, a highway network and a bidirectional GRU.
To create a 1-D convolution bank, you can view:
Implement Convolution Bank (ConvBank) in TensorFlow – TensorFlow Tutorial
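As a rough illustration of the idea, here is a minimal pure-Python sketch of a convolution bank on a single-channel signal. The helper names (conv1d_same, conv_bank) and the averaging kernels are assumptions made for this example; in a real model the kernels are learned and each width has many filters.

```python
def conv1d_same(signal, kernel):
    """1-D convolution with zero padding so the output keeps the input length."""
    k = len(kernel)
    left = (k - 1) // 2  # any extra padding goes on the right
    padded = [0.0] * left + list(signal) + [0.0] * (k - 1 - left)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(signal))
    ]


def conv_bank(signal, K):
    """Apply filters of width 1..K and stack their outputs per time step."""
    # Hypothetical fixed kernels (simple averages) stand in for learned weights.
    outputs = [conv1d_same(signal, [1.0 / k] * k) for k in range(1, K + 1)]
    # Transpose so each time step holds one value per filter width.
    return [list(step) for step in zip(*outputs)]


banked = conv_bank([1.0, 2.0, 3.0, 4.0], K=2)
print(banked)
```

Each time step ends up with K values, one per kernel width, which corresponds to concatenating the convolution outputs on the last axis.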
To create a highway network, you can view:
Implement Highway Networks in TensorFlow: A Step Guide – TensorFlow Tutorial
We should notice: CBHG uses a 4-layer highway network.
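To make the highway idea concrete, here is a minimal pure-Python sketch of one highway layer and a stack of four. It assumes the common formulation y = T * H + (1 - T) * x, with a ReLU transform H and a sigmoid gate T. The function names and the toy weights are hypothetical and only serve the example.

```python
import math


def dense(x, weights, biases):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T * H + (1 - T) * x, where H uses ReLU and T uses sigmoid."""
    h = [max(0.0, v) for v in dense(x, w_h, b_h)]
    t = [1.0 / (1.0 + math.exp(-v)) for v in dense(x, w_t, b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def highway_stack(x, layers):
    """Apply several highway layers in sequence."""
    for params in layers:
        x = highway_layer(x, *params)
    return x


# Toy parameters (hypothetical): identity weights and a very negative gate bias,
# so the gate stays almost closed and the input is carried through unchanged.
identity = [[1.0, 0.0], [0.0, 1.0]]
closed_gate = (identity, [0.0, 0.0], identity, [-20.0, -20.0])
x = [0.5, -0.25]
y = highway_stack(x, [closed_gate] * 4)
print(y)
```

With the gate almost closed, the four layers return a value very close to the input. With a large positive gate bias, the output would instead follow the ReLU transform.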
To create the bidirectional GRU in TensorFlow, we can use tf.nn.bidirectional_dynamic_rnn().
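To illustrate what a bidirectional RNN computes, here is a minimal pure-Python sketch. It uses a toy recurrent cell (a running sum, not a GRU), and the names run_rnn, bidirectional and toy_step are hypothetical. The point is only that the backward pass reads the sequence in reverse and that both outputs are concatenated per time step.

```python
def run_rnn(sequence, step, initial_state):
    """Apply `step(state, x) -> state` over time; return the state at every step."""
    state, outputs = initial_state, []
    for x in sequence:
        state = step(state, x)
        outputs.append(state)
    return outputs


def bidirectional(sequence, step_fw, step_bw, initial_state):
    """Run a forward and a backward pass, then concatenate them per time step."""
    fw = run_rnn(sequence, step_fw, initial_state)
    # The backward pass reads the reversed sequence; reverse again to realign time.
    bw = run_rnn(sequence[::-1], step_bw, initial_state)[::-1]
    return [f + b for f, b in zip(fw, bw)]  # list concatenation on the feature axis


def toy_step(state, x):
    """Hypothetical 1-unit cell: a running sum of the inputs."""
    return [state[0] + x]


outputs = bidirectional([1.0, 2.0, 3.0], toy_step, toy_step, [0.0])
print(outputs)  # [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

Each time step holds the forward features followed by the backward features. With two 128-unit GRU cells, the concatenated output would therefore have 256 features per step.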
How to create CBHG in TensorFlow?
Here is some example code:
https://github.com/zuoxiang95/tacotron-1/blob/master/models/modules.py
The function is:
# TensorFlow 1.x API. conv1d and highwaynet are helper functions
# expected to come from the linked modules.py.
import tensorflow as tf
from tensorflow.contrib.rnn import GRUCell

def cbhg(inputs, input_lengths, is_training, scope, K, projections):
    with tf.variable_scope(scope):
        with tf.variable_scope('conv_bank'):
            # Convolution bank: concatenate on the last axis to stack channels from all convolutions
            conv_outputs = tf.concat(
                [conv1d(inputs, k, 128, tf.nn.relu, is_training, 'conv1d_%d' % k) for k in range(1, K+1)],
                axis=-1
            )

        # Maxpooling:
        maxpool_output = tf.layers.max_pooling1d(
            conv_outputs,
            pool_size=2,
            strides=1,
            padding='same')

        # Two projection layers:
        proj1_output = conv1d(maxpool_output, 3, projections[0], tf.nn.relu, is_training, 'proj_1')
        proj2_output = conv1d(proj1_output, 3, projections[1], None, is_training, 'proj_2')

        # Residual connection:
        highway_input = proj2_output + inputs

        # Handle dimensionality mismatch:
        if highway_input.shape[2] != 128:
            highway_input = tf.layers.dense(highway_input, 128)

        # 4-layer HighwayNet:
        for i in range(4):
            highway_input = highwaynet(highway_input, 'highway_%d' % (i+1))
        rnn_input = highway_input

        # Bidirectional RNN
        outputs, states = tf.nn.bidirectional_dynamic_rnn(
            GRUCell(128),
            GRUCell(128),
            rnn_input,
            sequence_length=input_lengths,
            dtype=tf.float32)
        return tf.concat(outputs, axis=2)  # Concat forward and backward
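The shapes that flow through the function above can be traced with a small pure-Python sketch. The helper name cbhg_shapes and the example values are hypothetical. The sketch also assumes that the conv1d helper uses 'same' padding, so the time dimension never changes, and that the residual addition needs an exact channel match.

```python
def cbhg_shapes(input_shape, K, projections, units=128):
    """Trace [batch, time, channels] through each CBHG stage (a sketch, not TensorFlow)."""
    if len(input_shape) != 3:
        raise ValueError("inputs must have rank 3: [batch, time, channels]")
    batch, time, channels = input_shape
    shapes = {}
    # K convolutions with `units` filters each, concatenated on the last axis.
    shapes["conv_bank"] = (batch, time, K * units)
    # pool_size=2, strides=1, padding='same' keeps both time and channels.
    shapes["max_pool"] = (batch, time, K * units)
    shapes["proj_1"] = (batch, time, projections[0])
    shapes["proj_2"] = (batch, time, projections[1])
    # Assumption of this sketch: the residual addition needs matching channels.
    if projections[1] != channels:
        raise ValueError("projections[1] must equal the input channel size")
    shapes["residual"] = (batch, time, channels)
    # A dense layer maps to `units` channels when needed; the highway layers keep that size.
    shapes["highway"] = (batch, time, units)
    # Forward and backward GRU outputs are concatenated on the last axis.
    shapes["bi_gru"] = (batch, time, 2 * units)
    return shapes


# Example values chosen only for illustration.
shapes = cbhg_shapes((32, 100, 128), K=16, projections=[128, 128])
for name, shape in shapes.items():
    print(name, shape)
```

For these example values the convolution bank produces 16 * 128 = 2048 channels, and the final output has 256 channels per time step.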
We should notice how the Tacotron paper describes this step:
"We further pass the processed sequence to a few fixed-width 1-D convolutions, whose outputs are added with the original input sequence via residual connections."
In the code, the output of max pooling is fed into two projection layers (the final layer has no activation function), and the result is then added to the original inputs through a residual connection.
Look at the code below:
# Two projection layers:
proj1_output = conv1d(maxpool_output, 3, projections[0], tf.nn.relu, is_training, 'proj_1')
proj2_output = conv1d(proj1_output, 3, projections[1], None, is_training, 'proj_2')

# Residual connection:
highway_input = proj2_output + inputs
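As to inputs, it should be a rank-3 tensor, for example with shape [batch_size, time_steps, channels]. Because of the residual connection, projections[1] must equal the channel size of inputs.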
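The residual addition also relies on the time dimension staying the same length after max pooling. Here is a minimal pure-Python sketch of max pooling over one channel with stride 1 and 'same'-style padding. It assumes that padded positions are ignored and that any extra padding goes at the end of the sequence, which is my understanding of how TensorFlow handles this case. The function name is hypothetical.

```python
def max_pool_1d_same_stride1(values, pool_size=2):
    """Max pooling on one channel with stride 1; the output keeps the input length."""
    n = len(values)
    pad_before = (pool_size - 1) // 2  # the remaining padding falls after the end
    pooled = []
    for t in range(n):
        start = max(0, t - pad_before)
        stop = min(n, t - pad_before + pool_size)
        # Positions outside the sequence are skipped instead of padded with zeros.
        pooled.append(max(values[start:stop]))
    return pooled


signal = [1, 3, 2, 5, 4]
pooled = max_pool_1d_same_stride1(signal)
print(pooled)  # [3, 3, 5, 5, 4]
```

The output has the same number of time steps as the input, so the projected sequence can be added element-wise to inputs.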