Implement Post-Net in Text to Speech Using TensorFlow

Post-Net, also called the post-network, is used in two papers: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, and AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss.

Each post-net layer is comprised of 512 filters with shape 5 × 1 with batch normalization, followed by tanh activations on all but the final layer.
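To make the layer description concrete, here is a minimal sketch in plain Python that lists the channel sizes of each layer and counts the trainable parameters. The helper names postnet_layer_plan and count_parameters are hypothetical, and the values of 5 layers and 80 input channels are assumptions taken from later in this tutorial, not from the sentence above.

```python
# Hypothetical helpers for illustration; not part of TensorFlow.
def postnet_layer_plan(n_mels=80, filters=512, n_layers=5):
    """Return (in_channels, out_channels, activation) for each conv layer."""
    plan = []
    in_ch = n_mels
    for i in range(n_layers):
        is_last = (i == n_layers - 1)
        out_ch = n_mels if is_last else filters   # last layer projects back to n_mels
        activation = None if is_last else "tanh"  # tanh on all but the final layer
        plan.append((in_ch, out_ch, activation))
        in_ch = out_ch
    return plan

def count_parameters(plan, kernel_size=5):
    """Count conv kernel + bias, plus batch-norm gamma and beta, per layer."""
    total = 0
    for in_ch, out_ch, _ in plan:
        total += kernel_size * in_ch * out_ch + out_ch  # conv kernel + bias
        total += 2 * out_ch                             # batch-norm gamma and beta
    return total

plan = postnet_layer_plan()
for i, layer in enumerate(plan):
    print(i, layer)
print(count_parameters(plan))
```

Under these assumptions, most of the parameters sit in the three middle layers, which map 512 channels to 512 channels.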

Why use Post-Net?

From the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, we can find that, without the post-net, the model only obtains a MOS (mean opinion score) of 4.429 ± 0.071, compared to 4.526 ± 0.066 with it, meaning that empirically the post-net is still an important part of the network design.

How to build Post-Net in TensorFlow?

In this tutorial, we will use an example to show how to create a post-net.

Here is the example code, written with the TensorFlow 1.x tf.layers API:

import tensorflow as tf  # TensorFlow 1.x API (tf.layers)

class PostNet():
    def output(self, x, filters = 512, trainable = True):
        '''
        :param x: mel spectrogram, [batch, sequence_length, 80]
        :param filters: number of filters in each hidden conv layer
        :param trainable: passed to batch normalization as training; True uses batch statistics
        :return: [batch, sequence_length, 80]
        '''
        layers = 4 # hidden layers: conv1d + batch normalization + tanh
        for i in range(layers):
            x = tf.layers.conv1d(x, filters = filters, kernel_size = 5, use_bias = True, padding = 'same', name = 'postnet_'+str(i))
            x = tf.layers.batch_normalization(x, axis=-1, training=trainable, name='postnet_bm'+str(i))
            x = tf.tanh(x)
        # final layer: project back to 80 channels, with no tanh
        x = tf.layers.conv1d(x, filters=80, kernel_size=5, use_bias=True, padding='same', name='postnet_5')
        x = tf.layers.batch_normalization(x, axis=-1, training=trainable, name='postnet_bm_5')

        return x

Post-Net contains 5 conv1d layers, which we can compute with tf.layers.conv1d().

Understand TensorFlow tf.layers.conv1d() with Examples – TensorFlow Tutorial
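The convolutions above use padding = 'same', which is why the sequence length is unchanged after each layer. As a rough illustration, here is a single-channel sketch in plain Python, assuming stride 1 and zero padding. The function conv1d_same is a hypothetical helper, not the TensorFlow implementation, and the moving-average kernel only stands in for a learned filter.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero 'same' padding and stride 1 (one channel)."""
    k = len(kernel)
    pad_left = (k - 1) // 2        # 2 zeros on the left for a width-5 kernel
    pad_right = k - 1 - pad_left   # 2 zeros on the right for a width-5 kernel
    padded = [0.0] * pad_left + list(signal) + [0.0] * pad_right
    # One output per input position, so the length is preserved.
    return [
        sum(padded[t + j] * kernel[j] for j in range(k))
        for t in range(len(signal))
    ]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [0.2] * 5  # width-5 moving average, standing in for a learned filter
out = conv1d_same(signal, kernel)
print(len(signal), len(out))
```

With a width-5 kernel, two zeros are added on each side, so every time step still gets one output value.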

Then, we use tf.layers.batch_normalization() to apply batch normalization after each convolution.

A Step Guide to Implement Batch Normalization in TensorFlow – TensorFlow Tutorial
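For intuition, the training-mode computation that batch normalization performs on one channel can be sketched in plain Python. This is a simplified illustration, not the TensorFlow implementation: batch_norm_train is a hypothetical helper, gamma and beta stand for the learned scale and offset, and the epsilon value of 1e-3 is assumed to match the tf.layers.batch_normalization() default.

```python
import math

def batch_norm_train(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one channel using the mean and variance of the given values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # Scale to roughly zero mean and unit variance, then apply gamma and beta.
    return [gamma * (v - mean) / math.sqrt(var + epsilon) + beta for v in values]

# All values of a single channel, gathered over the batch and time axes.
channel = [0.5, -1.0, 2.0, 3.5, 0.0]
normed = batch_norm_train(channel)
print(sum(normed) / len(normed))
```

With axis=-1, each of the channels is normalized separately in this way. When training a real model in TensorFlow 1.x, the moving-average updates collected in tf.GraphKeys.UPDATE_OPS also need to be run alongside the training step.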

The input x should have the shape [batch_size, sequence_length, 80], because both papers above work with 80-channel mel spectrograms, so the size of x on axis = -1 is 80.

We can evaluate the PostNet layer as follows:

import tensorflow as tf
import numpy as np

# random input: batch of 15, sequence length 512, 80 mel channels
inputs = tf.Variable(tf.truncated_normal([15, 512, 80], stddev=0.1), name="inputs")
pnet = PostNet()
out = pnet.output(inputs)

init = tf.global_variables_initializer()
init_local = tf.local_variables_initializer()
with tf.Session() as sess:
    sess.run([init, init_local])
    np.set_printoptions(precision=4, suppress=True)
    a = sess.run(out)
    print(a.shape)

Running this code, we will get:

(15, 512, 80)

The output has the same shape as the input x.
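Matching shapes matter because, in the Natural TTS Synthesis paper, the post-net predicts a residual that is added to the predicted mel spectrogram. The sketch below shows that element-wise sum on small nested lists in plain Python. The helper add_residual and the toy values are hypothetical, and real code would add the two tensors directly.

```python
def add_residual(mel, residual):
    """Element-wise sum of two [batch][time][channel] nested lists of equal shape."""
    return [
        [
            [m + r for m, r in zip(mel_frame, res_frame)]
            for mel_frame, res_frame in zip(mel_seq, res_seq)
        ]
        for mel_seq, res_seq in zip(mel, residual)
    ]

# Toy example: batch of 1, 2 time steps, 3 channels (a real mel has 80 channels).
mel = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
residual = [[[0.01, 0.0, -0.01], [0.0, 0.02, 0.0]]]
refined = add_residual(mel, residual)
print(len(refined), len(refined[0]), len(refined[0][0]))
```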
