Convert Text to Speech in Python Using VITS Model – Python Tutorial

admin

1 year ago

Many deep learning models can convert text to speech. In this tutorial, we will introduce how to use a VITS model to do it.

VITS model

We can use a pretrained VITS model to convert text to speech. You can download one here:

https://huggingface.co/NeuML/ljspeech-vits-onnx

This model is trained on LJSpeech, a dataset recorded by a single female speaker, so the generated voice is female.
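The conversion code in this tutorial reads config.yaml and model.onnx from a local ljspeech-vits-onnx folder. As a minimal sketch of one way to fetch them, the helper below uses only the Python standard library. The names file_url and download_model are hypothetical, and the URL layout (resolve/main/ followed by the file name) is an assumption about how Hugging Face serves repository files, so adjust it if the download fails.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: Hugging Face serves raw repository files at /resolve/<revision>/<file>.
REPO_URL = "https://huggingface.co/NeuML/ljspeech-vits-onnx"
FILES = ["config.yaml", "model.onnx"]  # the two files the tutorial code reads


def file_url(name, repo_url=REPO_URL, revision="main"):
    """Build the direct-download URL for one file in the repository."""
    return f"{repo_url}/resolve/{revision}/{name}"


def download_model(target_dir="ljspeech-vits-onnx"):
    """Download each file into target_dir, skipping files that already exist."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name in FILES:
        destination = target / name
        if not destination.exists():
            urlretrieve(file_url(name), str(destination))
    return target
```

Calling download_model() once before running the conversion code should leave both files in place. Downloading the files manually from the model page works just as well.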

How to convert text to speech using a VITS model

After we have downloaded a pretrained VITS model, we can use the code below to convert text to speech.

import onnxruntime
import soundfile as sf
import yaml

from ttstokenizer import TTSTokenizer

# This example assumes the files have been downloaded locally
with open("ljspeech-vits-onnx/config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Create model
model = onnxruntime.InferenceSession(
    "ljspeech-vits-onnx/model.onnx",
    providers=["CPUExecutionProvider"]
)

# Create tokenizer
tokenizer = TTSTokenizer(config["token"]["list"])

# Tokenize inputs
inputs = tokenizer("Create a nice study space Doing homework on the kitchen counter, living room or dining table may not be that beneficial for your child.")
print(inputs)

# Generate speech
outputs = model.run(None, {"text": inputs})

# Write to file at a 22050 Hz sample rate
sf.write("out.wav", outputs[0], 22050)

This code converts the text “Create a nice study space Doing homework on the kitchen counter, living room or dining table may not be that beneficial for your child.” to speech and saves it to the file out.wav.
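To check the result without opening an audio player, we can read back the header of the saved file. The sketch below uses the standard wave module. The helper name wav_info is hypothetical, and it assumes out.wav is stored as plain PCM data, because the wave module cannot read floating-point WAV files.

```python
import wave


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        # Duration is the number of frames divided by frames per second
        return rate, wav.getnchannels(), frames / rate
```

For example, print(wav_info("out.wav")) should report a sample rate of 22050, matching the value passed to sf.write.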

As you can see, it is very easy to do.
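If we want to convert several texts, the tokenize and run steps can be wrapped in a small function. This is only a sketch: synthesize is a hypothetical helper name, and it assumes the tokenizer and the ONNX session are created exactly as in the code above, with "text" as the model input name.

```python
def synthesize(text, tokenizer, session, input_name="text"):
    """Tokenize text, run the model, and return its first output (the waveform)."""
    inputs = tokenizer(text)
    # session.run returns a list of outputs; the tutorial code saves the first one
    outputs = session.run(None, {input_name: inputs})
    return outputs[0]
```

With the objects from the tutorial code, a call could look like sf.write("hello.wav", synthesize("Hello world.", tokenizer, model), 22050).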