Many deep learning models can convert text to speech. In this tutorial, we will show how to do it with a VITS model.
VITS model
We can use a pretrained VITS model to convert text to speech. You can download one here:
https://huggingface.co/NeuML/ljspeech-vits-onnx
This model is trained on LJSpeech, a dataset of recordings from a single female speaker, so the generated speech has a female voice.
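If you would rather fetch the files from code than from the browser, one option is the huggingface_hub package. The sketch below is only an illustration: huggingface_hub is not used elsewhere in this tutorial and is assumed to be installed, and the helper name download_model is hypothetical. The target directory is chosen to match the ljspeech-vits-onnx path this tutorial reads the model from.

```python
REPO_ID = "NeuML/ljspeech-vits-onnx"   # model repository linked above
LOCAL_DIR = "ljspeech-vits-onnx"       # directory the tutorial code reads from


def download_model(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download every file of the model repository into local_dir.

    Hypothetical helper: assumes the huggingface_hub package is installed.
    Returns the path of the local directory that holds the files.
    """
    # Imported here so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling download_model() once should leave model.onnx and config.yaml inside the ljspeech-vits-onnx directory, provided the repository layout is unchanged.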
How to convert text to speech using a VITS model
After we have downloaded a pretrained VITS model, we can use the code below to convert text to speech. It needs the onnxruntime, soundfile, pyyaml and ttstokenizer packages.
import onnxruntime
import soundfile as sf
import yaml

from ttstokenizer import TTSTokenizer

# This example assumes the files have been downloaded locally
with open("ljspeech-vits-onnx/config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Create model
model = onnxruntime.InferenceSession(
    "ljspeech-vits-onnx/model.onnx",
    providers=["CPUExecutionProvider"]
)

# Create tokenizer
tokenizer = TTSTokenizer(config["token"]["list"])

# Tokenize inputs
inputs = tokenizer("Create a nice study space Doing homework on the kitchen counter, living room or dining table may not be that beneficial for your child.")

# Show the token ids that will be fed to the model
print(inputs)

# Generate speech
outputs = model.run(None, {"text": inputs})

# Write to file at a 22050 Hz sample rate
sf.write("out.wav", outputs[0], 22050)
This code converts the text “Create a nice study space Doing homework on the kitchen counter, living room or dining table may not be that beneficial for your child.” to speech and saves it to the file out.wav with a sample rate of 22050 Hz.
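To check the result without opening an audio player, we can read the file back and compute its length. The sketch below uses only Python's standard wave module. It assumes out.wav was written as a standard PCM WAV file; the helper name wav_duration_seconds is hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return (sample_rate, duration_in_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()    # samples per second, 22050 in this tutorial
        frames = wav.getnframes()    # number of samples per channel
    return rate, frames / rate
```

For example, wav_duration_seconds("out.wav") should report a sample rate of 22050 together with the length of the generated speech in seconds.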
As we can see, converting text to speech with a pretrained VITS model takes only a few lines of code.