When building a Chinese TTS or ASR system, we need text normalization. In this tutorial, we will introduce how to do it.
Text Normalization
Text normalization converts written forms such as numbers, prices and percentages into the words that are actually spoken.
For example, the sentence “12.5元” should be converted to “十二点五元”. Then TTS can convert the Chinese pinyin to audio correctly.
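To make the idea concrete, here is a minimal sketch of how a decimal number can be read out in Chinese. It is not how chinese_text_normalization works internally; the function names int_to_chinese, number_to_chinese and normalize_numbers are hypothetical, and the sketch only handles integer parts from 0 to 9999 and ignores units, percentages and dates.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # place values for a number below 10000


def int_to_chinese(n: int) -> str:
    """Read an integer in the range 0..9999 aloud in Chinese."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    out = []
    pending_zero = False  # zeros between non-zero digits are read once as 零
    s = str(n)
    for i, ch in enumerate(s):
        d = int(ch)
        unit = UNITS[len(s) - 1 - i]
        if d == 0:
            pending_zero = True
            continue
        if pending_zero:
            out.append(DIGITS[0])
            pending_zero = False
        out.append(DIGITS[d] + unit)
    text = "".join(out)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    if 10 <= n <= 19:
        text = text[1:]
    return text


def number_to_chinese(token: str) -> str:
    """Read a token such as '12.5': integer part by place value, fraction digit by digit."""
    integer, _, fraction = token.partition(".")
    text = int_to_chinese(int(integer))
    if fraction:
        text += "点" + "".join(DIGITS[int(c)] for c in fraction)
    return text


def normalize_numbers(sentence: str) -> str:
    """Replace every ASCII number in the sentence with its Chinese reading."""
    return re.sub(r"[0-9]+(?:\.[0-9]+)?",
                  lambda m: number_to_chinese(m.group()), sentence)


print(normalize_numbers("12.5元"))  # 十二点五元
```

A real normalizer has to cover many more cases (large numbers, phone numbers, dates, money, percentages), which is why an existing library is the better choice in practice.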
To get the pinyin of a Chinese sentence, you can read:
Python Convert Chinese String to Pinyin: A Step Guide – Python Tutorial
If you want to get Chinese phonemes, you can view:
Extract Mandarin Chinese Phonemes in TTS – TTS Tutorial
How to implement Chinese text normalization?
We can use this open source project:
https://github.com/speechio/chinese_text_normalization
Here is an example:
Step 1: import the package
from cn_tn import TextNorm
Step 2: create some Chinese sentences
lines = ["AI技术的发展","你说什么啊!","这块黄金重达324.75克。","随便来几个价格12块5,34.5元,20.1万","有62%的概率。"]
Step 3: create a normalizer to implement text normalization.
normalizer = TextNorm(
    to_banjiao = True,
    to_upper = True,
    to_lower = False,
    remove_fillers = False,
    remove_erhua = False,
    check_chars = False,
    remove_space = True,
)
Step 4: start to normalize
for line in lines:
    line_norm = normalizer(line)
    print(line, "-->", line_norm)
Run this code, and we will see:
AI技术的发展 --> AI技术的发展
你说什么啊! --> 你说什么啊
这块黄金重达324.75克。 --> 这块黄金重达三百二十四点七五克
随便来几个价格12块5,34.5元,20.1万 --> 随便来几个价格十二块五三十四点五元二十点一万
有62%的概率。 --> 有百分之六十二的概率
Parameters in TextNorm
The parameters are defined in cn_tn.py. They are:
- to_banjiao: convert quanjiao (full-width) chars to banjiao (half-width)
- to_upper: convert to upper case
- to_lower: convert to lower case
- remove_fillers: remove filler chars such as “呃” and “啊”
- remove_erhua: remove the erhua suffix “儿”, for example “他女儿在那边儿” -> “他女儿在那边”
- check_chars: skip sentences containing illegal chars
- remove_space: remove whitespace
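To illustrate what an option like to_banjiao does, here is a minimal sketch of full-width to half-width conversion. It relies on the Unicode layout, where the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts and U+3000 is the full-width space. The function name fullwidth_to_halfwidth is hypothetical, and this is not necessarily how cn_tn.py implements the option.

```python
def fullwidth_to_halfwidth(text: str) -> str:
    """Map full-width (quanjiao) characters to half-width (banjiao) ones."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic (full-width) space becomes an ASCII space.
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width ASCII variants are offset by 0xFEE0 from ASCII.
            out.append(chr(code - 0xFEE0))
        else:
            # Everything else, including Chinese characters, is kept as-is.
            out.append(ch)
    return "".join(out)


# Full-width "ＡＩ１２" written with escapes so the input is unambiguous.
print(fullwidth_to_halfwidth("\uff21\uff29\uff11\uff12技术"))  # AI12技术
```

Converting to half-width first means that later steps, such as number matching or upper-casing, only need to deal with ASCII letters and digits.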
Note that TextNorm also removes all Chinese punctuation. The punctuation falls into two groups:
Non-stop punctuation:
'"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
Stop punctuation:
'!?。。'
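If you want to reproduce this punctuation stripping in your own code, a minimal sketch with str.translate could look like the following. The two constants below are small illustrative subsets chosen for this example, not the full lists from cn_tn.py, and the function name strip_chinese_punctuation is hypothetical.

```python
# Illustrative subsets only; the library's own lists are longer.
STOP_PUNCS = "！？。"
NON_STOP_PUNCS = "，、；：“”‘’（）《》【】…—"

# A translation table that maps every listed character to None (deletion).
_DELETE_TABLE = str.maketrans("", "", STOP_PUNCS + NON_STOP_PUNCS)


def strip_chinese_punctuation(text: str) -> str:
    """Remove the listed Chinese punctuation marks from the text."""
    return text.translate(_DELETE_TABLE)


print(strip_chinese_punctuation("有百分之六十二的概率。"))  # 有百分之六十二的概率
```

Keeping stop and non-stop punctuation in separate constants is useful when you want to split a long text into sentences on the stop marks before deleting the rest.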