Implement Chinese Text Normalization for TTS and ASR – Python Speech Processing

June 28, 2022

When building a Chinese TTS or ASR system, we should apply text normalization to the input text. In this tutorial, we will show how to do it in Python.

Text Normalization

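Text normalization rewrites written forms such as numbers, prices, and percentages as the words a speaker would actually say.

For example, the sentence “12.5元” should be converted to “十二点五元”. The TTS system can then convert the Chinese characters to pinyin and synthesize the audio correctly.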
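To make the idea concrete, here is a minimal sketch of reading plain decimal numbers aloud in Chinese. It is only an illustration: the function names are hypothetical, it handles integer parts from 0 to 9999 only, and it is not how the library introduced below is implemented.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # units for ones, tens, hundreds, thousands


def read_integer(n: int) -> str:
    """Read an integer in the range 0..9999 as Chinese numerals (sketch only)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    digits = [int(c) for c in str(n)]
    out = []
    pending_zero = False
    for i, d in enumerate(digits):
        unit = UNITS[len(digits) - 1 - i]
        if d == 0:
            # Remember the gap; a single 零 is spoken only if a digit follows.
            pending_zero = True
            continue
        if pending_zero and out:
            out.append(DIGITS[0])
        pending_zero = False
        out.append(DIGITS[d] + unit)
    text = "".join(out)
    # 10..19 are spoken as 十, 十一, ... rather than 一十, 一十一, ...
    if text.startswith("一十"):
        text = text[1:]
    return text


def read_decimal(text: str) -> str:
    """Read a string such as '12.5': integer part as a number, fraction digit by digit."""
    integer_part, _, fraction = text.partition(".")
    spoken = read_integer(int(integer_part))
    if fraction:
        spoken += "点" + "".join(DIGITS[int(c)] for c in fraction)
    return spoken


# Hypothetical pattern: ASCII digits with an optional decimal part.
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def normalize_numbers(sentence: str) -> str:
    """Replace every plain number in the sentence with its spoken form."""
    return NUMBER.sub(lambda m: read_decimal(m.group()), sentence)


print(normalize_numbers("12.5元"))  # 十二点五元
```

A real normalizer needs many more rules (dates, phone numbers, percentages, large numbers), which is why an existing library is the practical choice.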

To get the pinyin of a Chinese sentence, you can read:

Python Convert Chinese String to Pinyin: A Step Guide – Python Tutorial

If you want to get Chinese phonemes, you can view:

Extract Mandarin Chinese Phonemes in TTS – TTS Tutorial

How to implement Chinese text normalization?

We can use this open-source project:

https://github.com/speechio/chinese_text_normalization

Here is an example:

Step 1: import the package (cn_tn.py from the repository must be importable)

from cn_tn import TextNorm

Step 2: create some Chinese sentences

lines = ["AI技术的发展","你说什么啊!","这块黄金重达324.75克。","随便来几个价格12块5,34.5元,20.1万","有62%的概率。"]

Step 3: create a normalizer

normalizer = TextNorm(
    to_banjiao=True,
    to_upper=True,
    to_lower=False,
    remove_fillers=False,
    remove_erhua=False,
    check_chars=False,
    remove_space=True,
)

Step 4: normalize each sentence

for line in lines:
    line_norm = normalizer(line)
    print(line,"-->",line_norm)

Running this code, we will see:

AI技术的发展 --> AI技术的发展
你说什么啊! --> 你说什么啊
这块黄金重达324.75克。 --> 这块黄金重达三百二十四点七五克
随便来几个价格12块5,34.5元,20.1万 --> 随便来几个价格十二块五三十四点五元二十点一万
有62%的概率。 --> 有百分之六十二的概率

Parameters in TextNorm

The parameters are defined in cn_tn.py. They are:

  • to_banjiao: convert full-width (quanjiao) characters to half-width (banjiao)
  • to_upper: convert letters to upper case
  • to_lower: convert letters to lower case
  • remove_fillers: remove filler characters such as “呃” and “啊”
  • remove_erhua: remove erhua characters, e.g. “他女儿在那边儿” -> “他女儿在那边”
  • check_chars: skip sentences containing illegal characters
  • remove_space: remove whitespace
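To illustrate what a full-width to half-width conversion like to_banjiao involves, here is a minimal sketch based on Unicode code points. The function name to_halfwidth is hypothetical and this is not the library's own code. It relies on the fact that the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII, and that the ideographic space U+3000 corresponds to an ordinary space.

```python
def to_halfwidth(text: str) -> str:
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic (full-width) space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width '！'..'～' -> ASCII '!'..'~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


print(to_halfwidth("ＡＩ技术，１２．５元！"))  # AI技术,12.5元!
```

Characters outside those ranges, such as the Chinese full stop “。”, are left unchanged by this sketch.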

Notice that TextNorm also removes Chinese punctuation, as the output above shows. The punctuation falls into two groups.

Non-stop punctuation:

'"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'

Stop punctuation:

'!?。。'
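As a rough sketch of how punctuation sets like these can be stripped from a sentence, the example below uses str.translate with a deletion table. The two sets here are a small illustrative subset chosen for this example, not the library's exact lists, and the function name remove_punctuation is hypothetical.

```python
# Illustrative subsets only; the real lists are longer.
STOP_PUNCS = "！？。"
NONSTOP_PUNCS = "，、：；“”‘’（）《》…—"

# A translation table that maps every listed character to None (deletion).
_DELETE_TABLE = str.maketrans("", "", STOP_PUNCS + NONSTOP_PUNCS)


def remove_punctuation(text: str) -> str:
    """Delete every character found in the two punctuation sets."""
    return text.translate(_DELETE_TABLE)


print(remove_punctuation("你说什么啊！"))  # 你说什么啊
```

Keeping stop and non-stop punctuation in separate sets makes it possible to treat them differently later, for example splitting sentences on stop punctuation before deleting the rest.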
