Split a String by Multiple String Delimiters in Python – Python Tutorial

May 9, 2024

It is easy to split a Python string with the split() method.

For example:

text = "tutorialexample.com is a popular tutorial site."
x = text.split(".")
print(x)

Here text is split on the . delimiter, and we get a list of strings:

['tutorialexample', 'com is a popular tutorial site', '']
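Note that the result ends with an empty string, because the text itself ends with the delimiter. If such empty pieces are unwanted, one option is to filter them out. A minimal sketch (the variable name parts is only for illustration):

```python
text = "tutorialexample.com is a popular tutorial site."

# Keep only the non-empty pieces produced by split()
parts = [p for p in text.split(".") if p]
print(parts)
```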

However, split() accepts only one separator at a time. What if we need to split a Python string by several string delimiters?

How to split a Python string with multiple string delimiters?

For example:

text = "this is a test<b>test</b>, hi good boy<em>nice</em>me"

and our delimiters are stored in a Python list:

d = ["<b>", "</b>", "<em>", "</em>"]

We can use a Python regular expression to do this.

import re

# Matches "<", an optional "/", 1 to 10 characters that are not "/", then ">"
pattern = "<[/]{0,1}[^/]{1,10}>"
sentences = re.split(pattern, text)
print(sentences)

In this example, the pattern matches every delimiter in the list d.

Running this code, we get:

['this is a test', 'test', ', hi good boy', 'nice', 'me']
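One caveat: [^/] also matches < and >, so this pattern can swallow the text between two opening tags. The sketch below compares it with a stricter pattern. The sample string and the names loose and strict are made up for illustration, and the stricter pattern is only one possible choice, not part of the original example.

```python
import re

sample = "a<b>x<i>y"  # hypothetical input with two opening tags in a row

loose = "<[/]{0,1}[^/]{1,10}>"  # the pattern used in this tutorial
strict = r"</?[^<>/]{1,10}>"    # assumption: tag names never contain <, > or /

# The loose pattern treats "<b>x<i>" as a single delimiter
loose_parts = re.split(loose, sample)
# The strict pattern splits on "<b>" and "<i>" separately
strict_parts = re.split(strict, sample)

print(loose_parts)
print(strict_parts)
```

On the tutorial's own text both patterns give the same result; they only differ when two tags appear without a / between them.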

If you also want to keep the delimiters in the result, wrap the pattern in a capturing group:

pattern = "<[/]{0,1}[^/]{1,10}>"
# The parentheses form a capturing group, so re.split() also returns the delimiters
sentences = re.split(f"({pattern})", text)
print(sentences)

Then, we will see:

['this is a test', '<b>', 'test', '</b>', ', hi good boy', '<em>', 'nice', '</em>', 'me']
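With exactly one capturing group, the list returned by re.split() alternates between text and delimiter, so the two can be separated by slicing, and joining all pieces restores the original string. A small sketch of this (the names pieces, chunks and delims are illustrative):

```python
import re

text = "this is a test<b>test</b>, hi good boy<em>nice</em>me"
pattern = "<[/]{0,1}[^/]{1,10}>"

pieces = re.split(f"({pattern})", text)

chunks = pieces[0::2]  # even positions: the text between delimiters
delims = pieces[1::2]  # odd positions: the delimiters themselves

print(chunks)
print(delims)
# Nothing is lost, so joining the pieces gives back the input
print("".join(pieces) == text)
```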

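If it is hard to write a regular expression by hand, we can build the pattern from the delimiter list instead. Each delimiter is passed through re.escape() so that characters such as . or | are matched literally:

import re

text = "this is a test<b>test</b>, hi good boy<em>nice</em>me"
d = ["<b>", "</b>", "<em>", "</em>"]
# Escape each delimiter, then join them with | (regex "or")
pattern = "|".join(re.escape(x) for x in d)
print(pattern)
sentences = re.split(f"({pattern})", text)
print(sentences)

Running this code on Python 3.7 or later, we get: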

<b>|</b>|<em>|</em>
['this is a test', '<b>', 'test', '</b>', ', hi good boy', '<em>', 'nice', '</em>', 'me']
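The same idea can be wrapped in a small reusable function. The sketch below is one possible version, not a standard library API: split_by_delimiters and its keep parameter are hypothetical names. It escapes each delimiter so that regex metacharacters are treated literally, and it tries longer delimiters first so that <= is not split as < followed by =. It assumes the delimiter list is non-empty.

```python
import re

def split_by_delimiters(text, delimiters, keep=False):
    """Split text on any of the given literal delimiters (hypothetical helper)."""
    # Longest first, so "<=" wins over "<" when both are delimiters
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape makes characters such as "." or "|" match literally
    pattern = "|".join(re.escape(x) for x in ordered)
    if keep:
        # A capturing group makes re.split() return the delimiters too
        pattern = f"({pattern})"
    return re.split(pattern, text)

print(split_by_delimiters("a.b|c", [".", "|"]))
print(split_by_delimiters("a<=b<c", ["<", "<="], keep=True))
```

Without the escaping step, a delimiter such as . would match any character, and without the sorting step the second call would split <= into < and a stray =.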

More examples:

import re

text = "The program calculates f0, energy and duration features from speech wav-file, performs continuous wavelet analysis"
d = ["ro", "en", "ure", "orm"]
# Escape each delimiter, then join them with | (regex "or")
pattern = "|".join(re.escape(x) for x in d)
print(pattern)
# With a capturing group, the delimiters are kept in the result
sentences = re.split(f"({pattern})", text)
print(sentences)
# Without it, the delimiters are dropped
sentences = re.split(pattern, text)
print(sentences)

We will get:

ro|en|ure|orm
['The p', 'ro', 'gram calculates f0, ', 'en', 'ergy and duration feat', 'ure', 's f', 'ro', 'm speech wav-file, perf', 'orm', 's continuous wavelet analysis']
['The p', 'gram calculates f0, ', 'ergy and duration feat', 's f', 'm speech wav-file, perf', 's continuous wavelet analysis']