We often use Python regular expression flags with functions such as re.search() and re.findall(). In this tutorial, we will introduce how to use these flags for Python beginners.
Why we should use regular expression flags
The key thing to remember is that, without flags, Python regular expressions are case-sensitive, the dot (.) matches any character except a newline, and ^ and $ match only at the start and end of the whole string. A pattern can still match across lines if it uses something that matches newlines, such as \s or a negated character class like [^<].
To match case-insensitively, to let . match newlines, or to make ^ and $ work on each line, we should use regular expression flags.
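As a minimal sketch of these defaults, the snippet below runs a few patterns without any flag. The sample string and the variable names are made up for illustration.

```python
import re

text = "Hello\nworld"

# Case-sensitive by default: 'hello' does not match 'Hello'.
case_result = re.search(r"hello", text)

# By default . does not match a newline, so this cannot cross the line break.
dot_result = re.search(r"Hello.world", text)

# \s and negated character classes do match newlines without any flag.
space_result = re.search(r"Hello\sworld", text)
class_result = re.search(r"Hello[^<]world", text)

# By default ^ matches only at the very start of the string.
caret_result = re.findall(r"^\w+", text)

print(case_result)                   # None
print(dot_result)                    # None
print(space_result.group() == text)  # True
print(class_result.group() == text)  # True
print(caret_result)                  # ['Hello']
```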
A list of common Python regular expression flags
Here are some commonly used flags in Python applications.
syntax | long syntax | meaning |
---|---|---|
re.I | re.IGNORECASE | ignore case. |
re.M | re.MULTILINE | make ^ and $ match at the start and end of each line, not only at the start and end of the whole string. |
re.S | re.DOTALL | make . match newline too. |
re.A | re.ASCII | make \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only. |
re.X | re.VERBOSE | allow a pattern to be written across multiple lines; whitespace and # comments in the pattern are ignored. |
Note: you can combine several flags with |, for example: re.I | re.S.
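Here is a small sketch of combining flags with |. The sample text and variable names are hypothetical. It compares re.M on its own with re.M | re.I, and shows re.S | re.I letting one match span all lines.

```python
import re

text = "First line\nSECOND line\nthird LINE"

# re.M alone: ^ matches at the start of every line, but case still matters.
only_m = re.findall(r"^[a-z]+", text, re.M)

# re.M | re.I: ^ matches at every line start and case is ignored.
m_and_i = re.findall(r"^[a-z]+", text, re.M | re.I)

# re.S | re.I: . also matches newlines, so one match can span all three lines.
s_and_i = re.search(r"first.*line", text, re.S | re.I)

print(only_m)                   # ['third']
print(m_and_i)                  # ['First', 'SECOND', 'third']
print(s_and_i.group() == text)  # True
```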
An example with re.I, re.M and re.S
Here we will use a Python regular expression to remove JavaScript from a string as an example.
The text is:
    import re

    text = '''
    this is a script test.
    <Script type="text/javascript">
    alert('test')
    </script>
    test is end.
    '''
If you want to remove the JavaScript, you might try this:
Regular expression without re.I, re.M and re.S
    # Raw string, so that \s reaches the regex engine unchanged.
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>')
    text = re_script.sub('', text)
    print(text)
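From the result, you will find text is unchanged, which means the JavaScript was not removed. The reason is:
1. The text contains Script with a capital S, not script, and matching is case-sensitive by default.
Note that <Script type="text/javascript"> and </script> being on different lines is not the problem here: [^<]* is a negated character class, so it matches newlines too.
To remove this JavaScript, we should:
1. Make the regular expression case-insensitive with re.I.
The code below also passes re.M, but re.M only changes how ^ and $ behave, and this pattern contains neither, so re.I is the flag that matters.
Change our regular expression to: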
    # re.I makes 'script' also match 'Script'.
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.M | re.I)
    text = re_script.sub('', text)
    print(text)
Run this Python script and we get the result:
this is a script test. test is end.
This means the JavaScript has been removed.
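To see which flag does the work, here is a minimal sketch that applies the same pattern with no flag, with only re.M, and with only re.I. It assumes the sample text is laid out with the tags on separate lines. The variable names are made up for illustration.

```python
import re

# Assumed layout of the sample text: the tags sit on separate lines.
text = '''
this is a script test.
<Script type="text/javascript">
alert('test')
</script>
test is end.
'''

pattern = r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>'

no_flags = re.sub(pattern, '', text)
only_m = re.sub(pattern, '', text, flags=re.M)
only_i = re.sub(pattern, '', text, flags=re.I)

print(no_flags == text)     # True: 'script' does not match 'Script'
print(only_m == text)       # True: re.M only affects ^ and $
print('<Script' in only_i)  # False: re.I alone removes the script
```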
However, we can also do it like this:
    # .*? is non-greedy; with re.S it can also cross newlines.
    re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)
    text = re_script.sub('', text)
    print(text)
In this code, we use re.S instead of re.M, because re.S makes . match newlines, which lets .*? span the lines between the opening and closing tags. This regular expression also removes the JavaScript.
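The two patterns are not fully interchangeable. As a rough sketch, the hypothetical sample below has a script body that contains a < character: the [^<]* version stops there, the .*? version without re.S cannot cross the newlines, and only .*? with re.S removes the block.

```python
import re

# Hypothetical sample: the script body contains a '<' comparison.
html = '''before
<script>
if (a < b) { alert('x') }
</script>
after'''

char_class = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)
dot_no_s = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.I)
dot_with_s = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)

class_match = char_class.search(html)  # None: [^<]* stops at the '<' in 'a < b'
dot_match = dot_no_s.search(html)      # None: . cannot cross the newlines
cleaned = dot_with_s.sub('', html)     # the whole script block is removed

print(class_match)
print(dot_match)
print(repr(cleaned))  # 'before\n\nafter'
```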
An example with re.A
re.A makes character classes such as \w match ASCII characters only. Here is an example.
    import re

    p1 = re.compile(r'\w{1,}', re.A)  # ASCII word characters only
    p2 = re.compile(r'\w{1,}')        # Unicode word characters (default for str)
    text = 'https://www.tutorialexample.com是一个博客网站'
    r1 = re.findall(p1, text)
    r2 = re.findall(p2, text)
    print(r1)
    print(r2)
In this example, we write two regular expressions, one with re.A and one without. p1 (with re.A) can only match ASCII word characters. p2, however, also matches Unicode word characters, such as the Chinese characters in the text.
r1 is:
['https', 'www', 'tutorialexample', 'com']
r2 is:
['https', 'www', 'tutorialexample', 'com是一个博客网站']
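re.A affects \d in the same way. As a small sketch with a made-up sample string, \d matches full-width digits by default, but only the ASCII digits 0-9 with re.A.

```python
import re

# Hypothetical sample containing ASCII digits and full-width digits.
text = 'room 42 and room ４２'

unicode_digits = re.findall(r'\d+', text)
ascii_digits = re.findall(r'\d+', text, re.A)

print(unicode_digits)  # ['42', '４２']
print(ascii_digits)    # ['42']
```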
An example with re.X
re.X allows us to write a regular expression across multiple lines and add comments to it. Here is an example.
    import re

    p1 = re.compile(r"""\d+  # the integral part
                        \.   # the decimal point
                        \d*  # some fractional digits""", re.X)
    p2 = re.compile(r'\d+\.\d*')
    text = '12.12dfa122.232ed.da.34'
    r1 = re.findall(p1, text)
    print(r1)
    r2 = re.findall(p2, text)
    print(r2)
In this example, we write the same regular expression twice, once across multiple lines and once on a single line. The results show that they are equivalent:
    ['12.12', '122.232']
    ['12.12', '122.232']
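One thing to watch with re.X: because whitespace and # are treated specially, a literal space or # in the pattern must be escaped or placed in a character class. Here is a minimal sketch with a hypothetical sample string.

```python
import re

text = 'price: 42 USD  # discounted'

# In verbose mode the space in the pattern is ignored, so this looks for '42USD'.
ignored_space = re.search(r'\d+ USD', text, re.X)

# Escape the space, or put it in a character class, to match it literally.
escaped_space = re.search(r'\d+\ USD', text, re.X)
class_space = re.search(r'\d+[ ]USD', text, re.X)

# A literal '#' must also be escaped, otherwise it starts a comment.
literal_hash = re.search(r'\#\ \w+', text, re.X)

print(ignored_space)          # None
print(escaped_space.group())  # 42 USD
print(class_space.group())    # 42 USD
print(literal_hash.group())   # # discounted
```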