A Beginner’s Guide to Python Regular Expression Flags – Python Regular Expression Tutorial

By | October 28, 2019

We often pass regular expression flags to functions such as re.search() and re.findall(). In this tutorial, we will introduce how Python beginners can use these flags.

Why we should use regular expression flags

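The key thing you should remember is: without flags, a Python regular expression is case-sensitive, . matches any character except a newline, and ^ and $ anchor to the start and end of the whole string rather than to each line. A pattern can still match across several lines without a flag if it uses something that matches newlines, such as \s or a negated character class like [^<].

To make matching case-insensitive, or to change how ., ^ and $ handle newlines, we should use regular expression flags.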
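The default behavior can be checked with a short sketch. The sample string and variable names are made up for illustration.

```python
import re

text = "Hello\nWorld"

# Case-sensitive by default: lowercase "hello" does not match "Hello".
default_case = re.search(r"hello", text)
ignore_case = re.search(r"hello", text, re.I)

# By default . does not match "\n", so this pattern cannot cross the line break.
default_dot = re.search(r"Hello.World", text)
dotall = re.search(r"Hello.World", text, re.S)

# A class such as \s matches "\n" without any flag.
no_flag_multiline = re.search(r"Hello\sWorld", text)

print(default_case)                      # None
print(ignore_case.group())               # Hello
print(default_dot)                       # None
print(repr(dotall.group()))              # 'Hello\nWorld'
print(repr(no_flag_multiline.group()))   # 'Hello\nWorld'
```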

A list of common Python regular expression flags

Here are some commonly used flags in Python applications.

short syntax   long syntax     meaning
re.I           re.IGNORECASE   ignore case.
re.M           re.MULTILINE    make ^ and $ match at the start and end of each line, not only of the whole string.
re.S           re.DOTALL       make . match newline too.
re.A           re.ASCII        make \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only.
re.X           re.VERBOSE      allow a regular expression to be written over several lines; whitespace and # comments in the pattern are ignored.

Note: you can combine several flags with |, for example: re.I | re.S.
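As a small sketch of what re.M does and how flags are combined with |, consider the following. The sample text and variable names are assumptions made for illustration.

```python
import re

text = "first line\nSecond line\nthird line"

# Without re.M, ^ only matches at the very start of the string.
no_flag = re.findall(r"^\w+", text)

# With re.M, ^ also matches right after each newline.
multiline = re.findall(r"^\w+", text, re.M)

# Flags are combined with |: ^ and $ work per line (re.M)
# and "second" matches "Second" (re.I).
combined = re.findall(r"^second \w+$", text, re.M | re.I)

print(no_flag)    # ['first']
print(multiline)  # ['first', 'Second', 'third']
print(combined)   # ['Second line']
```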

An example with re.I and re.M and re.S

Here we will use a Python regular expression to remove JavaScript from a string as an example.

The text is:

import re
text = ''' 
  this is a script test.
  <Script type="text/javascript">
  alert('test')
  </script>
  test is end.
'''

If you want to remove the JavaScript, you can do it like this:

Regular expression without re.I, re.M and re.S

re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>')
text = re_script.sub('',text)

print(text)

From the result, you will find that text is not changed, which means the JavaScript was not removed. The reason is:

The opening tag in text is <Script ...>, not <script ...>, and the regular expression is case-sensitive by default.

The tags being on different lines is not the problem: [^<]* is a negated character class, so it matches newline characters too. We also do not need re.M here, because re.M only changes how ^ and $ behave and this pattern uses neither.

To remove this JavaScript, we only need to make the regular expression case-insensitive with re.I.

Change our regular expression to:

re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)
text = re_script.sub('', text)

print(text)

Run this Python script and we will get this result:

  this is a script test.
  
  test is end.

This means the JavaScript is removed.
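To see which flag actually makes the difference for this pattern, here is a small check that applies it with no flags, with re.M only, and with re.I only. The variable names are chosen for illustration.

```python
import re

text = '''
  this is a script test.
  <Script type="text/javascript">
  alert('test')
  </script>
  test is end.
'''

pattern = r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>'

no_flags = re.sub(pattern, '', text)
only_m = re.sub(pattern, '', text, flags=re.M)  # re.M only affects ^ and $
only_i = re.sub(pattern, '', text, flags=re.I)  # ignore case

print(no_flags == text)   # True: nothing removed
print(only_m == text)     # True: nothing removed
print('alert' in only_i)  # False: the script block is gone
```

For this pattern, re.I alone is enough, because [^<]* already matches across line breaks.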

However, we can also do it like this:

re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)
text = re_script.sub('',text)

print(text)

In this code, we use .*? instead of [^<]*, so we need re.S, which makes . match newlines as well. re.M would not help here, because it only changes how ^ and $ behave. This regular expression also removes the JavaScript.
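One practical difference between the two patterns shows up when the script body itself contains a < character. The following sketch uses a made-up html string to compare them, and also shows that .*? needs re.S to cross line breaks.

```python
import re

# Hypothetical input whose script body contains "<".
html = '''<script>
if (a < b) { alert('test') }
</script>
done'''

negated_class = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)
lazy_dotall = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)
lazy_no_dotall = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.I)

# [^<]* stops at the "<" inside the script body, so nothing is removed.
print(negated_class.sub('', html) == html)   # True

# .*? with re.S lazily matches everything up to the first closing tag.
print(repr(lazy_dotall.sub('', html)))       # '\ndone'

# Without re.S, . cannot match the newlines, so nothing is removed.
print(lazy_no_dotall.sub('', html) == html)  # True
```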

An example with re.A

re.A makes \w, \d, \s and the related classes match ASCII characters only. Here is an example.

import re

p1 = re.compile(r'\w{1,}', re.A)
p2 = re.compile(r'\w{1,}')

text = 'https://www.tutorialexample.com是一个博客网站'

r1 = re.findall(p1, text)
r2 = re.findall(p2, text)

print(r1)
print(r2)

In this example, we write two regular expressions, one with re.A and one without. With re.A, \w in p1 matches only ASCII word characters. Without it, \w in p2 also matches Unicode word characters, so the Chinese characters are included.

r1 is:

['https', 'www', 'tutorialexample', 'com']

r2 is:

['https', 'www', 'tutorialexample', 'com是一个博客网站']
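re.A affects \d in the same way. As a brief sketch, the sample text below (an assumption made for illustration) contains fullwidth digits, which are Unicode decimal digits but not ASCII.

```python
import re

# "\uff11\uff12\uff13" are the fullwidth digits 1, 2 and 3.
text = 'room \uff11\uff12\uff13, floor 4'

unicode_digits = re.findall(r'\d+', text)      # Unicode matching (default for str)
ascii_digits = re.findall(r'\d+', text, re.A)  # ASCII-only matching

print(unicode_digits)  # the fullwidth digits and '4'
print(ascii_digits)    # ['4']
```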

An example with re.X

re.X allows us to write a regular expression over several lines and add comments to it. Here is an example.

import re
p1 = re.compile(r"""\d+  # the integral part
                   \.    # the decimal point
                   \d*  # some fractional digits""", re.X)
p2 = re.compile(r'\d+\.\d*')

text = '12.12dfa122.232ed.da.34'

r1 = re.findall(p1, text)
print(r1)
r2 = re.findall(p2, text)
print(r2)

In this example, we write the same regular expression twice, once over several lines with re.X and once on a single line. They behave the same, as we can see from the result:

['12.12', '122.232']
['12.12', '122.232']
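One thing to watch for with re.X is that whitespace in the pattern is ignored, including spaces you meant literally. A minimal sketch, with pattern names chosen for illustration:

```python
import re

# In verbose mode the space in the pattern is ignored,
# so this behaves like r"helloworld".
ignored_space = re.compile(r"hello world", re.X)

# Put the space in a character class to match it literally.
literal_space = re.compile(r"hello[ ]world", re.X)

print(ignored_space.search("hello world"))          # None
print(ignored_space.search("helloworld").group())   # helloworld
print(literal_space.search("hello world").group())  # hello world
```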
