Best Practice to Remove JavaScript and CSS Style Code from Text with Python Regular Expressions

By | October 28, 2019

Removing JavaScript and CSS style code from a Python string is a common task after you have crawled a web page. In this tutorial, we will show how to remove both with Python regular expressions.

Import library

import re

Create a text that contains JavaScript and CSS style code

text = ''' 
  this is a script test.
  <Script type="text/javascript">
  alert('test')
  </script>
  test is end.
  <style>
        .MathJax, .MathJax_Message, .MathJax_Preview{
            display: none
        }
    </style>
'''

As you can see, the variable text contains some JavaScript and CSS style code.

Build a regular expression to remove JavaScript code

re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)
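To see what each part of this expression does, here is a sketch of the same pattern written with the re.X (verbose) flag, which lets whitespace and comments sit inside the pattern. The name re_script_verbose and the sample string are made up for illustration only.

```python
import re

# Same pattern as re_script, split across lines with re.X so each part can be commented.
re_script_verbose = re.compile(r'''
    <\s*script      # opening tag name, allowing spaces after the angle bracket
    [^>]*>          # any attributes, up to the end of the opening tag
    .*?             # the script body, matched non-greedily (as short as possible)
    <\s*/\s*script  # start of the closing tag, allowing spaces around the slash
    \s*>            # optional spaces before the end of the closing tag
''', re.S | re.I | re.X)

# Hypothetical sample with two separate script blocks.
sample = "a<script type='text/javascript'>alert(1)</script>b<script>x</script>c"
cleaned = re_script_verbose.sub('', sample)
print(cleaned)  # abc
```

Because .*? is non-greedy, each script block is removed separately and the text between the two blocks is kept.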

Build a regular expression to remove CSS style code

css_script = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.S | re.I)

re.I makes the match case-insensitive, and re.S lets . match newline characters as well. To learn more about these flags, you can read this tutorial:

A Beginner’s Guide to Python Regular Expressions Flags – Python Regular Expression Tutorial
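A quick way to check why both flags are needed is to try the pattern with and without them. This is a small sketch; the html string and the result names are made-up examples, not part of the tutorial text above.

```python
import re

# Hypothetical input: upper-case tag names and a body that spans several lines.
html = "<SCRIPT>\nalert('test')\n</SCRIPT>"
pattern = r'<\s*script[^>]*>.*?<\s*/\s*script\s*>'

no_flags = re.search(pattern, html)              # fails: case differs
only_i = re.search(pattern, html, re.I)          # fails: "." cannot cross the newlines
only_s = re.search(pattern, html, re.S)          # fails: "script" does not match "SCRIPT"
both = re.search(pattern, html, re.S | re.I)     # succeeds

print(no_flags, only_i, only_s)  # None None None
print(both is not None)          # True
```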

Remove JavaScript and CSS style code

text = re_script.sub('', text)
text = css_script.sub('', text)

print(text)

Run this Python script and you will find that both code blocks are removed. The result is:

  this is a script test.
  
  test is end.
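If you need this cleanup in more than one place, the two steps can be wrapped in a small helper. The following is only a sketch: the function name remove_script_and_style, the constant names and the page string are assumptions made for this example. Keep in mind that regular expressions work well for simple pages like the one above, but unusual markup (for example, a closing script tag inside a JavaScript string) may need a real HTML parser.

```python
import re

# Compile once at import time so the patterns are reused on every call.
RE_SCRIPT = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)
RE_STYLE = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.S | re.I)

def remove_script_and_style(html):
    """Return html with script and style blocks removed (hypothetical helper)."""
    html = RE_SCRIPT.sub('', html)
    return RE_STYLE.sub('', html)

# Hypothetical page fragment with mixed-case tags and an empty script body.
page = "<p>Hello</p><STYLE>p{color:red}</STYLE><script src='a.js'></script><p>Bye</p>"
result = remove_script_and_style(page)
print(result)  # <p>Hello</p><p>Bye</p>
```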
