Removing JavaScript and CSS style code from a Python string is a common task after you have crawled a web page. In this tutorial, we will show how to remove both with Python regular expressions.
Import library
import re
Create a text that contains javascript and css style code
text = ''' this is a script test. <Script type="text/javascript"> alert('test') </script> test is end. <style> .MathJax, .MathJax_Message, .MathJax_Preview{ display: none } </style> '''
As you can see, the variable text contains some javascript and css style code.
Build regular expression to remove javascript code
re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)
Build regular expression to remove css style code
css_script = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.S | re.I)
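Here re.I makes the match case-insensitive, so tags such as <Script> or <SCRIPT> also match, and re.S lets . match newlines, so a block that spans several lines is matched as a whole. To learn more about re.I and re.S, you can read this tutorial: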
A Beginner’s Guide to Python Regular Expressions Flags – Python Regular Expression Tutorial
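As a quick illustration of why both flags are needed, here is a minimal sketch. The names sample and pattern and the sample string itself are made up for this example; the pattern is the script expression from this tutorial.

```python
import re

# Hypothetical sample: an uppercase tag whose body spans several lines.
sample = "<SCRIPT>\nalert('x')\n</SCRIPT>"
pattern = r'<\s*script[^>]*>.*?<\s*/\s*script\s*>'

no_flags = re.search(pattern, sample)                 # fails: case differs
only_i = re.search(pattern, sample, re.I)             # fails: . stops at the newline
only_s = re.search(pattern, sample, re.S)             # fails: case differs
both = re.search(pattern, sample, re.S | re.I)        # matches the whole block

print(no_flags, only_i, only_s)
print(both.group(0) == sample)
```

Only the combination re.S | re.I finds the block, which is why both patterns in this tutorial are compiled with it.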
Remove javascript and css style code
text = re_script.sub('', text)
text = css_script.sub('', text)
print(text)
Run this python script and you will find that both blocks are removed. The result is:
this is a script test. test is end.
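The steps above could be wrapped in one reusable function. The following is only a sketch: the name strip_script_and_style, the constants RE_SCRIPT and RE_STYLE, and the multi-line sample are assumptions made for illustration, and the final whitespace cleanup is an extra step that is not part of the code above.

```python
import re

# Same patterns as in this tutorial, written as raw strings.
RE_SCRIPT = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I)
RE_STYLE = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.S | re.I)

def strip_script_and_style(html):
    """Hypothetical helper: drop <script> and <style> blocks, then tidy whitespace."""
    html = RE_SCRIPT.sub('', html)
    html = RE_STYLE.sub('', html)
    # Assumption: the whitespace left behind can be collapsed to single spaces.
    return re.sub(r'\s+', ' ', html).strip()

# Hypothetical multi-line sample, similar to the text used above.
sample = '''this is a script test.
<Script type="text/javascript">
alert('test')
</script>
test is end.
<style>
.MathJax { display: none }
</style>
'''

cleaned = strip_script_and_style(sample)
print(cleaned)
```

Regular expressions work well for simple pages like this one. For badly formed HTML, a real HTML parser is likely to be more reliable.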