To crawl web page content correctly, you must know the charset of the content string. However, there are many charsets, such as utf-8, gbk and gb2312. In this tutorial, we will introduce a way to detect the charset of a content string using python.
The importance of detecting the charset of a content string
If you do not determine the charset, you may:
1.fail to convert a byte string to a string
Fix UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x8b in position 0 – Python Tutorial
2.fail to save a string to a file.
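For example, decoding bytes with the wrong charset raises a UnicodeDecodeError. Here is a minimal sketch (the sample text is our own):

```python
# gbk-encoded bytes cannot be decoded as utf-8
gbk_bytes = "中文".encode("gbk")
try:
    gbk_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("decode failed:", e)
```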
How to detect the charset type of web page
One of the most basic methods is to extract it from the web page source code.
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta data-rh="true" charset="utf-8"/>
Here in the html meta tag, there is a charset value for the page.
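One simple way to pull that value out is a regular expression over the raw html. This is only a sketch; the sample html string and the pattern are our own:

```python
import re

html = '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
# capture the value after charset=, with or without surrounding quotes
match = re.search(r'charset=["\']?([\w-]+)', html, re.IGNORECASE)
charset = match.group(1) if match else None
print(charset)  # UTF-8
```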
In this tutorial, we will use the http response object and the python chardet library to detect the string charset.
Preliminaries
Get an http response object: crawl_response
To get this object, you can read this article.
A Simple Guide to Use urllib to Crawl Web Page in Python 3 – Python Web Crawler Tutorial
Get http response message
message = crawl_response.info()
Get content charset
charset = message.get_content_charset(None)
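Putting the preliminaries together, the sketch below uses a data: URL in place of a real http:// page so it runs without network access; with a real crawl, the urlopen target would be the page url:

```python
from urllib.request import urlopen

# a data: URL stands in for a real web page so the example needs no network
crawl_response = urlopen("data:text/html;charset=utf-8,<p>hello</p>")
message = crawl_response.info()
charset = message.get_content_charset(None)
print(charset)  # utf-8
```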
However, this method may fail, so we should continue detecting.
if not charset:
    charsets = message.get_charsets(None)
    if charsets:
        charset = charsets[0]
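To see what get_charsets() returns, we can build a minimal message header by hand; the hand-built headers and the gbk value are our own stand-ins for crawl_response.info():

```python
from email.message import Message

# hand-built headers stand in for the real crawl_response.info()
message = Message()
message["Content-Type"] = 'text/html; charset="gbk"'
charsets = message.get_charsets(None)  # one entry per message part
print(charsets)  # ['gbk']
```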
However, message.get_charsets() may also fail if there is no meta charset tag in the html page. In this situation, we will use the chardet library to detect the charset.
if not charset:
    import chardet
    # content is the raw byte string read from the response
    result = chardet.detect(content)
    charset = result['encoding']
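For example, chardet.detect() takes a raw byte string and returns a dict with the guessed encoding and a confidence score (the sample text below is our own):

```python
import chardet

content = "你好，世界".encode("utf-8")  # raw bytes, e.g. from crawl_response.read()
result = chardet.detect(content)
charset = result["encoding"]
print(result)
```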
The chardet library can detect the most probable charset of a content string. However, it has two issues:
1.If the html page is gbk, it may return gb2312, which means its result may differ from message.get_content_charset(None)
2.It may also return None
So we should set the default charset value to utf-8.
if not charset:
    # default to utf-8
    charset = 'utf-8'
The full python detection code is here.
def detectCharset(crawl_response, content):
    # content is the raw byte string read from crawl_response
    charset = None
    message = crawl_response.info()
    charset = message.get_content_charset(None)
    if not charset:
        charsets = message.get_charsets(None)
        if charsets:
            charset = charsets[0]
    if not charset:
        import chardet
        result = chardet.detect(content)
        charset = result['encoding']
    if not charset:
        # default to utf-8
        charset = 'utf-8'
    return charset
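A self-contained variant of this function can be run offline by feeding it a data: URL instead of a real page; the snake_case name, the optional-import guard and the data: URL are our own choices for this sketch:

```python
from urllib.request import urlopen

def detect_charset(crawl_response, content):
    """Detect charset: response headers first, then chardet, then a utf-8 default."""
    message = crawl_response.info()
    charset = message.get_content_charset(None)
    if not charset:
        charsets = message.get_charsets(None)
        if charsets:
            charset = charsets[0]
    if not charset:
        try:
            import chardet  # optional third-party fallback
            charset = chardet.detect(content)["encoding"]
        except ImportError:
            charset = None
    return charset or "utf-8"  # default to utf-8

# a data: URL stands in for a real web page so the example needs no network
crawl_response = urlopen("data:text/html;charset=utf-8,<p>hello</p>")
content = crawl_response.read()
print(detect_charset(crawl_response, content))  # utf-8
```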