When we crawl a web page, we send an HTTP request and receive an HTTP response. If the crawl succeeds, the response contains the content of the page. In this tutorial, we will introduce how to get the string content of a web page.
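For example, we can send the request with urllib.request and get a response object to work with. This is a minimal sketch; the URL and request headers below are only for illustration:

    import urllib.request

    url = "https://www.example.com"  # hypothetical URL, replace with the page you want to crawl
    request = urllib.request.Request(
        url,
        headers={
            "User-Agent": "Mozilla/5.0",              # some sites reject the default Python user agent
            "Accept-Encoding": "gzip, deflate, br",   # tell the server which compressions we accept
        },
    )
    crawl_response = urllib.request.urlopen(request)
    print(crawl_response.status)  # 200 if the request succeeded

The rest of this tutorial assumes we already have such a crawl_response object.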
To get the string content from an HTTP response, we should:
1. Detect the content type of the web page
We only extract the string content of pages whose content type is text/html or text/plain (the header-inspection sketch after this list shows how to read these values).
2. Detect the charset of the web page
To detect the charset of a web page, we can refer to this tutorial:
Python Detect Web Page Content Charset Type – Python Web Crawler Tutorial
3. Decompress the content if it is compressed
For example, if the content of the page is compressed with br (Brotli), we can refer to this tutorial:
Understand Content-Encoding: br and Decompress String – Python Web Crawler Tutorial
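Before writing the full function, we can check these three values on a response object. This is a quick sketch, assuming crawl_response comes from urllib.request.urlopen as shown above:

    message = crawl_response.info()                      # response headers as an http.client.HTTPMessage
    print(message.get_content_type())                    # e.g. text/html or text/plain
    print(message.get_content_charset())                 # e.g. utf-8, or None if no charset is declared
    print(crawl_response.getheader("Content-Encoding"))  # e.g. gzip, br, or None if not compressed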
With these steps in place, we can define a function to get the string content of a web page.
def getcontent(crawl_response):
    # print(crawl_response.getheaders())
    content = crawl_response.read()
    encoding = crawl_response.getheader("Content-Encoding")

    # charset and content type
    message = crawl_response.info()
    content_type = message.get_content_type()
    if content_type != 'text/html':
        pass  # placeholder: here you could skip pages that are not text/html
    if not encoding:
        pass  # placeholder: nothing to decompress

    # decompress the content if it is compressed
    try:
        if encoding == 'br':
            import brotli
            content = brotli.decompress(content)
        if encoding == 'gzip':
            import gzip
            content = gzip.decompress(content)
    except Exception as e:
        print(e)

    # detect the charset
    charset = message.get_content_charset(None)
    if not charset:
        charset = message.get_charsets(None)
        if not charset:
            import chardet
            result = chardet.detect(content)
            charset = result['encoding']
        else:
            charset = charset[0]
    if not charset:
        # default to utf-8
        charset = 'utf-8'

    # decode the bytes to a string
    # print(content)
    content = content.decode(charset)
    if charset != 'utf-8':
        # convert to utf-8
        content = content.encode('utf-8', errors='ignore').decode("utf-8")
    return content
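Then we can call this function on a response object. Note that it imports the third-party brotli and chardet packages lazily, so they need to be installed if the page is br-compressed or declares no charset. For example, reusing the crawl_response from the request sketch above:

    html = getcontent(crawl_response)
    print(html[:200])  # the first 200 characters of the decoded page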