A Simple Guide to Get String Content from HTTP Response Header – Python Web Crawler Tutorial

By | August 10, 2019

When we crawl a web page, we send an HTTP request and receive an HTTP response, which consists of response headers and a body. If the request succeeds, the body holds the content of the web page. In this tutorial, we will introduce how to get the content of a web page as a string from the HTTP response.

To get the string content from an HTTP response, we should:

1. Detect the content type of the web page

We only extract string content from pages whose content type is text/html or text/plain.
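
As a minimal sketch of this check, the snippet below tests a Content-Type header against the two text types named above. The helper name is_text_response and the TEXT_TYPES set are illustrative assumptions, and the header object is built by hand so that the example runs without a network request.

```python
from email.message import Message

# Media types treated as text in this sketch (an assumption; extend as needed).
TEXT_TYPES = {"text/html", "text/plain"}

def is_text_response(headers):
    """Return True if the Content-Type header names a text media type."""
    # get_content_type() drops parameters such as "; charset=utf-8"
    # and returns the type in lower case.
    return headers.get_content_type() in TEXT_TYPES

# Build the headers by hand so the example needs no network access.
html_headers = Message()
html_headers["Content-Type"] = "Text/HTML; charset=UTF-8"

image_headers = Message()
image_headers["Content-Type"] = "image/png"

print(is_text_response(html_headers))   # a text/html page
print(is_text_response(image_headers))  # an image, not text
```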

2. Detect the charset of the web page

To detect the charset of a web page, we can refer to this tutorial:

Python Detect Web Page Content Charset Type – Python Web Crawler Tutorial
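
The header-based part of this step can be sketched as follows. This is only an illustration and the linked tutorial may use a different method; the helper name header_charset is an assumption, and the header object is built by hand instead of coming from a real response.

```python
from email.message import Message

def header_charset(headers, default=None):
    """Return the charset parameter of the Content-Type header, or default."""
    # get_content_charset() returns the value in lower case,
    # or the fallback when the header carries no charset parameter.
    return headers.get_content_charset(default)

with_charset = Message()
with_charset["Content-Type"] = "text/html; charset=GBK"

without_charset = Message()
without_charset["Content-Type"] = "text/html"

print(header_charset(with_charset))              # charset taken from the header
print(header_charset(without_charset, "utf-8"))  # falls back to the default

# Once the charset is known, the raw bytes can be decoded with it.
raw = "中文".encode("gbk")
text = raw.decode(header_charset(with_charset))
```

When the header carries no charset, a detection library such as chardet can be used on the raw bytes instead.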

3. Decompress the content if it is compressed

For example, if the content of the page is compressed with br (Brotli), we can refer to this tutorial:

Understand Content-Encoding: br and Decompress String – Python Web Crawler Tutorial
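
A minimal sketch of this step is shown below. The helper name decompress_body is an assumption. The gzip branch uses only the standard library, while the br branch assumes the third-party brotli package is installed and is imported only when it is needed.

```python
import gzip

def decompress_body(body, content_encoding):
    """Return body decompressed according to the Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "br":
        import brotli  # third-party package, imported only when required
        return brotli.decompress(body)
    # No Content-Encoding, or one this sketch does not handle:
    # return the bytes unchanged.
    return body

# Round trip with gzip so the example runs without a network request.
original = b"<html>hello</html>"
packed = gzip.compress(original)

print(decompress_body(packed, "gzip") == original)
print(decompress_body(original, None) == original)
```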

Then we can define a function to get the content of a web page.

def getcontent(crawl_response):
    """Return the body of an HTTP response as a str, or None if it is not text."""
    content = crawl_response.read()
    message = crawl_response.info()

    # Only text/html and text/plain responses are converted to a string.
    content_type = message.get_content_type()
    if content_type not in ('text/html', 'text/plain'):
        return None

    # Decompress the body if the server compressed it.
    encoding = (crawl_response.getheader("Content-Encoding") or '').lower()
    try:
        if encoding == 'br':
            import brotli  # third-party package
            content = brotli.decompress(content)
        elif encoding == 'gzip':
            import gzip
            content = gzip.decompress(content)
    except Exception as e:
        # If decompression fails, report it and continue with the raw bytes.
        print(e)

    # Prefer the charset from the Content-Type header.
    charset = message.get_content_charset(None)
    if not charset:
        # No charset in the header: guess it from the bytes.
        import chardet  # third-party package
        charset = chardet.detect(content)['encoding']
    if not charset:  # default to utf-8
        charset = 'utf-8'

    # decode() already returns a str, so no further utf-8 conversion is needed.
    # errors='replace' keeps a single bad byte from raising an exception.
    return content.decode(charset, errors='replace')
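
To try getcontent without a network request, a response object can be built by hand. The snippet below is only a sketch: the FakeSocket class is a hypothetical stand-in for a real socket, and the headers and body are made-up sample data. It shows the three calls the function relies on: getheader(), info() and read().

```python
import gzip
import http.client
import io

class FakeSocket:
    """Hypothetical stand-in for a socket: serves canned bytes to HTTPResponse."""
    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file

# Made-up sample page, gzip-compressed like a real server might send it.
body = gzip.compress("<p>héllo</p>".encode("utf-8"))
head = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Encoding: gzip\r\n"
    "Content-Length: {}\r\n"
    "\r\n"
).format(len(body)).encode("ascii")

resp = http.client.HTTPResponse(FakeSocket(head + body))
resp.begin()  # parse the status line and the headers

print(resp.getheader("Content-Encoding"))
print(resp.info().get_content_type())
print(resp.info().get_content_charset())

# The body is still unread here, so resp can be passed on:
# text = getcontent(resp)
```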
