Get Http Response Headers using Python - Python Web Crawler Tutorial

Reading HTTP response headers can help us diagnose errors when we crawl a site. You can view these headers in your browser.

An Easy Guide to Get HTTP Request Header List for Beginners – Python Web Crawler Tutorial

However, this approach does not suit a Python crawler application. In this tutorial, we will show you how to get HTTP response headers dynamically using Python.

Preliminaries

# -*- coding:utf-8 -*-
import urllib.request

Create an HTTP request object to open a URL

def getRequest(url, post_data= None):
    req = urllib.request.Request(url, data = post_data)

    req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8')
    req.add_header('Accept-Encoding', 'gzip, deflate, br')
    req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')
    req.add_header('Cache-Control', 'max-age=0')
    req.add_header('Referer', 'https://www.google.com/')
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
    return req

In this function, we set some HTTP request headers for our Python crawler.
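To check which request headers will actually be sent, you can inspect the request object before opening it. The following is a minimal sketch, not part of the original crawler: the URL and the User-Agent value are placeholders chosen for illustration, and no network connection is made.

```python
import urllib.request

# Placeholder URL and header values, used only for illustration
req = urllib.request.Request('https://www.example.com/')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')
req.add_header('User-Agent', 'Mozilla/5.0 (sketch)')

# urllib stores header names with str.capitalize(),
# so 'User-Agent' is kept as 'User-agent'
for name, value in req.header_items():
    print(name, '=', value)

user_agent = req.get_header('User-agent')
print(user_agent)
```

Note that the lookup key is 'User-agent', not 'User-Agent', because of how urllib normalizes the names it stores.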

Crawl a site url

crawl_url = 'https://www.outlook.com'
crawl_req = getRequest(crawl_url)
crawl_response = urllib.request.urlopen(crawl_req)

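Get the HTTP response code

We only read the HTTP response headers when the response code is 200.

# getcode() returns the HTTP status code of the response
crawl_response_code = crawl_response.getcode()
if crawl_response_code == 200:
        headers = crawl_response.getheaders()
        print(headers)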
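Keep in mind that urllib.request.urlopen() raises urllib.error.HTTPError for 4xx and 5xx responses instead of returning them, so a check for 200 alone never sees the headers of a failed request. The sketch below shows one way to read the status code and headers in both cases. The function name fetch_status_and_headers and the timeout value are assumptions for illustration.

```python
import urllib.error
import urllib.request

def fetch_status_and_headers(req, timeout=10):
    """Return (status_code, headers) for successful and HTTP error responses.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.getcode(), response.getheaders()
    except urllib.error.HTTPError as err:
        # HTTPError also carries the status code and the response headers
        return err.code, list(err.headers.items())
```

Connection failures such as DNS errors raise urllib.error.URLError without any response headers, so this sketch deliberately lets them propagate to the caller.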

The HTTP response headers are:

[('Content-Type', 'text/html; charset=UTF-8'), ('Link', '<https://www.tutorialexample.com/wp-json/>; rel="https://api.w.org/"'), ('Set-Cookie', 'cookielawinfo-checkbox-necessary=yes; expires=Thu, 18-Jul-2019 02:02:58 GMT; Max-Age=3600; path=/'), ('Set-Cookie', 'cookielawinfo-checkbox-non-necessary=yes; expires=Thu, 18-Jul-2019 02:02:58 GMT; Max-Age=3600; path=/'), ('Transfer-Encoding', 'chunked'), ('Content-Encoding', 'br'), ('Vary', 'Accept-Encoding'), ('Date', 'Thu, 18 Jul 2019 01:02:58 GMT'), ('Server', 'LiteSpeed'), ('Alt-Svc', 'quic=":443"; ma=2592000; v="35,39,43,44"'), ('Connection', 'close')]
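The list above contains Set-Cookie twice, so converting it directly to a dict would silently drop one of the values. A small sketch of grouping repeated headers by name follows. The helper group_headers and the shortened sample values are made up for illustration.

```python
from collections import defaultdict

def group_headers(header_pairs):
    """Group (name, value) pairs by lower-cased header name, keeping repeats."""
    grouped = defaultdict(list)
    for name, value in header_pairs:
        grouped[name.lower()].append(value)
    return dict(grouped)

# Shortened, hypothetical sample in the same shape as getheaders() output
sample = [
    ('Content-Type', 'text/html; charset=UTF-8'),
    ('Set-Cookie', 'a=1; path=/'),
    ('Set-Cookie', 'b=2; path=/'),
]
grouped = group_headers(sample)
print(grouped['set-cookie'])
```

Header names are case-insensitive in HTTP, which is why the keys are lower-cased before grouping.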

Get Content-Encoding header

        # getheader() returns None when the header is absent
        encoding = crawl_response.getheader('Content-Encoding')
        print("Content-Encoding=" + str(encoding))

The result is:

Content-Encoding=br

To get the content of a web page, you need to decode the response body according to its Content-Encoding.
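As a rough sketch of that idea, the function below picks a decompressor from the Content-Encoding value. gzip and deflate are handled by the standard library. The br branch assumes the third-party brotli package is installed, which this tutorial does not otherwise require. The name decode_body is hypothetical.

```python
import gzip
import zlib

def decode_body(raw, content_encoding):
    """Decompress raw response bytes according to a Content-Encoding value."""
    encoding = (content_encoding or 'identity').strip().lower()
    if encoding == 'identity':
        return raw
    if encoding in ('gzip', 'x-gzip'):
        return gzip.decompress(raw)
    if encoding == 'deflate':
        try:
            # Standard case: zlib-wrapped deflate data
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send raw deflate data without the zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    if encoding == 'br':
        # Assumption: the third-party 'brotli' package is available
        import brotli
        return brotli.decompress(raw)
    raise ValueError('Unsupported Content-Encoding: ' + encoding)
```

With a real response you would call it as decode_body(crawl_response.read(), crawl_response.getheader('Content-Encoding')) and then decode the returned bytes using the charset from the Content-Type header.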

Understand Content-Encoding: br and Decompress String – Python Web Crawler Tutorial
