HTTP response headers can help us diagnose errors when we are crawling a site. You can view these headers in your browser.
An Easy Guide to Get HTTP Request Header List for Beginners – Python Web Crawler Tutorial
However, that approach does not suit a Python crawler application. In this tutorial, we will show you how to get HTTP response headers dynamically using Python.
Preliminaries
# -*- coding:utf-8 -*-
import urllib.request
Create an HTTP request object to open a URL
def getRequest(url, post_data=None):
    # Build a request that carries browser-like headers
    req = urllib.request.Request(url, data=post_data)
    req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8')
    req.add_header('Accept-Encoding', 'gzip, deflate, br')
    req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')
    req.add_header('Cache-Control', 'max-age=0')
    req.add_header('Referer', 'https://www.google.com/')
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
    return req
In this function, we set some HTTP request headers for our Python crawler.
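To check which headers have been set on a request before sending it, you can inspect the request object. The following is a minimal sketch; the URL and the 'my-crawler/0.1' User-Agent value are placeholders chosen for illustration, not values from this tutorial. Note that urllib stores header names with only the first letter capitalized.

```python
import urllib.request

# Hypothetical request, built the same way as in getRequest above
req = urllib.request.Request('https://www.example.com')
req.add_header('User-Agent', 'my-crawler/0.1')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')

# header_items() returns the headers set on the request so far
sent_headers = dict(req.header_items())
print(sent_headers)

# add_header() normalizes names with str.capitalize(),
# so 'User-Agent' is stored as 'User-agent'
print(sent_headers['User-agent'])
```

No network connection is made here; the request is only constructed and inspected.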
Crawl a site URL
crawl_url = 'https://www.outlook.com'
crawl_req = getRequest(crawl_url)
crawl_response = urllib.request.urlopen(crawl_req)
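Keep in mind that urllib.request.urlopen raises urllib.error.HTTPError for error statuses such as 404 or 500, so the call above does not return a response object in those cases. The sketch below shows one possible way to still read the status code and headers; the function name fetch_headers and its opener parameter are hypothetical helpers introduced here for illustration.

```python
import urllib.error
import urllib.request

def fetch_headers(req, opener=urllib.request.urlopen):
    """Return (status_code, header_list) for req, including error statuses.

    `opener` is a hypothetical hook that defaults to urlopen; it lets the
    function be exercised without a network connection.
    """
    try:
        with opener(req) as response:
            return response.status, response.getheaders()
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code and the response headers
        return err.code, list(err.headers.items())
```

With the request from this tutorial, it could be called as fetch_headers(getRequest(crawl_url)).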
Get the HTTP response code.
crawl_response_code = crawl_response.getcode()
We only read the HTTP response headers when the response code is 200.
if crawl_response_code == 200:
    headers = crawl_response.getheaders()
    print(headers)
The HTTP response headers are:
[('Content-Type', 'text/html; charset=UTF-8'), ('Link', '<https://www.tutorialexample.com/wp-json/>; rel="https://api.w.org/"'), ('Set-Cookie', 'cookielawinfo-checkbox-necessary=yes; expires=Thu, 18-Jul-2019 02:02:58 GMT; Max-Age=3600; path=/'), ('Set-Cookie', 'cookielawinfo-checkbox-non-necessary=yes; expires=Thu, 18-Jul-2019 02:02:58 GMT; Max-Age=3600; path=/'), ('Transfer-Encoding', 'chunked'), ('Content-Encoding', 'br'), ('Vary', 'Accept-Encoding'), ('Date', 'Thu, 18 Jul 2019 01:02:58 GMT'), ('Server', 'LiteSpeed'), ('Alt-Svc', 'quic=":443"; ma=2592000; v="35,39,43,44"'), ('Connection', 'close')]
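The list above contains Set-Cookie twice, so converting it directly to a dict would keep only one of the two values. A minimal sketch of one way to keep every value is shown below; group_headers is a hypothetical helper, and the sample list is an abbreviated version of the output above.

```python
from collections import defaultdict

def group_headers(header_list):
    """Group (name, value) pairs by lower-cased name, keeping every value."""
    grouped = defaultdict(list)
    for name, value in header_list:
        grouped[name.lower()].append(value)
    return dict(grouped)

# Abbreviated sample in the same shape as getheaders() output
sample = [
    ('Content-Type', 'text/html; charset=UTF-8'),
    ('Set-Cookie', 'cookielawinfo-checkbox-necessary=yes; path=/'),
    ('Set-Cookie', 'cookielawinfo-checkbox-non-necessary=yes; path=/'),
]
grouped = group_headers(sample)
print(grouped['set-cookie'])
```

Lower-casing the names is a design choice made here because HTTP header names are case-insensitive.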
Get the Content-Encoding header
encoding = crawl_response.getheader('Content-Encoding')
print("Content-Encoding=" + encoding)
The result is:
Content-Encoding=br
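If a server does not send a Content-Encoding header, getheader returns None and the string concatenation above fails with a TypeError. The sketch below shows one way to guard against that by supplying a default. It uses email.message.Message as a stand-in for crawl_response.headers so that it runs without a network connection; the function name describe_encoding and the default label 'identity' are assumptions made for this example.

```python
from email.message import Message

def describe_encoding(headers):
    """Format the Content-Encoding line from a headers object.

    `headers` is a Message-like object such as response.headers.
    A missing header is reported as 'identity' (an assumed label
    meaning that no encoding was applied).
    """
    encoding = headers.get('Content-Encoding', 'identity')
    return "Content-Encoding=" + encoding

with_br = Message()
with_br['Content-Encoding'] = 'br'
print(describe_encoding(with_br))
print(describe_encoding(Message()))
```

On a live response, crawl_response.getheader('Content-Encoding', 'identity') gives the same kind of fallback.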
To get the content of a web page, you must decode the response body according to its Content-Encoding.
Understand Content-Encoding: br and Decompress String – Python Web Crawler Tutorial
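As a rough sketch of decoding by Content-Encoding, the function below handles gzip and deflate with the standard library. The br branch assumes the third-party brotli package is installed, which this tutorial does not confirm; decode_body is a hypothetical helper name.

```python
import gzip
import zlib

def decode_body(body, content_encoding):
    """Decompress raw response bytes according to a Content-Encoding value."""
    encoding = (content_encoding or 'identity').strip().lower()
    if encoding == 'identity':
        return body
    if encoding == 'gzip':
        return gzip.decompress(body)
    if encoding == 'deflate':
        try:
            # Most servers send zlib-wrapped deflate data
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header
            return zlib.decompress(body, -zlib.MAX_WBITS)
    if encoding == 'br':
        # Assumption: the third-party 'brotli' package is available
        import brotli
        return brotli.decompress(body)
    raise ValueError('Unsupported Content-Encoding: ' + encoding)
```

With the response from this tutorial, it could be used as decode_body(crawl_response.read(), crawl_response.getheader('Content-Encoding')).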