To crawl web page using python, you should know what is http request header. In this tutorial, we simply introduce it and you can learn and set them in your python application.
What is http request header?
Generally speaking, http requestion header are some messages which are sent to web servers. Web servers will check them and implement different process.
For example, some web severs will check the user-agent header, if your application does not send it to server, the server may refuse your request and you will not get web page data.
What headers we shoud use?
The simple way to know what http request header you can use is to open your browser. and press F12, then open a site, such as google.com.
You will find some http request header in your browser.
Here we list some common used headers.
Name | Value |
Accept | text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8 |
Accept-Encoding | gzip, deflate, br |
Accept-Language | en-US |
Cache-Control | no-cache |
Cookie | get and save it |
Host | such as tutorialexample.com |
Referer | such as https://www.tutorialexample.com |
User-Agent | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 |