A Simple Guide to Download Files with Python 3.x – Python Web Crawler Tutorial

September 15, 2019

In this tutorial, we will introduce how to download files with Python 3.x. There are some problems you should notice along the way; read this tutorial to learn how to download files correctly.

Import libraries

import urllib.request
import urllib.parse
import urllib.error
import http.cookiejar
import os
import time
import random
import socket

Set socket default timeout

download_max_time = float(30)
socket.setdefaulttimeout(download_max_time)

Here you should set the socket default timeout; in the code above, we set it to 30 seconds. If you do not set it, urllib.request.urlretrieve() may wait for a long time without any response.

For more detail, you can read: Best Practice to Set Timeout for Python urllib.request.urlretrieve() – Python Web Crawler Tutorial
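To confirm the timeout actually takes effect, you can point urlretrieve() at a deliberately slow endpoint. Here is a minimal sketch, assuming the httpbin.org/delay endpoint is reachable (the URL and short timeout are only for demonstration):

import socket
import urllib.request
import urllib.error

socket.setdefaulttimeout(5.0)  # short timeout just for this demonstration

try:
    # httpbin.org/delay/10 waits 10 seconds before responding,
    # so the 5 second socket timeout should fire first
    urllib.request.urlretrieve("https://httpbin.org/delay/10", "delay.json")
except socket.timeout as e:
    print("socket timeout:", e)
except urllib.error.URLError as e:
    print("url error (may wrap a timeout):", e)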

Get the host of the download URL

def getRootURL(url):
    # parse the url and keep only the scheme and host, e.g. https://example.com
    url_info = urllib.parse.urlparse(url)
    host = url_info.scheme + "://" + url_info.netloc
    return host

Some websites may restrict requests based on the Referer header, so we extract the host of the download URL and use it as the Referer later.
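For example, calling the helper on a full download link returns just the scheme and host (the URL here is only an illustration):

print(getRootURL("https://www.example.com/files/data.zip"))
# prints: https://www.example.com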

Create an opener with cookies

def getRequestOpener(url):
    cookie = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

    headers = []
    headers.append(('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'))
    # urlretrieve() does not decompress responses, so ask for an uncompressed transfer
    headers.append(('Accept-Encoding', 'identity'))
    headers.append(('Accept-Language', 'zh-CN,zh;q=0.9'))
    headers.append(('Cache-Control', 'max-age=0'))
    headers.append(('Referer', getRootURL(url)))
    headers.append(('User-Agent', 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'))

    opener.addheaders = headers

    return opener

Some websites may check cookies, which is why we attach an http.cookiejar.CookieJar to the opener to keep them across requests.
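If you want to see which cookies a site sets before you download from it, you can make an initial request through an opener built the same way and inspect the jar. A minimal sketch, with an illustrative URL:

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

# the first request lets the server store its cookies in the jar
with opener.open("https://www.example.com/") as response:
    response.read()

for c in cookie:
    print(c.name, "=", c.value)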

Install opener

opener = getRequestOpener(url)
urllib.request.install_opener(opener)
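urllib.request.urlretrieve() always uses the globally installed opener, so this step is what makes the headers and cookies above apply to the download. If you prefer to read the response into memory instead of saving it straight to disk, you can also call the opener directly; a minimal sketch, where the URL is only a placeholder:

# read the file into memory through the same opener
with opener.open("https://www.example.com/files/data.zip") as response:
    data = response.read()
    print(len(data), "bytes,", response.getheader("Content-Type"))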

Download file from URL

try:
    local_file, response_headers = urllib.request.urlretrieve(url, local_filename)
    file_content_type = response_headers.get_content_type()
    print(file_content_type)
except urllib.error.ContentTooShortError as shortError:
    print(shortError)
    print("content too short error")
except urllib.error.HTTPError as e:
    error_code = e.code
    print(e)
    if error_code >= 400:  # client or server error, such as 403, 404 or 500
        print("fail to download!")
except urllib.error.URLError as ue:  # such as a timeout wrapped by urllib
    print(ue)
    print("fail to download!")
except socket.timeout as se:
    print(se)
    print("socket timeout")
except Exception as ee:
    print(ee)

In this code, you should notice these exceptions and know how to process them when they occur.
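Putting all the steps above together, a minimal end-to-end sketch could look like this; downloadFile() is a helper name introduced here for illustration, and the URL and filename are placeholders:

def downloadFile(url, local_filename):
    # build and install an opener that carries our headers and cookie jar
    opener = getRequestOpener(url)
    urllib.request.install_opener(opener)

    try:
        local_file, response_headers = urllib.request.urlretrieve(url, local_filename)
        print("saved to:", local_file)
        print("content type:", response_headers.get_content_type())
        return True
    except (urllib.error.ContentTooShortError, urllib.error.HTTPError,
            urllib.error.URLError, socket.timeout) as e:
        print("fail to download:", e)
        return False

if __name__ == "__main__":
    downloadFile("https://www.example.com/files/data.zip", "data.zip")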
