In this tutorial, we show how to download files with Python 3.x. There are a few problems to watch for along the way; the steps below explain how to download files correctly.
Import libraries
import urllib.request
import urllib.parse
import urllib.error
import http.cookiejar
import os
import time
import random
import socket
Set socket default timeout
download_max_time = float(30)
socket.setdefaulttimeout(download_max_time)
You should set the socket default timeout here; in the code above, we set it to 30 seconds. If you do not set it, urllib.request.urlretrieve() may wait a long time without any response.
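As a minimal sketch of the same idea, the default timeout can also be set only for the duration of a download and restored afterwards, so other code in the same process is not affected. The helper name default_timeout is hypothetical and not part of the original code.

```python
import socket
from contextlib import contextmanager


@contextmanager
def default_timeout(seconds):
    """Temporarily set the global socket timeout (hypothetical helper)."""
    previous = socket.getdefaulttimeout()  # None means "no timeout"
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore whatever timeout was active before.
        socket.setdefaulttimeout(previous)


with default_timeout(30.0):
    # Sockets created in this block, including the ones urlretrieve()
    # opens, give up after 30 seconds instead of waiting forever.
    inside = socket.getdefaulttimeout()
```

Because socket.setdefaulttimeout() is global, this only helps when downloads run one at a time; with threads, all of them share the same value.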
Get the host of the download URL
def getRootURL(url):
    url_info = urllib.parse.urlparse(url)
    #print(url_info)
    host = url_info.scheme + "://" + url_info.netloc
    return host
Some websites restrict requests by referrer, so we will send this root URL as the Referer header.
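To make the result concrete, here is a small sketch of what urllib.parse.urlparse() returns for a sample address. The function name get_root_url and the example.com URL are placeholders for illustration; the logic mirrors getRootURL above.

```python
import urllib.parse


def get_root_url(url):
    """Return scheme://host[:port] for a URL (mirrors getRootURL above)."""
    parts = urllib.parse.urlparse(url)
    return parts.scheme + "://" + parts.netloc


# example.com is a placeholder URL used only for illustration.
referer = get_root_url("https://example.com:8080/files/report.pdf?id=7")
print(referer)  # https://example.com:8080
```

Note that netloc keeps the port (and any user:password part), while the path and query string are dropped.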
Create an opener with cookies
def getRequestOpener(url):
    # Store cookies sent by the server and send them back on later requests
    cookie = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
    # Browser-like request headers
    headers = []
    headers.append(('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'))
    # Note: urllib does not decompress responses itself; a compressed reply is saved as-is
    headers.append(('Accept-Encoding', 'gzip, deflate, br'))
    headers.append(('Accept-Language', 'zh-CN,zh;q=0.9'))
    headers.append(('Cache-Control', 'max-age=0'))
    # Use the site root as the referrer
    headers.append(('Referer', getRootURL(url)))
    headers.append(('User-Agent', 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'))
    opener.addheaders = headers
    return opener
Some websites check cookies, so the opener keeps them in a CookieJar.
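If you want to see which cookies a site actually set, one option is to keep a reference to the jar outside the opener. The following is a minimal sketch under that assumption; the variable names are illustrative and no request is sent.

```python
import http.cookiejar
import urllib.request

# Keep a reference to the jar so cookies set by the server can be inspected later.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# No request has been made yet, so the jar starts empty.
cookie_count = len(cookie_jar)

# After a request made through this opener, each stored cookie
# can be listed like this (the loop body does not run while the jar is empty):
for cookie in cookie_jar:
    print(cookie.name, cookie.value, cookie.domain)
```

The same jar is reused for every request made through the opener, which is what lets a site recognize the second request as coming from the same session.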
Install opener
opener = getRequestOpener(url)
urllib.request.install_opener(opener)
Download file from url
try:
    local_file, response_headers = urllib.request.urlretrieve(url, local_filename, None)
    file_content_type = response_headers.get_content_type()
    print(file_content_type)
except urllib.error.ContentTooShortError as shortError:
    # Fewer bytes arrived than the Content-Length header promised
    print(shortError)
    print("content too short error")
except urllib.error.HTTPError as e:
    error_code = e.code
    print(e)
    if error_code >= 403:  # 403, 404 Not Found, and all 5xx server errors
        print("\n")
        print("fail to download!")
except urllib.error.URLError as ue:
    # such as timeout
    print(ue)
    print("fail to download!")
except socket.timeout as se:
    print(se)
    print("socket timeout")
except Exception as ee:
    print(ee)
In this code, pay attention to each of these exceptions and know how to handle it when it occurs.
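One possible way to handle these exceptions is to retry failures that may be temporary and give up on the ones that will not change. The sketch below is an assumption about a reasonable policy, not part of the original code: the helper name download_with_retry, the retry count, and the 1 to 3 second pause are all hypothetical. The fetch and sleep parameters exist only so the function can be exercised without a network.

```python
import random
import socket
import time
import urllib.error
import urllib.request


def download_with_retry(url, filename, retries=3,
                        fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Try a download several times; return True on success (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        try:
            fetch(url, filename)
            return True
        except urllib.error.HTTPError as e:
            # HTTPError must be caught before URLError, its parent class.
            if 400 <= e.code < 500:
                # Client errors such as 403 or 404 will not go away on retry.
                print("giving up:", e)
                return False
            print("server error, will retry:", e)
        except (urllib.error.URLError, socket.timeout) as e:
            # ContentTooShortError is a URLError too, so partial downloads land here.
            print("attempt", attempt, "failed:", e)
        if attempt < retries:
            # Assumed back-off policy: wait 1 to 3 seconds between attempts.
            sleep(random.uniform(1, 3))
    return False
```

A call such as download_with_retry(url, local_filename) would then replace the single urlretrieve() call, with the opener installed beforehand as shown earlier.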