If you plan to build a Python website spider, you need to extract URLs from page content or from an XML sitemap. In this tutorial, we will introduce how to extract these URLs for your website spider.
1. Extract URLs from page content
Page content is a string, so we can extract URLs from it with a regular expression. Here is a tutorial:
A Simple Guide to Extract URLs From Python String – Python Regular Expression Tutorial
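As a rough illustration of the idea, here is a minimal sketch that pulls absolute URLs out of a page string with Python's re module. The pattern, the function name extract_urls, and the example.com sample page are assumptions made for this example, not the method from the linked tutorial; a real spider would likely need a stricter pattern or an HTML parser.

```python
import re

# Assumed pattern: an http/https URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(page_content):
    """Return the unique URLs found in a page string, in order of first appearance."""
    seen = set()
    urls = []
    for url in URL_PATTERN.findall(page_content):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# Hypothetical page content with one duplicated link.
sample_page = (
    '<a href="https://www.example.com/a.html">A</a> '
    '<a href="https://www.example.com/b.html">B</a> '
    '<a href="https://www.example.com/a.html">A again</a>'
)
print(extract_urls(sample_page))
```

Removing duplicates while keeping the original order is a small design choice here, so the spider does not queue the same page twice.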
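2. Extract URLs from XML sitemap
We often use an XML sitemap file to manage our website URLs, and it is also a good way to submit our website links to Google Webmaster Tools. To spider these URLs, we can parse the XML sitemap file and read the URLs from it.
An XML sitemap file has a urlset root element that contains one url element per page, and each url element holds the page address in a loc child element.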
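To make the layout concrete, the sketch below embeds a small sitemap as a Python string and checks its structure. The content is hypothetical: the example.com addresses and the lastmod dates are placeholders that follow the common sitemap layout, not the author's actual file.

```python
from xml.dom.minidom import parseString

# Hypothetical sitemap content; the URLs and dates are placeholders.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/post-1/</loc>
    <lastmod>2019-08-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/post-2/</loc>
    <lastmod>2019-08-02</lastmod>
  </url>
</urlset>
"""

# parseString() parses XML held in memory instead of reading it from a file.
dom = parseString(SAMPLE_SITEMAP)
url_nodes = dom.documentElement.getElementsByTagName("url")
print(dom.documentElement.nodeName, len(url_nodes))
```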
To parse it, we can follow the steps below.
Import the XML parser library
We use Python's xml.dom.minidom package to parse the XML sitemap file.
from xml.dom.minidom import parse
import xml.dom.minidom
Load the XML sitemap file
We use xml.dom.minidom.parse() to open the XML file and parse it into a DOM tree.
xml_file = r'sitemap/post.xml'
DOMTree = xml.dom.minidom.parse(xml_file)
Get the root node of the XML file
We should get the root node of the XML file first; from there we can reach its child nodes easily.
root_node = DOMTree.documentElement
print(root_node.nodeName)
The root node of an XML sitemap is: urlset
Get all URLs in the XML sitemap
We can get the URLs by collecting the loc nodes under the root node. Here is an example.
loc_nodes = root_node.getElementsByTagName("loc")
for loc in loc_nodes:
    print(loc.childNodes[0].data)
Notice: we should use loc.childNodes[0].data to read the URL, because the text inside a loc node is stored as a child text node, not on the loc node itself.
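The steps above can be combined into one helper. This is a minimal sketch, not a definitive implementation: the function name extract_sitemap_urls and the sample sitemap are hypothetical. It also guards against a case the loop above does not handle: an empty loc node has no child nodes, so loc.childNodes[0] would raise an IndexError.

```python
from xml.dom.minidom import parseString

def extract_sitemap_urls(dom):
    """Collect the text of every loc element in a parsed sitemap document."""
    urls = []
    for loc in dom.documentElement.getElementsByTagName("loc"):
        # Join all text children and ignore non-text nodes such as comments.
        text = "".join(
            child.data
            for child in loc.childNodes
            if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
        ).strip()
        if text:  # skip empty loc elements instead of failing on them
            urls.append(text)
    return urls

# Hypothetical sitemap with padded whitespace and one empty loc element.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>
    https://www.example.com/post-1/
  </loc></url>
  <url><loc></loc></url>
  <url><loc>https://www.example.com/post-2/</loc></url>
</urlset>"""

print(extract_sitemap_urls(parseString(sample)))
```

For a sitemap stored on disk, the same function should accept the document returned by xml.dom.minidom.parse(xml_file).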