When you crawl sites with Python, one thing you must do is extract URLs from HTML text. You can use BeautifulSoup to extract href values; however, in this tutorial, we will introduce how to extract URLs with a Python regular expression, which is typically much faster than parsing the page with BeautifulSoup.
If all URLs in the text are absolute, you can read this tutorial to extract them.
A Simple Guide to Extract URLs From Python String – Python Regular Expression Tutorial
However, not every URL in an HTML page or text is absolute. In that situation, the approach in the tutorial above will not work.
To extract all absolute and relative URLs from HTML or text, you can refer to this example.
Import library
import re
Create an HTML text that contains absolute and relative URLs
text = ''' You can read articles <a href="https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/"> <a href="best-practice-to-calculate-cosine-distance-between-two-vectors-in-numpy-numpy-tutorial/"> </a> '''
Replace all single quotes (') with double quotes (")
text = text.replace('\'', '"')
This step is needed because <a href='***'> is also valid in an HTML page.
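Replacing quotes changes every apostrophe in the text, not only the ones around attribute values. If you would rather leave the text untouched, one possible alternative is a pattern that accepts either quote style and uses a backreference to require a matching closing quote. This is only a sketch: the sample string and its URLs are made up for illustration.

```python
import re

# Hypothetical sample containing both quoting styles (not from the tutorial).
sample = '''<a href="https://example.com/a/">A</a> <a href='b/page.html'>B</a>'''

# Group 1 captures the opening quote; \1 requires the same quote to close the value.
# Group 2 captures the href value itself.
pattern = r'''href\s*=\s*(["'])(.*?)\1'''

urls = [m.group(2) for m in re.finditer(pattern, sample, re.I)]
print(urls)
```

Here re.finditer is used instead of re.findall because the pattern has two groups and only the second one holds the URL.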
Extract all href values (URLs) from the text
pattern = 'href[ ]{0,1}=[ ]{0,1}"([^\"]{0,})"'
matcher = re.findall(pattern, text, re.I)
print(matcher)
Run this code and you will get URLs like:
['https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/', 'best-practice-to-calculate-cosine-distance-between-two-vectors-in-numpy-numpy-tutorial/']
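A crawler usually cannot request a relative URL as it stands. As a rough sketch of a follow-up step, the standard library function urllib.parse.urljoin can resolve each extracted value against the address of the page it came from. The base_url below is an assumption made for this example; in practice it is the URL of the page you downloaded.

```python
from urllib.parse import urljoin

# Assumed address of the page the HTML was downloaded from (illustrative only).
base_url = "https://www.tutorialexample.com/"

# The two href values extracted in the example above.
urls = [
    "https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/",
    "best-practice-to-calculate-cosine-distance-between-two-vectors-in-numpy-numpy-tutorial/",
]

# urljoin leaves absolute URLs unchanged and resolves relative ones against base_url.
absolute_urls = [urljoin(base_url, u) for u in urls]
print(absolute_urls)
```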