Tutorial Example

Extract Link href Values (URLs) Using Python Regular Expressions – Python Regular Expression Tutorial

When you use Python to crawl websites, you often need to extract URLs from HTML text. You can use BeautifulSoup to extract href values. In this tutorial, however, we show how to extract URLs with a Python regular expression, which is usually faster than parsing the page with BeautifulSoup, although it is less tolerant of irregular HTML.

If all URLs in the text are absolute, you can follow this tutorial to extract them:

A Simple Guide to Extract URLs From Python String – Python Regular Expression Tutorial

However, not every URL in HTML or plain text is absolute. In that case, the method in the tutorial above will not work.

To extract both absolute and relative URLs from HTML or text, you can follow this example.

Import the library

import re

Create an HTML text that contains absolute and relative URLs

text = '''
You can read articles <a href="https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/">
<a href="best-practice-to-calculate-cosine-distance-between-two-vectors-in-numpy-numpy-tutorial/"> </a>
'''

Replace all ' with "

text = text.replace('\'', '"')

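This step is needed because <a href='***'> is also valid in an HTML page, and the pattern used in the next step only matches double-quoted values. Note that the replacement also changes every other apostrophe in the text.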
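The blanket replacement above also rewrites apostrophes in visible text and inside URLs. As an alternative, here is a minimal sketch that accepts either quote style directly in the pattern by using a backreference, so no replacement is needed. The names HREF_PATTERN, extract_hrefs and sample are hypothetical and only for illustration. They are not part of the original example.

```python
import re

# Hypothetical names for illustration; not part of the original tutorial.
# (["']) captures the opening quote, (.*?) lazily captures the value,
# and \1 requires the same quote character to close it.
HREF_PATTERN = re.compile(r'''href\s*=\s*(["'])(.*?)\1''', re.I | re.S)

def extract_hrefs(html):
    """Return href values quoted with either ' or "."""
    return [m.group(2) for m in HREF_PATTERN.finditer(html)]

# Assumed sample input: one single-quoted and one double-quoted href,
# plus an apostrophe in the visible text that must stay untouched.
sample = '''<p>Don't miss <a href='a/b.html'>this</a> and <a href="https://example.com/c">that</a>.</p>'''
print(extract_hrefs(sample))
```

This sketch still does not handle unquoted attribute values such as href=page.html, which HTML also allows.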

Extract all href values (URLs) from the text

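# Match href="..." (at most one space on each side of "=") and capture the quoted value
pattern = r'href ?= ?"([^"]*)"'
# re.I makes the match case-insensitive, so HREF="..." is found too
urls = re.findall(pattern, text, re.I)
print(urls)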

Run this code, and you will get URLs like:

['https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/', 'best-practice-to-calculate-cosine-distance-between-two-vectors-in-numpy-numpy-tutorial/']
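The second value in the result is a relative URL, so a crawler cannot request it as it stands. One possible way to turn it into an absolute URL is urljoin from the standard library. The sketch below assumes the page was fetched from https://www.tutorialexample.com/. That base_url is an illustrative assumption, so replace it with the address of the page you actually crawled.

```python
from urllib.parse import urljoin

# Assumed address of the page the HTML came from (illustrative only).
base_url = 'https://www.tutorialexample.com/'

# The two href values extracted above.
urls = [
    'https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/',
    'best-practice-to-calculate-cosine-distance-between-two-vectors-in-numpy-numpy-tutorial/',
]

# urljoin leaves absolute URLs unchanged and resolves relative ones against base_url.
absolute_urls = [urljoin(base_url, u) for u in urls]
for u in absolute_urls:
    print(u)
```

With this base, the first URL is returned as is, and the second is resolved under https://www.tutorialexample.com/.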