A Simple Guide to Extract URLs From Python String - Python Regular Expression Tutorial

Extracting all URLs from a Python string is a common task in the NLP field and can make it easier to crawl web pages. In this tutorial, we will introduce how to extract URLs from a Python string.

Preliminaries

import re

Create a python string

text = 'You can read this article <a href="https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/"> in https://www.tutorialexample.com'

You can also read a Python string from a file or a URL.
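As a minimal sketch of that idea, the two helpers below load text from a local file or from a web page using only the standard library. The function names read_text_from_file and read_text_from_url are hypothetical, and the UTF-8 fallback is an assumption about the data rather than something this tutorial specifies.

```python
from urllib.request import urlopen


def read_text_from_file(path, encoding="utf-8"):
    """Return the whole content of a local text file as one string."""
    with open(path, encoding=encoding) as f:
        return f.read()


def read_text_from_url(url, timeout=10):
    """Download a page and decode it to a string."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server; fall back to UTF-8 (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Either function returns a plain string that can be used in place of text in the steps that follow.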

Create a regex to extract URLs

urls = re.findall(r'(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?', text)

Output

[('https', 'www.tutorialexample.com', '/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/'), ('https', 'www.tutorialexample.com', '')]
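re.findall() returns tuples here because the pattern contains three capture groups (scheme, host, and path). If you want each URL as a single string, one possible approach is to use re.finditer() and take the whole match with group(0). The sketch below reuses the same pattern and sample text; the names URL_PATTERN and full_urls are only illustrative.

```python
import re

# Same pattern as above, compiled once so it can be reused.
URL_PATTERN = re.compile(
    r'(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))'
    r'([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?'
)

text = 'You can read this article <a href="https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/"> in https://www.tutorialexample.com'

# group(0) is the entire matched URL, not the individual capture groups.
full_urls = [match.group(0) for match in URL_PATTERN.finditer(text)]
print(full_urls)
# ['https://www.tutorialexample.com/remove-english-stop-words-with-nltk-step-by-step-nltk-tutorial/', 'https://www.tutorialexample.com']
```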

However, when you extract links from a web page, you may also get relative URLs, such as:

['http://browsehappy.com/', '#content', '#python-network', '/', '/psf-landing/', 'https://docs.python.org']

To convert these relative URLs to absolute URLs, see this tutorial:

Convert Relative URL to Absolute URL in Python – Python Tutorial
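As a rough illustration of the conversion, the standard library function urllib.parse.urljoin() can resolve each link against the URL of the page it came from. The base URL https://www.python.org/ below is an assumption made for this example; in practice it should be the address of the page you crawled.

```python
from urllib.parse import urljoin

# Assumed base URL: the page the links were extracted from.
base_url = 'https://www.python.org/'

links = ['http://browsehappy.com/', '#content', '#python-network', '/', '/psf-landing/', 'https://docs.python.org']

# urljoin() leaves absolute URLs unchanged and resolves relative ones against base_url.
absolute_urls = [urljoin(base_url, link) for link in links]
print(absolute_urls)
# ['http://browsehappy.com/', 'https://www.python.org/#content', 'https://www.python.org/#python-network', 'https://www.python.org/', 'https://www.python.org/psf-landing/', 'https://docs.python.org']
```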
