Extract Domain and Subdomain From a URL in Python – Python Web Crawler Tutorial

By | July 13, 2020

Sometimes we need to crawl all the resources on a single site only. To do that, we have to get the domain or subdomain of the site from a URL. In this tutorial, we will show how to do this in Python.

Preliminary

Take this URL as an example:

https://www.tutorialexample.com/?s=lstm

https is the scheme, or protocol.

tutorialexample.com is the domain.

www.tutorialexample.com is the subdomain (the full host name, where www is the subdomain label).
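
As a quick illustration of the parts above, here is a minimal sketch that uses only the standard library's urllib.parse (not the tld package used in the rest of this tutorial). It reads the scheme and host name directly. Note that splitting the host on dots only gives its labels; it cannot tell which labels form the registered domain for suffixes such as co.uk, which is why a package like tld is useful. The variable names are chosen for illustration.

```python
from urllib.parse import urlsplit

url = "https://www.tutorialexample.com/?s=lstm"

parts = urlsplit(url)
scheme = parts.scheme    # the protocol part before "://"
host = parts.hostname    # the full host name, lowercased, without any port

# Splitting on dots yields the labels of the host name, but not
# which of them make up the registered domain.
labels = host.split(".")

print(scheme)
print(host)
print(labels)
```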

Next, we will use an example to show how to extract this information from a URL in Python.

Install the Python tld package

You can install it with pip:

pip install tld

Import library

from tld import get_tld

Create a url

We will extract the domain, subdomain and scheme from the URL below:

url = "https://www.tutorialexample.com/?s=lstm"

Extract domain, subdomain and scheme

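# Parse the URL; as_object=True returns a result object instead of a plain string
res = get_tld(url, as_object=True)
domain = res.fld  # first-level domain, here tutorialexample.com
# res.subdomain is empty when the URL has no subdomain, so only join when it is set
subdomain = f"{res.subdomain}.{domain}" if res.subdomain else domain
params = res.parsed_url  # SplitResult holding scheme, netloc, path, query, fragment

print(domain)
print(subdomain)
print(params)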

From the result, we can see:

The domain is tutorialexample.com

The subdomain is www.tutorialexample.com

params is:

SplitResult(scheme='https', netloc='www.tutorialexample.com', path='/', query='s=lstm', fragment='')
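
The SplitResult shown above is the same named tuple type that the standard library's urlsplit returns, so its fields can be read by name. The following sketch assumes that and builds an equivalent object directly with urlsplit, rather than through tld, so that it runs on its own. It also uses parse_qs, which is not part of the original example, to turn the query string into a dictionary.

```python
from urllib.parse import urlsplit, parse_qs

# Stand-in for res.parsed_url: urlsplit produces the same SplitResult type
params = urlsplit("https://www.tutorialexample.com/?s=lstm")

# Each component is available as a named attribute
print(params.netloc)
print(params.path)

# parse_qs maps each query parameter name to a list of its values
query = parse_qs(params.query)
print(query)
```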

To get the scheme, we can do this:

print(params.scheme)

The scheme is https.
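
Once the domain is known, a crawler can use it to keep only the links that belong to the same site. The sketch below is one possible way to do that with the standard library; the function name is_same_site and every link in the sample list other than the tutorial's own URL are hypothetical, and the domain is passed in as a plain string such as the one extracted above.

```python
from urllib.parse import urlsplit


def is_same_site(link, domain):
    """Return True if the link's host is the domain or one of its subdomains."""
    host = urlsplit(link).hostname
    if host is None:
        # Relative links and malformed URLs have no host name
        return False
    domain = domain.lower()
    # Require an exact match or a match on a label boundary, so that
    # "nottutorialexample.com" is not mistaken for "tutorialexample.com"
    return host == domain or host.endswith("." + domain)


# Hypothetical links a crawler might find on a page
links = [
    "https://www.tutorialexample.com/?s=lstm",
    "https://tutorialexample.com/about",
    "https://example.org/",
    "https://nottutorialexample.com/",
]

kept = [link for link in links if is_same_site(link, "tutorialexample.com")]
print(kept)
```

Matching on a leading dot instead of a bare suffix is a deliberate choice here: it avoids accepting unrelated hosts that merely end with the same characters.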
