Sometimes we need to crawl the resources of a single site only. To do that, we have to extract the domain or subdomain of the site from a URL. In this tutorial, we will show you how to do this in Python.
Preliminary
Take this URL as an example:
https://www.tutorialexample.com/?s=lstm
https is the scheme (protocol).
tutorialexample.com is the domain.
www.tutorialexample.com is the subdomain.
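As a quick illustration of these parts, here is a minimal sketch that uses only the standard library's urllib.parse. It separates the scheme and the full host name, but it cannot tell which part of the host is the registered domain; a package that knows the list of public suffixes is needed for that step.

```python
from urllib.parse import urlsplit

url = "https://www.tutorialexample.com/?s=lstm"

# urlsplit breaks a URL into scheme, netloc, path, query and fragment.
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # www.tutorialexample.com
print(parts.query)     # s=lstm
```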
Next, we will use an example to show how to extract this information from a URL in Python.
Install the Python tld package
You can install it with pip:
pip install tld
Import library
from tld import get_tld
Create a URL
We will extract the domain, subdomain and scheme from the URL below:
url = "https://www.tutorialexample.com/?s=lstm"
Extract domain, subdomain and scheme
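# as_object=True returns a result object instead of a plain string
res = get_tld(url, as_object=True)
# fld is the registered domain, e.g. tutorialexample.com
domain = res.fld
# res.subdomain holds only the prefix (here "www"), so join it with the domain
subdomain = res.subdomain + "." + domain
# parsed_url is the URL split into scheme, netloc, path, query and fragment
params = res.parsed_url
print(domain)
print(subdomain)
print(params)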
From the output, we can see that:
the domain is tutorialexample.com
the subdomain is www.tutorialexample.com
params is:
SplitResult(scheme='https', netloc='www.tutorialexample.com', path='/', query='s=lstm', fragment='')
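One caveat: the expression res.subdomain + "." + domain assumes the URL has a subdomain. For a URL such as https://tutorialexample.com/ the prefix should be empty, and the result would then start with a stray dot. The sketch below shows one way to guard against that. The helper name join_host is hypothetical and not part of the tld package; it takes the two strings you would get from res.subdomain and res.fld.

```python
def join_host(subdomain: str, fld: str) -> str:
    """Join a subdomain prefix and a registered domain into a full host name."""
    # An empty prefix means the URL had no subdomain, so return the domain alone.
    return f"{subdomain}.{fld}" if subdomain else fld


print(join_host("www", "tutorialexample.com"))  # www.tutorialexample.com
print(join_host("", "tutorialexample.com"))     # tutorialexample.com
```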
To get the scheme, we can do this:
print(params.scheme)
The scheme is https.
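Once the domain is known, a crawler can use it to stay on one site. The sketch below is one possible way to do that check, not the only one: it uses only the standard library, and the function name is_same_site and the sample links are made up for illustration. A link is kept when its host equals the domain or ends with "." plus the domain.

```python
from urllib.parse import urlsplit


def is_same_site(url: str, domain: str) -> bool:
    """Return True if the URL's host is the domain itself or one of its subdomains."""
    # hostname is None for relative links, so fall back to an empty string.
    host = (urlsplit(url).hostname or "").lower()
    domain = domain.lower()
    # The leading dot prevents "nottutorialexample.com" from matching.
    return host == domain or host.endswith("." + domain)


# Hypothetical links collected from a page
links = [
    "https://www.tutorialexample.com/?s=lstm",
    "https://tutorialexample.com/about",
    "https://nottutorialexample.com/",
    "https://www.python.org/",
]

same_site = [link for link in links if is_same_site(link, "tutorialexample.com")]
print(same_site)
```

Only the first two links pass the check, so a crawler that filters with it would not follow the other two.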