When processing text, we often need to split the content into sentences, then split each sentence into words. In this tutorial, we will show you how to do this with Python NLTK.
Import library
In order to tokenize words or sentences from a text, we need to use the nltk library.
from nltk import word_tokenize, sent_tokenize
Then we can use sent_tokenize() to split a document into sentences and word_tokenize() to split a sentence into words.
Tokenizing sentences
We will use an example to show how this works.
s = 'ASP.NET Webs Developerses guide. I likes it understanding'
sx = sent_tokenize(s)
print(sx)
Running this code, the example will output:
['ASP.NET Webs Developerses guide.', 'I likes it understanding']
A Python list containing the sentences is output.
Tokenizing words
Similar to tokenizing sentences, we can use word_tokenize() to split a sentence into words.
Here is an example:
for se in sx:
    wx = word_tokenize(se)
    print(wx)
Running this code, we will get:
['ASP.NET', 'Webs', 'Developerses', 'guide', '.']
['I', 'likes', 'it', 'understanding']
The words of each sentence are saved in their own Python list.