Tokenizing or Splitting Words and Sentences From String Using NLTK – NLTK Tutorial

admin

4 years ago

When processing text, we often need to split it into sentences and then split each sentence into words. In this tutorial, we will show you how to do this using Python's NLTK library.

Import library

To tokenize the words or sentences in a text, we need the nltk library.

from nltk import word_tokenize, sent_tokenize

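Then we can use sent_tokenize() to split a document into sentences and word_tokenize() to split a sentence into words. Both functions rely on NLTK's Punkt tokenizer data; if it is missing, NLTK raises a LookupError that tells you which resource to fetch with nltk.download().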

Tokenizing sentences

The following example shows how this works.

# Sample text containing two sentences
s = 'ASP.NET Webs Developerses guide. I likes it understanding'

# Split the text into a list of sentences
sx = sent_tokenize(s)
print(sx)

Running this code outputs:

['ASP.NET Webs Developerses guide.', 'I likes it understanding']

The result is a Python list of sentences. Note that the period inside ASP.NET is not treated as a sentence boundary.
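To see why a dedicated tokenizer helps, the sketch below splits the same string on every period with plain str.split(). This is only a comparison using the standard library, not how sent_tokenize() works internally, and the name naive_sentences is made up for illustration.

```python
# Same sample text as in the sentence example
s = 'ASP.NET Webs Developerses guide. I likes it understanding'

# Naive approach: treat every '.' as a sentence boundary,
# then trim whitespace and drop empty pieces
naive_sentences = [part.strip() for part in s.split('.') if part.strip()]
print(naive_sentences)
# ['ASP', 'NET Webs Developerses guide', 'I likes it understanding']
```

Here the period inside ASP.NET is wrongly treated as a boundary and the sentence-ending period is lost, which is the kind of mistake sent_tokenize() avoids for this input.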

Tokenizing words

Similarly, we can use word_tokenize() to split a sentence into words.

Here is an example:

for se in sx:
    wx = word_tokenize(se)
    print(wx)

Running this code, we get:

['ASP.NET', 'Webs', 'Developerses', 'guide', '.']
['I', 'likes', 'it', 'understanding']

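The words of each sentence are stored in a Python list. Note that punctuation, such as the final period of the first sentence, becomes a separate token.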
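The two steps can also be combined so that a whole document becomes a list of word lists, one per sentence. The helper below is a minimal sketch: tokenize_document is a hypothetical name, not an NLTK function, and it takes the two splitting functions as arguments so that the demo can run with simple standard-library stand-ins (str.splitlines and str.split) instead of NLTK.

```python
def tokenize_document(text, sent_split, word_split):
    """Split text into sentences, then split each sentence into words.

    sent_split and word_split are callables, for example NLTK's
    sent_tokenize and word_tokenize (hypothetical helper, for illustration).
    """
    return [word_split(sentence) for sentence in sent_split(text)]


# Stand-in demo without NLTK: one "sentence" per line, words split on whitespace
demo = tokenize_document('first line here\nsecond line', str.splitlines, str.split)
print(demo)
# [['first', 'line', 'here'], ['second', 'line']]
```

With NLTK imported as in this tutorial, calling tokenize_document(s, sent_tokenize, word_tokenize) should return the two word lists shown above, nested inside one outer list.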