A Beginner's Guide to Tokenizing Words and Sentences with NLTK

Before we process text, we usually need to tokenize it, that is, split it into words or sentences. In this tutorial, we will use a short example to show how to tokenize words and sentences with NLTK.

Preliminaries

from nltk.tokenize import word_tokenize, sent_tokenize

text = 'this is a text test. you can edit it for you!'

# split the text into word and punctuation tokens
word_token = word_tokenize(text)
print(word_token)

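Notice: if you get a LookupError saying "Resource punkt not found", the tokenizer data is not installed yet. You can fix it by running nltk.download('punkt') once. Newer NLTK releases may ask for punkt_tab instead; in that case run nltk.download('punkt_tab').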

The output is:

['this', 'is', 'a', 'text', 'test', '.', 'you', 'can', 'edit', 'it', 'for', 'you', '!']
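If word_tokenize() raised the "Resource punkt not found" error mentioned above, the download can be wrapped in a small check so it only happens when the data is missing. The following is a minimal sketch, not part of the original example: the helper name ensure_punkt is hypothetical, and it assumes the resource lives under the tokenizers/ data folder, which is where NLTK normally stores punkt.

```python
def ensure_punkt(resource="punkt"):
    """Download an NLTK tokenizer resource only if it is not installed yet.

    Returns True if the resource was already present, False if a download
    was attempted. (Hypothetical helper, for illustration only.)
    """
    # Imported inside the function so that merely defining this helper
    # does not require NLTK to be importable.
    import nltk

    try:
        # nltk.data.find() raises LookupError when the resource is missing
        nltk.data.find("tokenizers/" + resource)
        return True
    except LookupError:
        # Fetch the resource into the default NLTK data directory
        nltk.download(resource)
        return False


# Usage (left commented out because the first run needs network access):
# ensure_punkt()             # when the error names punkt
# ensure_punkt("punkt_tab")  # when the error names punkt_tab
```

Call it once before tokenizing, using whichever resource name appears in your error message.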

# split the same text into sentences
sent_token = sent_tokenize(text)
print(sent_token)

The output is:

['this is a text test.', 'you can edit it for you!']

From the outputs, we can see that word_tokenize() and sent_tokenize() each return a plain Python list, not a numpy.ndarray.
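Because the result is an ordinary list, the usual list tools apply to it directly. The sketch below illustrates this with the word token list copied from the output above as a literal, so it runs without NLTK installed; the punctuation filter using str.isalnum() is just one illustrative choice, not something NLTK requires.

```python
# Token list copied from the word_tokenize() output shown earlier,
# so this snippet runs without NLTK or its data files.
word_token = ['this', 'is', 'a', 'text', 'test', '.', 'you', 'can',
              'edit', 'it', 'for', 'you', '!']

# The result is a built-in list, so type checks and len() work as usual
print(type(word_token).__name__)  # list
print(len(word_token))            # 13

# A list comprehension can drop tokens that are only punctuation
words_only = [tok for tok in word_token if tok.isalnum()]
print(words_only)
```

If you do need an array, for example to pass the tokens to NumPy functions, numpy.array(word_token) would convert the list.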