A Beginner's Guide to Tokenizing Words and Sentences with NLTK – NLTK Tutorial

By | July 4, 2019

Before processing text, we usually need to tokenize it, that is, split it into words or sentences. In this tutorial, we will write an example that shows how to tokenize words and sentences with NLTK.

Preliminaries

from nltk.tokenize import word_tokenize, sent_tokenize

Create a text

text = 'this is a text test. you can edit it for you!'

Tokenize words

word_token = word_tokenize(text)
print(word_token)

Note: if NLTK raises a "Resource punkt not found" error, the Punkt tokenizer data has not been installed yet. To fix it, see:

Fix NLTK Resource punkt not found – NLTK Tutorial
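As a rough sketch of one common fix, the snippet below checks whether a resource is installed and downloads it only when it is missing. The helper name ensure_nltk_resource is hypothetical (it is not part of NLTK); nltk.data.find() and nltk.download() are the NLTK calls it relies on. It assumes network access is available. Recent NLTK releases may ask for a resource named "punkt_tab" instead of "punkt", so use whichever name appears in your error message.

```python
import nltk


def ensure_nltk_resource(name="punkt", category="tokenizers"):
    """Hypothetical helper: download an NLTK resource only if it is missing."""
    try:
        # nltk.data.find raises LookupError when the resource is not installed
        nltk.data.find(f"{category}/{name}")
        return True
    except LookupError:
        # nltk.download returns True on success, False otherwise
        return nltk.download(name, quiet=True)
```

Calling ensure_nltk_resource("punkt") once before word_tokenize() or sent_tokenize() should be enough; later runs find the data locally and skip the download.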

The output is:

['this', 'is', 'a', 'text', 'test', '.', 'you', 'can', 'edit', 'it', 'for', 'you', '!']

Tokenize sentences

sent_token = sent_tokenize(text)
print(sent_token)

The output is:

['this is a text test.', 'you can edit it for you!']

From the outputs, we can see that word_tokenize() and sent_tokenize() both return a plain Python list of strings, not a numpy.ndarray.
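Because the results are ordinary lists, the usual list operations apply directly. The short sketch below illustrates this using the two outputs from this tutorial, copied in as literals so that it runs without NLTK installed. The variable words_only is introduced here only for illustration.

```python
# Token lists copied from the outputs shown in this tutorial
word_token = ['this', 'is', 'a', 'text', 'test', '.', 'you', 'can',
              'edit', 'it', 'for', 'you', '!']
sent_token = ['this is a text test.', 'you can edit it for you!']

# Both results are plain Python lists
print(type(word_token).__name__)  # list
print(len(word_token))            # 13
print(len(sent_token))            # 2

# Ordinary list comprehensions work, e.g. dropping punctuation tokens
words_only = [token for token in word_token if token.isalnum()]
print(len(words_only))            # 11
```

If an array is needed later, the list can be converted explicitly at that point.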
