Before processing text, we usually need to tokenize it, that is, split it into words or sentences. In this tutorial, we will use a short example to show how to tokenize words and sentences with NLTK.
Preliminaries
from nltk.tokenize import word_tokenize, sent_tokenize
Create a text
text = 'this is a text test. you can edit it for you!'
Tokenize words
word_token = word_tokenize(text)
print(word_token)
Note: if you get the error "Resource punkt not found", the punkt tokenizer data is not installed. You can fix it by following this tutorial:
Fix NLTK Resource punkt not found – NLTK Tutorial
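As a minimal sketch of one common fix, the helper below checks whether the tokenizer data is available and downloads it only when it is missing. The function name ensure_punkt is hypothetical and used only for illustration. The resource name "punkt" is taken from the error message above; some newer NLTK releases may ask for a different name such as "punkt_tab", so pass whatever name your error message reports.

```python
import nltk


def ensure_punkt(resource="punkt"):
    """Download tokenizer data only if NLTK cannot already find it.

    Hypothetical helper for illustration. Returns True if a download
    was attempted, False if the resource was already available.
    """
    try:
        # nltk.data.find() raises LookupError when the resource is missing.
        nltk.data.find("tokenizers/" + resource)
        return False
    except LookupError:
        # Fetch the resource into the default NLTK data directory.
        nltk.download(resource)
        return True
```

Calling ensure_punkt() once before word_tokenize() or sent_tokenize() should be enough; the download needs network access the first time.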
The output of word_tokenize(text) is:
['this', 'is', 'a', 'text', 'test', '.', 'you', 'can', 'edit', 'it', 'for', 'you', '!']
Tokenize sentences
sent_token = sent_tokenize(text)
print(sent_token)
The output is:
['this is a text test.', 'you can edit it for you!']
From these outputs, we can see that both word_tokenize() and sent_tokenize() return a Python list of strings, not a numpy.ndarray.
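Because the result is a plain list, the usual list operations apply to it directly. The sketch below illustrates this with the word token list copied from the output shown earlier, so it runs without NLTK. The variable names words and counts are introduced here for illustration only.

```python
from collections import Counter

# Token list copied from the word_tokenize() output shown earlier.
word_token = ['this', 'is', 'a', 'text', 'test', '.', 'you', 'can',
              'edit', 'it', 'for', 'you', '!']

# The tokens are ordinary strings in an ordinary list.
print(type(word_token).__name__)  # list

# Drop punctuation tokens with a list comprehension.
words = [token for token in word_token if token.isalpha()]
print(len(word_token), len(words))  # 13 11

# Count how often each remaining word occurs.
counts = Counter(words)
print(counts.most_common(1))  # [('you', 2)]
```

If you need array operations later, you can convert the list explicitly at that point.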