In the previous tutorial, we learned how to lemmatize a word with NLTK. However, the result is not always correct. In this tutorial, we will use each word's part-of-speech to improve it.
Preliminaries
Before starting this tutorial, you should read these basic tutorials.
An introduction to word lemmatization in nltk
Implement Word Lemmatization with NLTK for Beginner – NLTK Tutorial
An introduction to nltk word part-of-speech tagging
A Simple Guide to NLTK Tag Word Parts-of-Speech – NLTK Tutorial
Improve nltk word lemmatization with word part-of-speech
Import libraries
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
Get the type of word part-of-speech
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
This function only handles nouns, verbs, adjectives and adverbs, and returns None for every other tag. You can extend it to cover more cases.
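As one possible way to extend the mapping, here is a minimal sketch that replaces the if/elif chain with a lookup table and a configurable fallback. The name get_wordnet_pos_with_default and the table TAG_PREFIX_TO_WORDNET are made up for this illustration. The POS constants are written as plain strings so the sketch runs without NLTK; they have the same values as wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.

```python
# WordNet POS constants as plain strings (assumption for this sketch: we avoid
# importing NLTK; these equal wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV).
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet POS (hypothetical helper table).
TAG_PREFIX_TO_WORDNET = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}


def get_wordnet_pos_with_default(treebank_tag, default=NOUN):
    """Map a Penn Treebank tag to a WordNet POS, falling back to `default`."""
    # treebank_tag[:1] is "" for an empty tag, which simply misses the table.
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)


print(get_wordnet_pos_with_default("VBN"))  # v
print(get_wordnet_pos_with_default("JJR"))  # a
print(get_wordnet_pos_with_default("DT"))   # n (fallback)
```

Adding a new category then only requires a new entry in the table, and the fallback can be changed per call instead of being fixed in the caller.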
Get word lemmatization based on word part-of-speech
def lemmatize_sentence(sentence):
    res = []
    lemmatizer = WordNetLemmatizer()
    # get each word and its part-of-speech tag
    for word, pos in pos_tag(word_tokenize(sentence)):
        # fall back to noun when the tag has no WordNet equivalent
        wordnet_pos = get_wordnet_pos(pos) or wordnet.NOUN
        res.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
    return res
The key line in this function is:
lemmatizer.lemmatize(word, pos=wordnet_pos)
This call lemmatizes a word according to its part-of-speech. If you omit the pos argument, lemmatize() treats the word as a noun.
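To make the role of the pos argument easier to see, the loop can also be sketched with the tagger output and the lemmatizer passed in from outside. This is only an illustration: lemmatize_tagged, fake_pos and fake_lemmatize are hypothetical names, and the two fake functions are stand-ins that let the sketch run without NLTK data. They do not reproduce real NLTK behaviour.

```python
def lemmatize_tagged(tagged_tokens, lemmatize, to_wordnet_pos):
    """Lemmatize (word, treebank_tag) pairs.

    lemmatize: callable (word, pos) -> lemma
    to_wordnet_pos: callable tag -> WordNet POS, or None if there is no match
    """
    result = []
    for word, tag in tagged_tokens:
        # "n" is the value of wordnet.NOUN; used when the tag has no mapping.
        pos = to_wordnet_pos(tag) or "n"
        result.append(lemmatize(word, pos))
    return result


# Stand-ins for this sketch only (NOT real NLTK behaviour).
def fake_pos(tag):
    return "v" if tag.startswith("V") else None


def fake_lemmatize(word, pos):
    return {("done", "v"): "do"}.get((word, pos), word)


print(lemmatize_tagged([("done", "VBN"), ("work", "NN")], fake_lemmatize, fake_pos))
# ['do', 'work']
```

With NLTK installed, you could pass get_wordnet_pos as to_wordnet_pos, a small wrapper such as lambda w, p: lemmatizer.lemmatize(w, pos=p) as lemmatize, and pos_tag(word_tokenize(sentence)) as the tagged tokens.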
Print the result
print(lemmatize_sentence('done'))
The result is: ['do']
However, if you lemmatize the word without its part-of-speech, for example with lemmatizer.lemmatize('done'), you will get: done