Improve NLTK Word Lemmatization with Parts-of-Speech – NLTK Tutorial

August 30, 2019

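In the previous tutorial, we learned how to lemmatize a word with NLTK. However, the result is not always correct, because the lemmatizer treats every word as a noun unless told otherwise. In this tutorial, we will use each word's part-of-speech to improve it.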

Preliminaries

Before starting this tutorial, you should read these basic tutorials.

An introduction to word lemmatization in nltk

Implement Word Lemmatization with NLTK for Beginner – NLTK Tutorial

An introduction to nltk word part-of-speech tagging

A Simple Guide to NLTK Tag Word Parts-of-Speech – NLTK Tutorial

Improve nltk word lemmatization with word part-of-speech

Import libraries

import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet

Get the WordNet part-of-speech of a word

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

This function handles only nouns, verbs, adjectives and adverbs, and returns None for any other tag. You can extend it to handle more cases.
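As one possible way to extend the function, the mapping can be written as a lookup table. The sketch below is a variant for illustration, not the function used in this tutorial: it returns WordNet's one-letter part-of-speech codes directly ('a', 'v', 'n', 'r', which are the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV), so it runs without loading the WordNet corpus. The names PREFIX_TO_WORDNET and treebank_to_wordnet, and the default parameter, are hypothetical choices.

```python
# Treebank tag prefix -> WordNet POS code.
# These one-letter strings are the values of wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV in nltk.corpus.wordnet.
PREFIX_TO_WORDNET = {
    "J": "a",  # JJ, JJR, JJS -> adjective
    "V": "v",  # VB, VBD, VBG, VBN, VBP, VBZ -> verb
    "N": "n",  # NN, NNS, NNP, NNPS -> noun
    "R": "r",  # RB, RBR, RBS -> adverb
}


def treebank_to_wordnet(treebank_tag, default=None):
    """Return the WordNet POS code for a Penn Treebank tag, or `default`."""
    # Only the first character of the tag decides the word class.
    return PREFIX_TO_WORDNET.get(treebank_tag[:1], default)


print(treebank_to_wordnet("VBN"))               # v
print(treebank_to_wordnet("JJR"))               # a
print(treebank_to_wordnet("DT"))                # None
print(treebank_to_wordnet("DT", default="n"))   # n
```

Adding a new word class then means adding one entry to the table instead of another elif branch.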

Lemmatize words based on their part-of-speech

def lemmatize_sentence(sentence):
    res = []
    lemmatizer = WordNetLemmatizer()
    # get word and its pos
    for word, pos in pos_tag(word_tokenize(sentence)):
        wordnet_pos = get_wordnet_pos(pos) or wordnet.NOUN
        res.append(lemmatizer.lemmatize(word, pos=wordnet_pos))

    return res

The key line of this function is:

lemmatizer.lemmatize(word, pos=wordnet_pos)

It lemmatizes the word according to its part-of-speech. When get_wordnet_pos() returns None, the function falls back to wordnet.NOUN.
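To see the role of the pos argument and the noun fallback in isolation, here is a minimal sketch that works on already tagged tokens and takes the lemmatizer as a parameter, so it runs without any NLTK data. The function lemmatize_tagged, the table TOY_LEMMAS and the stand-in toy_lemmatize are hypothetical and only for illustration; with NLTK installed you could pass WordNetLemmatizer().lemmatize instead.

```python
def lemmatize_tagged(tagged_tokens, lemmatize, default_pos="n"):
    """Lemmatize (word, treebank_tag) pairs with a POS-aware callable.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize from NLTK.
    """
    # One-letter WordNet POS codes: adjective, verb, noun, adverb.
    prefix_to_pos = {"J": "a", "V": "v", "N": "n", "R": "r"}
    lemmas = []
    for word, tag in tagged_tokens:
        # Tags outside the four word classes fall back to the default (noun).
        pos = prefix_to_pos.get(tag[:1], default_pos)
        lemmas.append(lemmatize(word, pos))
    return lemmas


# Hypothetical stand-in for a real lemmatizer: a tiny lookup table
# keyed by (word, pos). Unknown pairs are returned unchanged.
TOY_LEMMAS = {("done", "v"): "do", ("books", "n"): "book"}


def toy_lemmatize(word, pos):
    return TOY_LEMMAS.get((word, pos), word)


tagged = [("done", "VBN"), ("books", "NNS"), ("the", "DT")]
print(lemmatize_tagged(tagged, toy_lemmatize))  # ['do', 'book', 'the']
```

In this toy table, "done" is only reduced to "do" when it is looked up as a verb, which mirrors why passing the part-of-speech matters.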

Print the result

print(lemmatize_sentence('done'))

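The result is: ['do']

Note that lemmatize_sentence() returns a list with one lemma per token.

However, if you lemmatize without the part-of-speech, for example with lemmatizer.lemmatize('done'), you will get: done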
