Out-Of-Vocabulary (OOV) words are an important problem in NLP. In this tutorial, we will introduce how to process words that are out of vocabulary.
Why do Out-Of-Vocabulary (OOV) words exist?
We often use word2vec or GloVe to process documents and create word vectors (word embeddings). Here are tutorials:
Best Practice to Create Word Embeddings Using Word2Vec – Word2Vec Tutorial
Best Practice to Create Word Embeddings Using GloVe – Deep Learning Tutorial
However, when building the vocabulary we may drop words that appear rarely in the documents, which can cause the OOV problem.
We may also use a pre-trained word representation file that does not contain some of the words in our data set. This can cause the OOV problem as well.
How to fix the OOV problem?
There are three main ways that are often used in AI applications.
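To make the first cause concrete, here is a minimal sketch of how a frequency cutoff produces OOV words. The helper build_vocab, the min_count threshold of 2, and the toy sentences are assumptions made for illustration; they are not taken from word2vec or GloVe themselves.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only words that occur at least min_count times (hypothetical helper)."""
    counts = Counter(word for sen in sentences for word in sen.strip().split())
    return {word for word, n in counts.items() if n >= min_count}

# Toy training data, assumed for illustration only.
train_sentences = ["the cat sat", "the cat ran", "a dog ran"]
vocab = build_vocab(train_sentences, min_count=2)

# Words in new text that are missing from the vocabulary are OOV.
new_sentence = "the dog sat"
oov_words = [w for w in new_sentence.split() if w not in vocab]
print(sorted(vocab))
print(oov_words)
```

Here "dog" and "sat" each appear only once in the training data, so they are dropped from the vocabulary and become OOV when they show up later.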
Way 1: Ignoring them
Generally, words that are out of vocabulary appear rarely, so they contribute little to our model. Dropping them barely reduces the performance of the model, which means we can simply ignore them.
Here is an example:
for sen in sentences:
    for word in sen.strip().split():
        if word not in words_dict:
            continue  # skip words that are out of vocabulary
        # process the in-vocabulary word here
Way 2: Replacing them using <UNK>
We can replace every word that is out of vocabulary with the special token <UNK>, so that all OOV words share a single vocabulary entry.
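The replacement can be sketched as follows. The function replace_oov and the small words_dict mapping are hypothetical names chosen for this example; the only requirement assumed here is that <UNK> itself has an entry in the vocabulary.

```python
UNK = "<UNK>"

def replace_oov(sentence, words_dict, unk_token=UNK):
    """Return the tokens of a sentence with OOV words replaced by unk_token."""
    return [w if w in words_dict else unk_token for w in sentence.strip().split()]

# Toy vocabulary mapping words to ids (assumed for illustration).
words_dict = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3}

tokens = replace_oov("the cat sat quietly", words_dict)
ids = [words_dict[t] for t in tokens]
print(tokens)
print(ids)
```

Because "quietly" is not in words_dict, it is mapped to <UNK> and receives the id reserved for that token.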
Way 3: Initializing them by a uniform distribution with range [-0.01, 0.01]
The vectors of Out-Of-Vocabulary (OOV) words can be initialized by sampling from a uniform distribution with range [-0.01, 0.01]. We can then use these vectors when training our model.
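Here is a minimal sketch of this initialization using only the Python standard library. The names init_oov_vector and lookup, the embedding dimension of 3, and the toy embeddings dictionary are assumptions for illustration; in a real project the same idea would usually be applied to the embedding matrix of your deep learning framework.

```python
import random

def init_oov_vector(dim, low=-0.01, high=0.01, rng=None):
    """Sample a dim-sized vector from a uniform distribution on [low, high]."""
    rng = rng or random
    return [rng.uniform(low, high) for _ in range(dim)]

def lookup(word, embeddings, dim, rng=None):
    """Return the vector of a word, creating a random one if the word is OOV."""
    if word not in embeddings:
        embeddings[word] = init_oov_vector(dim, rng=rng)
    return embeddings[word]

rng = random.Random(0)  # fixed seed so the run is repeatable
embeddings = {"cat": [0.12, -0.30, 0.05]}  # toy pre-trained vectors (assumed)

vec = lookup("zyzzyva", embeddings, dim=3, rng=rng)
print(vec)
```

The new vector is stored in embeddings, so the same OOV word gets the same vector every time it is looked up.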