In a previous tutorial, we used the Python difflib library to compute the similarity of two sentences. Here are the details:
Python Calculate the Similarity of Two Sentences – Python Tutorial
However, we can also use the Python gensim library to compute their similarity. In this tutorial, we will show you how to do it.
In this example, we will use gensim to load a trained word2vec model, get the word embeddings, and then calculate the cosine similarity of two sentences.
Import library
import gensim
Load word2vec embeddings file
model = gensim.models.KeyedVectors.load_word2vec_format('yelp-2013-embedding-200d.txt', binary=False)
We can look up the embedding of each word in a sentence from the word2vec embeddings file, then combine the word embeddings to get the sentence embedding.
Create two sentences
sen_1 = "i love this book"
sen_2 = 'this book is my favorite'
To compare with the Python difflib library, we use the same two sentences.
How to get sentence embeddings?
In this example, we will average the word embeddings in a sentence to get the sentence embedding.
Notice: This is a simple method, but not a good one, because each word may contribute a different amount of meaning to the sentence.
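The averaging idea can be sketched without gensim. The snippet below is only an illustration: the 3-dimensional vectors in toy_vectors are made up, and the helper names sentence_embedding and cosine_similarity are hypothetical, not part of gensim. It averages the vectors of the in-vocabulary words of each sentence and then takes the cosine of the two averages.

```python
import math

# Toy 3-dimensional embeddings, made up for illustration only.
# A real word2vec file would supply vectors with many more dimensions.
toy_vectors = {
    "i": [0.1, 0.3, 0.5],
    "love": [0.9, 0.1, 0.2],
    "this": [0.2, 0.2, 0.1],
    "book": [0.4, 0.8, 0.3],
    "is": [0.1, 0.1, 0.1],
    "my": [0.2, 0.4, 0.4],
    "favorite": [0.8, 0.2, 0.3],
}


def sentence_embedding(sentence, vectors):
    """Average the vectors of the words that appear in the vocabulary."""
    words = [w for w in sentence.split() if w in vectors]
    if not words:
        raise ValueError("no word of the sentence is in the vocabulary")
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


emb_1 = sentence_embedding("i love this book", toy_vectors)
emb_2 = sentence_embedding("this book is my favorite", toy_vectors)
print(cosine_similarity(emb_1, emb_2))
```

Because every word gets the same weight in the average, frequent words such as "this" or "is" influence the result as much as content words, which is the weakness mentioned in the notice above.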
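Calculate the cosine similarity of two sentences
# Keep only the words that exist in the model's vocabulary.
# Note: in gensim 4.0 and later, use model.key_to_index instead of model.vocab.
sen_1_words = [w for w in sen_1.split() if w in model.vocab]
sen_2_words = [w for w in sen_2.split() if w in model.vocab]
# n_similarity computes the cosine similarity between the mean vectors of the two word lists.
sim = model.n_similarity(sen_1_words, sen_2_words)
print(sim)
First, we split each sentence into a list of words and keep only the words found in the model's vocabulary, then we compute the cosine similarity of the two lists. The similarity is: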
0.839574928046
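With the Python difflib library, the similarity is 0.75. Since 0.75 < 0.839574928046, gensim gives a higher score to these two sentences, which have similar meanings. This suggests that word embeddings capture semantic similarity better than difflib does on this example, although the two scores are computed in different ways and are not directly comparable.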
If you want to compute the similarity of two words with gensim, you can read this tutorial:
Python Gensim Read Word2Vec Word Embeddings and Compute Word Similarity