Python Calculate the Similarity of Two Sentences with Gensim – Gensim Tutorial

By | October 31, 2019

In the previous tutorial, we used the Python difflib library to compute the similarity of two sentences. Here are the details:

Python Calculate the Similarity of Two Sentences – Python Tutorial

We can also use the Python gensim library to compute their similarity. In this tutorial, we will show you how.

In this example, we will use gensim to load a trained word2vec model, look up the word embeddings, and then calculate the cosine similarity of two sentences.

Import library

import gensim

Load word2vec embeddings file

model = gensim.models.KeyedVectors.load_word2vec_format('yelp-2013-embedding-200d.txt', binary=False)

For each word in a sentence, we can look up its embedding in the word2vec embeddings file, then combine the word embeddings to get the sentence embedding.
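To make the file format concrete, here is a minimal sketch of what a word2vec text file (the kind loaded with binary=False) contains: a header line with the vocabulary size and the vector dimension, then one word and its vector per line. The sample data and the helper read_word2vec_text are hypothetical and only for illustration; in practice gensim does this parsing for you.

```python
import io

# Hypothetical miniature file in the word2vec text format: a header line
# "<vocab_size> <dimensions>", then one word followed by its vector per line.
SAMPLE = """3 4
book 0.1 0.2 0.3 0.4
love 0.5 0.1 0.0 0.2
this 0.0 0.3 0.1 0.1
"""

def read_word2vec_text(handle):
    """Parse word2vec text-format lines into a {word: [floats]} dict."""
    vocab_size, dim = (int(n) for n in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, vocab_size, dim

vectors, vocab_size, dim = read_word2vec_text(io.StringIO(SAMPLE))
print(vocab_size, dim, vectors["book"])
```

The real yelp-2013-embedding-200d.txt file is assumed to follow the same layout, with 200 numbers after each word.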

Create two sentences

sen_1 = "i love this book"
sen_2 = 'this book is my favorite'

To compare with the Python difflib library, we use the same two sentences as in the previous tutorial.

How to get sentence embeddings?

In this example, we will average the embeddings of the words in a sentence to get the sentence embedding.

Notice: This is a simple method, but not a good one, because each word may contribute differently to the meaning of a sentence.
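The averaging idea can be sketched in plain Python without any model file. The tiny 3-dimensional vectors below are made up for illustration (real embeddings here would have 200 dimensions), and sentence_embedding and cosine are hypothetical helper names, not gensim functions.

```python
import math

# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_vectors = {
    "i": [0.1, 0.3, 0.2],
    "love": [0.7, 0.2, 0.1],
    "this": [0.2, 0.2, 0.6],
    "book": [0.4, 0.8, 0.1],
    "favorite": [0.6, 0.3, 0.2],
}

def sentence_embedding(words, vectors):
    """Average the vectors of the words that have an embedding."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no word of the sentence has an embedding")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

emb_1 = sentence_embedding("i love this book".split(), toy_vectors)
emb_2 = sentence_embedding("this book is my favorite".split(), toy_vectors)
print(cosine(emb_1, emb_2))
```

Words without an embedding ("is" and "my" in the toy table) are simply skipped, so they do not affect the average.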

Calculate cosine similarity of two sentences

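# Keep only the words that have an embedding in the model.
# "w in model" works in both old and new gensim versions; model.vocab was removed in gensim 4.
sen_1_words = [w for w in sen_1.split() if w in model]
sen_2_words = [w for w in sen_2.split() if w in model]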

sim = model.n_similarity(sen_1_words, sen_2_words)
print(sim)

First, we split each sentence into a word list and keep only the words found in the model, then compute the cosine similarity of the two word lists. The similarity is:

0.839574928046
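One edge case in the filtering step is worth guarding against: if none of the words in a sentence is in the model, the word list is empty and there is nothing to average. A minimal sketch of a wrapper follows; sentence_similarity and its default parameter are hypothetical names, and model is assumed to be the KeyedVectors object loaded earlier.

```python
def sentence_similarity(model, sen_a, sen_b, default=0.0):
    """Hypothetical wrapper: similarity of two sentences, or `default`
    when either sentence has no word known to the model."""
    words_a = [w for w in sen_a.split() if w in model]
    words_b = [w for w in sen_b.split() if w in model]
    if not words_a or not words_b:
        # Nothing to average for at least one sentence.
        return default
    return float(model.n_similarity(words_a, words_b))
```

With the model from this tutorial, it could be called as sentence_similarity(model, sen_1, sen_2).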

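With the Python difflib library, the similarity is 0.75. The gensim score, 0.839574928046, is higher, but the two numbers are not directly comparable: difflib matches the text of the sentences, while gensim compares word embeddings, so its score reflects how close the sentences are in meaning rather than in wording.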

If you want to compute the similarity of two words with gensim, you can read this tutorial:

Python Gensim Read Word2Vec Word Embeddings and Compute Word Similarity