PyTorch Calculate Similarity Between Chinese Sentences – PyTorch Tutorial

June 21, 2023

In this tutorial, we will discuss how to compute the similarity between different Chinese sentences using a PyTorch deep learning model.

Here are more tutorials that explain how to calculate sentence similarity:

Python Calculate the Similarity of Two Sentences with Gensim – Gensim Tutorial

Python Calculate the Similarity of Two Sentences – Python Tutorial

How to calculate the similarity between Chinese sentences?

It takes two steps.

Step 1: get the representation of each Chinese sentence.

In order to get the representation of a Chinese sentence, we will use the GanymedeNil/text2vec-large-chinese model. You can download it here:

https://huggingface.co/GanymedeNil/text2vec-large-chinese
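
For example, here is a minimal sketch that pre-downloads the model files with huggingface_hub. This step is optional: from_pretrained() in the code below will also download and cache the files automatically.

from huggingface_hub import snapshot_download

# Fetch all files of the model repository into the local HuggingFace cache
snapshot_download(repo_id='GanymedeNil/text2vec-large-chinese')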

We can use the code below to get it:

# -*- coding: utf-8 -*-
from transformers import BertTokenizer, BertModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('GanymedeNil/text2vec-large-chinese')


model = BertModel.from_pretrained('GanymedeNil/text2vec-large-chinese')
print(model)
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    print(model_output)
    print("model_output.last_hidden_state shape = ", model_output.last_hidden_state.shape)
    print("model_output.pooler_output shape = ", model_output.pooler_output.shape)
    print("attention_mask shape = ", encoded_input['attention_mask'].shape)
    print("model_output length = ", len(model_output))

# Perform mean pooling to get one fixed-size vector per sentence
print("token_embeddings shape=", model_output.last_hidden_state.shape)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

Run this code, and we will find that model_output is a BaseModelOutputWithPoolingAndCrossAttentions instance. To understand it, we can read:

Understand BaseModelOutputWithPoolingAndCrossAttentions with Examples – PyTorch Tutorial

In this code, we get each sentence representation with the mean_pooling() function.
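
To see why the attention mask matters in mean pooling, here is a self-contained toy sketch (the numbers are made up for illustration, not taken from the model above):

import torch

# 2 "sentences", 4 token positions, embedding size 3
token_embeddings = torch.tensor([
    [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # 2 real tokens + 2 pads
    [[2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [6.0, 6.0, 6.0], [0.0, 0.0, 0.0]],  # 3 real tokens + 1 pad
])
attention_mask = torch.tensor([[1, 1, 0, 0],
                               [1, 1, 1, 0]])

mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
masked_mean = torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
naive_mean = token_embeddings.mean(1)

print(masked_mean)  # tensor([[2., 2., 2.], [4., 4., 4.]]) - padding excluded
print(naive_mean)   # tensor([[1., 1., 1.], [3., 3., 3.]]) - padding drags the average down

The masked version averages only over real tokens, which is why mean_pooling() divides by the mask sum instead of the sequence length.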

Moreover, to learn how to use BertTokenizer, you can read:

Understand transformers.BertTokenizer with Examples – PyTorch Tutorial
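
As a quick illustration, the tokenizer call returns a dictionary-like object containing input_ids, token_type_ids and attention_mask tensors. This snippet reuses tokenizer and sentences from the script above; the exact padded length depends on the inputs.

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
print(encoded_input.keys())              # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(encoded_input['input_ids'].shape)  # torch.Size([2, 13]): 2 sentences padded to 13 tokens
print(encoded_input['attention_mask'])   # 1 for real tokens, 0 for padding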

Running the full script above, we will get:

model_output.last_hidden_state shape =  torch.Size([2, 13, 1024])
model_output.pooler_output shape =  torch.Size([2, 1024])
attention_mask shape =  torch.Size([2, 13])
model_output length =  2
token_embeddings shape= torch.Size([2, 13, 1024])
Sentence embeddings:
tensor([[-0.5050, -0.1925,  0.5590,  ...,  0.8610, -0.7712,  0.7617],
        [-0.6504,  0.1314,  0.5595,  ...,  1.0802, -0.4565,  0.7547]])

Then, we can compute the similarity between them.

Step 2: compute the Chinese sentence similarity

The example code is here:

# Take the two sentence embeddings as 1 x 1024 tensors
x = sentence_embeddings[0, :].unsqueeze(0)
y = sentence_embeddings[1, :].unsqueeze(0)

# Cosine similarity along the embedding dimension
cosine = torch.nn.CosineSimilarity(dim=1)
print(cosine(x, y))

Run this code, and we will get:

tensor([0.9368])
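
Because cosine similarity is just the dot product of L2-normalized vectors, we can also score all sentence pairs at once. Here is a minimal sketch, assuming sentence_embeddings holds one row per sentence as above:

import torch.nn.functional as F

# L2-normalize each row, then a matrix product yields all pairwise cosine similarities
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = torch.mm(normalized, normalized.T)
print(similarity_matrix)  # the diagonal is 1.0; off-diagonal entries are the pairwise scores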

Finally, we can convert this tensor to a NumPy array.

Convert PyTorch Tensor to NumPy: A Step Guide – PyTorch Tutorial
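
For example, a minimal sketch continuing from the code above:

similarity = cosine(x, y)
# detach() drops the autograd graph; cpu() is needed if the tensor lives on a GPU
score = similarity.detach().cpu().numpy()
print(score)              # e.g. [0.9368]
print(similarity.item())  # or extract a plain Python float directly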