In this tutorial, we will discuss how to compute the similarity between Chinese sentences using a PyTorch deep learning model.
Here are more tutorials on how to calculate sentence similarity:
Python Calculate the Similarity of Two Sentences with Gensim – Gensim Tutorial
Python Calculate the Similarity of Two Sentences – Python Tutorial
How to calculate the similarity between Chinese sentences?
There are two steps.
Step 1: get the representation of each Chinese sentence
To get the representation of a Chinese sentence, we will use the GanymedeNil/text2vec-large-chinese model. You can download it here:
https://huggingface.co/GanymedeNil/text2vec-large-chinese
We can use the code below to get the sentence embeddings.
# -*- coding:utf-8 -*-
from transformers import BertTokenizer, BertModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    print("attention_mask shape = ", attention_mask.shape)
    print("model_output length = ", len(model_output))
    # last_hidden_state (the first element of model_output) contains all token embeddings
    token_embeddings = model_output.last_hidden_state
    print("token_embeddings shape= ", token_embeddings.shape)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('GanymedeNil/text2vec-large-chinese')
model = BertModel.from_pretrained('GanymedeNil/text2vec-large-chinese')
print(model)

# Roughly: "How to change the bank card linked to Huabei" and "Huabei change linked bank card"
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
print(model_output)
print("model_output.last_hidden_state shape = ", model_output.last_hidden_state.shape)
print("model_output.pooler_output shape = ", model_output.pooler_output.shape)

# Perform pooling to get one embedding per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
Run this code, and we will find that model_output is a BaseModelOutputWithPoolingAndCrossAttentions instance. To understand it, you can read:
Understand BaseModelOutputWithPoolingAndCrossAttentions with Examples – PyTorch Tutorial
In this code, we get the representation of each sentence with the mean_pooling() function, which averages the token embeddings of a sentence while ignoring padding positions.
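To see why the attention mask matters, here is a minimal sketch of the same masked averaging on a tiny hand-made tensor, so it runs without downloading the model. The function name masked_mean and the toy numbers are assumptions made for illustration; they are not part of the tutorial code.

```python
import torch

def masked_mean(token_embeddings, attention_mask):
    # Expand the mask from (batch, seq_len) to (batch, seq_len, hidden)
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum only the embeddings of real tokens (mask == 1)
    summed = torch.sum(token_embeddings * mask, dim=1)
    # Number of real tokens per sentence; clamp avoids division by zero
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts

# Toy batch (made-up values): 2 sentences, 3 token positions, hidden size 2
toy_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
toy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(toy_embeddings, toy_mask)
print(pooled)                      # rows are [2., 3.] and [2., 2.]
print(toy_embeddings.mean(dim=1))  # naive mean, distorted by the padding row
```

The first sentence is averaged over its two real tokens only, so the large values in the padding position do not affect its embedding.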
Moreover, to learn how to use BertTokenizer, you can read:
Understand transformers.BertTokenizer with Examples – PyTorch Tutorial
We will get:
model_output.last_hidden_state shape = torch.Size([2, 13, 1024])
model_output.pooler_output shape = torch.Size([2, 1024])
attention_mask shape = torch.Size([2, 13])
model_output length = 2
token_embeddings shape= torch.Size([2, 13, 1024])
Sentence embeddings:
tensor([[-0.5050, -0.1925, 0.5590, ..., 0.8610, -0.7712, 0.7617],
        [-0.6504, 0.1314, 0.5595, ..., 1.0802, -0.4565, 0.7547]])
Then, we can compute the similarity between them.
Step 2: compute the Chinese sentence similarity
The example code is here:
x = sentence_embeddings[0, :].unsqueeze(0)
y = sentence_embeddings[1, :].unsqueeze(0)
cosine = torch.nn.CosineSimilarity(dim=1)
print(cosine(x, y))
Run this code, and we will get:
tensor([0.9368])
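For intuition about this score, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that formula out in plain Python on made-up vectors. The helper cosine_similarity and its eps guard are illustrative assumptions, not the internals of torch.nn.CosineSimilarity.

```python
import math

def cosine_similarity(x, y, eps=1e-8):
    # Dot product of the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    # Euclidean length of each vector
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # eps (an assumed guard value) avoids division by zero for all-zero vectors
    return dot / max(norm_x * norm_y, eps)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction)  # close to 1: the vectors point the same way
print(orthogonal)      # 0.0: the vectors are unrelated
print(opposite)        # close to -1: the vectors point in opposite directions
```

A value of 0.9368 is therefore close to the top of the range from -1 to 1, which suggests the two sentence embeddings point in nearly the same direction.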
Finally, we can convert this tensor to NumPy data. To learn how, you can read:
Convert PyTorch Tensor to NumPy: A Step Guide – PyTorch Tutorial
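As a quick illustration of that conversion, here is a minimal sketch. It uses a stand-in tensor holding the value from the output above, so it runs without the model; in the tutorial code the tensor would be the result of cosine(x, y). The variable names are assumptions for illustration.

```python
import torch

# Stand-in for the result of cosine(x, y); the value is copied from the output above
similarity = torch.tensor([0.9368])

# detach() drops gradient tracking and cpu() moves the data to host memory,
# both of which numpy() requires in the general case
as_array = similarity.detach().cpu().numpy()

# item() returns a plain Python float for a single-element tensor
as_float = similarity.item()

print(type(as_array), as_array.shape)
print(type(as_float), as_float)
```

Use numpy() when you need an array for further numerical work, and item() when you only need the single score.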