Tokenizers are widely used in HuggingFace Transformers, especially tokenizer.encode(). In this tutorial, we will use the llama2 tokenizer to show you how to use it.
Load llama2 tokenizer
We can use the code below to load the llama2 tokenizer:
# -*- coding:utf-8 -*-
from transformers import LlamaForCausalLM, LlamaTokenizer

llama_path = r"D:\10_LLM\pretrained\LLM\llama2"
tokenizer = LlamaTokenizer.from_pretrained(llama_path)
print(tokenizer)
Running this code, we will get:
LlamaTokenizer(name_or_path='D:\10_LLM\pretrained\LLM\llama2', vocab_size=55296, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False)
In this code, LlamaTokenizer.from_pretrained() will load the D:\10_LLM\pretrained\LLM\llama2\tokenizer.model file to create the tokenizer.
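We can also read the properties shown above directly from the tokenizer object, for example:

# Inspect the loaded tokenizer
print(tokenizer.vocab_size)          # 55296 for this llama2 tokenizer
print(tokenizer.model_max_length)    # a very large default when no limit is configured
print(tokenizer.special_tokens_map)  # {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}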
Then, we can use this tokenizer to convert a string to ids or convert ids to a string.
Convert a string to ids
We can use the tokenizer.encode() function.
It is defined as:
encode(
    text: Union[str, List[str], List[int]],
    text_pair: Optional[Union[str, List[str], List[int]]] = None,
    add_special_tokens: bool = True,
    padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = False,
    truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False,
    max_length: Optional[int] = None,
    stride: int = 0,
    return_tensors: Optional[Union[str, transformers.file_utils.TensorType]] = None,
    **kwargs
)
https://huggingface.co/transformers/v4.8.2/main_classes/tokenizer.html
There are some important parameters we should pay attention to:
- add_special_tokens: Whether or not to encode the sequences with the special tokens relative to their model.
- padding: if set to 'max_length', pad the sequence to the maximum length specified with the max_length argument.
- max_length: Controls the maximum length to use by one of the truncation/padding parameters.
- truncation: Activates and controls truncation.
For example:
doc_list = ["Hello this is a test.", "I love python"]
for d in doc_list:
    inputs = tokenizer.encode(d, max_length=5, truncation=True)
    print(inputs)
    print(tokenizer.decode(inputs))
Running this code, we will see:
[1, 15043, 445, 338, 263]
<s>Hello this is a
[1, 306, 5360, 3017]
<s>I love python
We can see a special token <s> at the beginning of each encoded sequence.
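This <s> is the BOS (beginning-of-sequence) token, which is added because add_special_tokens defaults to True. We can confirm that id 1 corresponds to it:

print(tokenizer.bos_token)                   # <s>
print(tokenizer.convert_ids_to_tokens([1]))  # ['<s>']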
If add_special_tokens = False
doc_list = ["Hello this is a test.", "I love python"]
for d in doc_list:
    inputs = tokenizer.encode(d, max_length=5, truncation=True, add_special_tokens=False)
    print(inputs)
    print(tokenizer.decode(inputs))
We will get:
[15043, 445, 338, 263, 1243]
Hello this is a test
[306, 5360, 3017]
I love python
When add_special_tokens = False, tokenizer.encode(text) is equivalent to tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)).
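We can verify this equivalence with a small check, for example:

text = "I love python"
# Split the string into sub-word tokens, then map the tokens to ids
ids_from_tokens = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
# Encode directly, without adding <s>
ids_from_encode = tokenizer.encode(text, add_special_tokens=False)
print(ids_from_tokens)                     # [306, 5360, 3017]
print(ids_from_tokens == ids_from_encode)  # True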
If padding = 'max_length'
It is very useful when we plan to build a batch.
For example:
doc_list = ["Hello this is a test.", "I love python"]
for d in doc_list:
    inputs = tokenizer.encode(d, max_length=5, truncation=True, padding="max_length")
    print(inputs)
    print(tokenizer.decode(inputs))
Then, we will get:
[1, 15043, 445, 338, 263]
<s>Hello this is a
[1, 306, 5360, 3017, 32000]
<s> I love python<pad>
These two encoded inputs now have the same length.
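Because the padded sequences have the same length, we can stack them into one batch. For instance, we can call the tokenizer directly on a list of strings; it also returns an attention_mask that marks the <pad> positions. This sketch assumes PyTorch is installed so that return_tensors="pt" works:

batch = tokenizer(doc_list, max_length=5, truncation=True,
                  padding="max_length", return_tensors="pt")
print(batch["input_ids"])       # tensor of shape (2, 5)
print(batch["attention_mask"])  # 0 where <pad> was inserted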
Convert ids to a string
As shown in the examples above, we can use tokenizer.decode() to convert input ids back to a string.
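If we do not want <s> or <pad> to appear in the decoded text, tokenizer.decode() also accepts skip_special_tokens=True. For example:

ids = tokenizer.encode("I love python", max_length=5, truncation=True, padding="max_length")
print(tokenizer.decode(ids))                            # e.g. <s> I love python<pad>
print(tokenizer.decode(ids, skip_special_tokens=True))  # I love python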