Tokenizers are widely used in HuggingFace Transformers, especially tokenizer.encode(). In this tutorial, we will use the llama2 tokenizer to show you how to use it.
Load llama2 tokenizer
We can use the code below to load the llama2 tokenizer:
# -*- coding:utf-8 -*-
from transformers import LlamaForCausalLM, LlamaTokenizer

llama_path = r"D:\10_LLM\pretrained\LLM\llama2"
tokenizer = LlamaTokenizer.from_pretrained(llama_path)
print(tokenizer)
Running this code, we will get:
LlamaTokenizer(name_or_path='D:\10_LLM\pretrained\LLM\llama2', vocab_size=55296, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False)
In this code, LlamaTokenizer.from_pretrained() will load the D:\10_LLM\pretrained\LLM\llama2\tokenizer.model file to create the tokenizer.
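We can also read the properties shown above directly from the tokenizer object, for example:

# Inspect the loaded tokenizer
print(tokenizer.vocab_size)          # 55296 for this llama2 tokenizer
print(tokenizer.model_max_length)    # a very large default when no limit is configured
print(tokenizer.special_tokens_map)  # {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}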
Then, we can use this tokenizer to convert a string to ids or convert ids to a string.
Convert a string to ids
We can use the tokenizer.encode() function.
It is defined as:
encode(
    text: Union[str, List[str], List[int]],
    text_pair: Optional[Union[str, List[str], List[int]]] = None,
    add_special_tokens: bool = True,
    padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = False,
    truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False,
    max_length: Optional[int] = None,
    stride: int = 0,
    return_tensors: Optional[Union[str, transformers.file_utils.TensorType]] = None,
    **kwargs
)
https://huggingface.co/transformers/v4.8.2/main_classes/tokenizer.html
There are some important parameters we should pay attention to:
- add_special_tokens: Whether or not to encode the sequences with the special tokens relative to their model.
- padding: if set to 'max_length', pad the sequence to the maximum length specified with the max_length argument.
- max_length: Controls the maximum length to use by one of the truncation/padding parameters.
- truncation: Activates and controls truncation.
For example:
doc_list = ["Hello this is a test.", "I love python"]
for d in doc_list:
    inputs = tokenizer.encode(d, max_length=5, truncation=True)
    print(inputs)
    print(tokenizer.decode(inputs))
Running this code, we will see:
[1, 15043, 445, 338, 263]
<s>Hello this is a
[1, 306, 5360, 3017]
<s>I love python
We can see a special token <s> at the beginning of each encoded sequence.
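This <s> is the BOS (beginning-of-sequence) token, which is added because add_special_tokens defaults to True. We can confirm that id 1 corresponds to it:

print(tokenizer.bos_token)                   # <s>
print(tokenizer.convert_ids_to_tokens([1]))  # ['<s>']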
If add_special_tokens = False
doc_list = ["Hello this is a test.", "I love python"]
for d in doc_list:
    inputs = tokenizer.encode(d, max_length=5, truncation=True, add_special_tokens=False)
    print(inputs)
    print(tokenizer.decode(inputs))
We will get:
[15043, 445, 338, 263, 1243]
Hello this is a test
[306, 5360, 3017]
I love python
When add_special_tokens = False, tokenizer.encode(text) is equivalent to tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text)).
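We can verify this equivalence with a small check, for example:

text = "I love python"
# Split the string into sub-word tokens, then map the tokens to ids
ids_from_tokens = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
# Encode directly, without adding <s>
ids_from_encode = tokenizer.encode(text, add_special_tokens=False)
print(ids_from_tokens)                     # [306, 5360, 3017]
print(ids_from_tokens == ids_from_encode)  # True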
If padding = 'max_length'
It is very useful when we plan to build a batch.
For example:
doc_list = ["Hello this is a test.", "I love python"]
for d in doc_list:
    inputs = tokenizer.encode(d, max_length=5, truncation=True, padding="max_length")
    print(inputs)
    print(tokenizer.decode(inputs))
Then, we will get:
[1, 15043, 445, 338, 263]
<s>Hello this is a
[1, 306, 5360, 3017, 32000]
<s> I love python<pad>
These two encoded inputs now have the same length.
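Because the padded sequences have the same length, we can stack them into one batch. For instance, we can call the tokenizer directly on a list of strings; it also returns an attention_mask that marks the <pad> positions. This sketch assumes PyTorch is installed so that return_tensors="pt" works:

batch = tokenizer(doc_list, max_length=5, truncation=True,
                  padding="max_length", return_tensors="pt")
print(batch["input_ids"])       # tensor of shape (2, 5)
print(batch["attention_mask"])  # 0 where <pad> was inserted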
Convert ids to a string
As shown in the examples above, we can use tokenizer.decode() to convert input ids back to a string.
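If we do not want <s> or <pad> to appear in the decoded text, tokenizer.decode() also accepts skip_special_tokens=True. For example:

ids = tokenizer.encode("I love python", max_length=5, truncation=True, padding="max_length")
print(tokenizer.decode(ids))                            # e.g. <s> I love python<pad>
print(tokenizer.decode(ids, skip_special_tokens=True))  # I love python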