Most LLMs are decoder-only architectures, which means they are not trained to continue from pad tokens. As a result, right padding can produce wrong outputs during batch inference.
To address this issue, we should set padding_side = "left" on the tokenizer.
LLM tokenizer
For Llama 2, we can create a tokenizer as follows:
from transformers import LlamaConfig
from transformers import LlamaForCausalLM, LlamaForSequenceClassification, LlamaModel, LlamaTokenizer

model_path = r"D:\10_LLM\pretrained\LLM\llama2"

if __name__ == "__main__":
    config = LlamaConfig.from_pretrained(model_path)
    # print(config)
    tokenizer_1 = LlamaTokenizer.from_pretrained(model_path)
    model_inputs_1 = tokenizer_1(
        ["Hello word", "This is a nice day"],
        padding=True,
        return_tensors="pt",
    )
    print(model_inputs_1)
Running this code, we will see:
{'input_ids': tensor([[    1, 15043,  1734, 32000, 32000, 32000],
        [    1,   910,   338,   263,  7575,  2462]]),
 'attention_mask': tensor([[1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1]])}
There are two sentences in this example, and their lengths differ:
Sentence 1: Hello word
Sentence 2: This is a nice day
Note that padding worked without any setup here because this particular checkpoint ships a dedicated <pad> token (id 32000); the stock Llama 2 tokenizer has no pad token, which is why one is assigned explicitly in the left-padding example below.
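You can confirm which token was used for padding by inspecting the tokenizer's pad-token attributes (reusing tokenizer_1 from above; the values in the comments come from the output we just printed):

print(tokenizer_1.pad_token)      # <pad> for this checkpoint
print(tokenizer_1.pad_token_id)   # 32000, matching the padded positions above
print(tokenizer_1.padding_side)   # "right" -- the default in transformers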
Decoding the input ids back to strings:
x1 = tokenizer_1.decode(model_inputs_1["input_ids"][0])
print(x1)
x2 = tokenizer_1.decode(model_inputs_1["input_ids"][1])
print(x2)
we can see:
<s> Hello word<pad><pad><pad>
<s>This is a nice day
This is right padding: the padding tokens are appended to the right end of the shorter sentence.
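If you only want the original text back, decode accepts a skip_special_tokens flag (standard transformers behavior) that drops the <s> and <pad> symbols:

x1 = tokenizer_1.decode(model_inputs_1["input_ids"][0], skip_special_tokens=True)
print(x1)  # Hello word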
What is left padding?
To use left padding, we can use the code below:
tokenizer_2 = LlamaTokenizer.from_pretrained(model_path)
tokenizer_2.pad_token_id = tokenizer_2.eos_token_id  # Most LLMs don't have a pad token by default
tokenizer_2.padding_side = "left"
model_inputs_2 = tokenizer_2(
    ["Hello word", "This is a nice day"],
    padding=True,
    return_tensors="pt",
)
print(model_inputs_2)
Running this code, we will see:
{'input_ids': tensor([[    2,     2,     2,     1, 15043,  1734],
        [    1,   910,   338,   263,  7575,  2462]]),
 'attention_mask': tensor([[0, 0, 0, 1, 1, 1],
        [1, 1, 1, 1, 1, 1]])}
In this code, we set the padding token to tokenizer_2.eos_token, not <pad>.
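You can verify this: for Llama 2 the end-of-sequence token is </s> with id 2, which matches the 2s that pad the first row of input_ids above:

print(tokenizer_2.eos_token, tokenizer_2.eos_token_id)  # </s> 2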
We can also decode the padded strings:
x1 = tokenizer_2.decode(model_inputs_2["input_ids"][0])
print(x1)
x2 = tokenizer_2.decode(model_inputs_2["input_ids"][1])
print(x2)
The output is:
</s></s></s><s>Hello word
<s>This is a nice day
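With left padding in place, the batch is ready for generation. Below is a minimal sketch, assuming the causal-LM weights also load from model_path; max_new_tokens=20 is an arbitrary choice:

model = LlamaForCausalLM.from_pretrained(model_path)
outputs = model.generate(
    input_ids=model_inputs_2["input_ids"],
    attention_mask=model_inputs_2["attention_mask"],
    max_new_tokens=20,
    pad_token_id=tokenizer_2.pad_token_id,
)
for seq in outputs:
    print(tokenizer_2.decode(seq, skip_special_tokens=True))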
However, you can also choose a different token as the padding symbol. For example:
tokenizer_2.pad_token_id = 0  # id 0 is <unk> in the Llama 2 vocabulary
Decoding again, we will see:
<unk><unk><unk><s>Hello word
<s>This is a nice day
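A cleaner alternative is to register a real pad token with add_special_tokens; here is a sketch, assuming the same model_path. Because this may grow the vocabulary, the model's embeddings must be resized if you load the model as well:

tokenizer_3 = LlamaTokenizer.from_pretrained(model_path)
tokenizer_3.add_special_tokens({"pad_token": "<pad>"})
tokenizer_3.padding_side = "left"
# if the pad token is new to the vocabulary, the embedding matrix must grow to match:
# model.resize_token_embeddings(len(tokenizer_3))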
Why should we use left padding in LLM batch inference?
An LLM predicts the next token from the tokens that come before it, and generation always continues from the last position in the sequence. With right padding, the last token of a shorter sequence is a padding symbol, so the model is asked to continue from a token it was never trained to continue from, which leads to wrong predictions. With left padding, the last token of every sequence is a real token, so generation starts from meaningful content.
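You can see this directly in the two batches built earlier: under right padding the shorter sequence ends in a pad id, while under left padding every sequence ends in a real token (reusing model_inputs_1 and model_inputs_2):

print(model_inputs_1["input_ids"][:, -1])  # tensor([32000,  2462]) -> first row ends in <pad>
print(model_inputs_2["input_ids"][:, -1])  # tensor([1734, 2462]) -> both rows end in real tokens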