Most LLMs are decoder-only architectures, which means they are not trained to continue from pad tokens. As a result, right padding can produce wrong outputs during batch inference.
To address this issue, we should set padding_side = "left" on the tokenizer.
LLM tokenizer
For Llama 2, we can create a tokenizer as follows:
from transformers import LlamaConfig
from transformers import LlamaForCausalLM, LlamaForSequenceClassification, LlamaModel, LlamaTokenizer

model_path = r"D:\10_LLM\pretrained\LLM\llama2"

if __name__ == "__main__":
    config = LlamaConfig.from_pretrained(model_path)
    # print(config)
    tokenizer_1 = LlamaTokenizer.from_pretrained(model_path)
    model_inputs_1 = tokenizer_1(
        ["Hello word", "This is a nice day"],
        padding=True,
        return_tensors="pt",
    )
    print(model_inputs_1)
Running this code, we will see:
{'input_ids': tensor([[    1, 15043,  1734, 32000, 32000, 32000],
        [    1,   910,   338,   263,  7575,  2462]]),
 'attention_mask': tensor([[1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1]])}
There are two sentences in this example, and their lengths differ:
Sentence 1: Hello word
Sentence 2: This is a nice day
Note that padding worked without any setup here because this particular checkpoint ships a dedicated <pad> token (id 32000); the stock Llama 2 tokenizer has no pad token, which is why one is assigned explicitly in the left-padding example below.
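You can confirm which token was used for padding by inspecting the tokenizer's pad-token attributes (reusing tokenizer_1 from above; the values in the comments come from the output we just printed):

print(tokenizer_1.pad_token)      # <pad> for this checkpoint
print(tokenizer_1.pad_token_id)   # 32000, matching the padded positions above
print(tokenizer_1.padding_side)   # "right" -- the default in transformers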
Decoding the input ids back to strings:
x1 = tokenizer_1.decode(model_inputs_1["input_ids"][0])
print(x1)
x2 = tokenizer_1.decode(model_inputs_1["input_ids"][1])
print(x2)
we can see:
<s> Hello word<pad><pad><pad>
<s>This is a nice day
This is right padding: the padding tokens are appended to the right end of the shorter sentence.
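If you only want the original text back, decode accepts a skip_special_tokens flag (standard transformers behavior) that drops the <s> and <pad> symbols:

x1 = tokenizer_1.decode(model_inputs_1["input_ids"][0], skip_special_tokens=True)
print(x1)  # Hello word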
What is left padding?
To use left padding, we can use the code below:
tokenizer_2 = LlamaTokenizer.from_pretrained(model_path)
tokenizer_2.pad_token_id = tokenizer_2.eos_token_id  # Most LLMs don't have a pad token by default
tokenizer_2.padding_side = "left"
model_inputs_2 = tokenizer_2(
    ["Hello word", "This is a nice day"],
    padding=True,
    return_tensors="pt",
)
print(model_inputs_2)
Running this code, we will see:
{'input_ids': tensor([[    2,     2,     2,     1, 15043,  1734],
        [    1,   910,   338,   263,  7575,  2462]]),
 'attention_mask': tensor([[0, 0, 0, 1, 1, 1],
        [1, 1, 1, 1, 1, 1]])}
In this code, we set the padding token to tokenizer_2.eos_token, not <pad>.
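You can verify this: for Llama 2 the end-of-sequence token is </s> with id 2, which matches the 2s that pad the first row of input_ids above:

print(tokenizer_2.eos_token, tokenizer_2.eos_token_id)  # </s> 2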
We can also decode the padded strings:
x1 = tokenizer_2.decode(model_inputs_2["input_ids"][0])
print(x1)
x2 = tokenizer_2.decode(model_inputs_2["input_ids"][1])
print(x2)
The output is:
</s></s></s><s>Hello word
<s>This is a nice day
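With left padding in place, the batch is ready for generation. Below is a minimal sketch, assuming the causal-LM weights also load from model_path; max_new_tokens=20 is an arbitrary choice:

model = LlamaForCausalLM.from_pretrained(model_path)
outputs = model.generate(
    input_ids=model_inputs_2["input_ids"],
    attention_mask=model_inputs_2["attention_mask"],
    max_new_tokens=20,
    pad_token_id=tokenizer_2.pad_token_id,
)
for seq in outputs:
    print(tokenizer_2.decode(seq, skip_special_tokens=True))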
However, you can also choose a different token as the padding symbol. For example:
tokenizer_2.pad_token_id = 0  # id 0 is <unk> in the Llama 2 vocabulary
Decoding again, we will see:
<unk><unk><unk><s>Hello word
<s>This is a nice day
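A cleaner alternative is to register a real pad token with add_special_tokens; here is a sketch, assuming the same model_path. Because this may grow the vocabulary, the model's embeddings must be resized if you load the model as well:

tokenizer_3 = LlamaTokenizer.from_pretrained(model_path)
tokenizer_3.add_special_tokens({"pad_token": "<pad>"})
tokenizer_3.padding_side = "left"
# if the pad token is new to the vocabulary, the embedding matrix must grow to match:
# model.resize_token_embeddings(len(tokenizer_3))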
Why should we use left padding in LLM batch inference?
An LLM predicts the next token from the tokens that come before it, and generation always continues from the last position in the sequence. With right padding, the last token of a shorter sequence is a padding symbol, so the model is asked to continue from a token it was never trained to continue from, which leads to wrong predictions. With left padding, the last token of every sequence is a real token, so generation starts from meaningful content.
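You can see this directly in the two batches built earlier: under right padding the shorter sequence ends in a pad id, while under left padding every sequence ends in a real token (reusing model_inputs_1 and model_inputs_2):

print(model_inputs_1["input_ids"][:, -1])  # tensor([32000,  2462]) -> first row ends in <pad>
print(model_inputs_2["input_ids"][:, -1])  # tensor([1734, 2462]) -> both rows end in real tokens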