There are three bert inputs: input_ids, input_mask and segment_ids. In this tutorial, we will introduce how to create them for bert beginners.
How to use bert?
We often use the bert model as follows:
import tensorflow as tf
import modeling  # modeling.py from the bert source code

# create inputs
input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')

# init bert model (bert_config and init_checkpoint come from the downloaded bert checkpoint)
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

# load bert model
tvars = tf.trainable_variables()
(assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment)

# get bert output
encoder_last_layer = model.get_sequence_output()
In this code, there are three inputs: input_ids, input_mask and segment_ids. How do we create them?
From the code above, we can see that the shape of input_ids, input_mask and segment_ids is [None, None], which means batch_size * max_sequence_length.
We can also find this in the bert source code (modeling.py), where input_ids is expected to be an int32 Tensor of shape [batch_size, seq_length].
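For example, after building the graph above, we can feed a padded batch into the three placeholders with a feed_dict. Below is a minimal sketch: the token ids are made-up numbers, only to show the expected shapes (batch_size = 2, max_sequence_length = 5).

with tf.Session() as sess:
    # init_from_checkpoint has rewritten the initializers, so this
    # also loads the pretrained bert weights
    sess.run(tf.global_variables_initializer())
    feed = {
        input_ids:   [[101, 7, 8, 9, 102], [101, 7, 8, 102, 0]],
        input_mask:  [[1, 1, 1, 1, 1],     [1, 1, 1, 1, 0]],
        segment_ids: [[0, 0, 0, 0, 0],     [0, 0, 0, 0, 0]],
    }
    output = sess.run(encoder_last_layer, feed_dict=feed)
    print(output.shape)  # (2, 5, hidden_size)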
How to create input_ids, input_mask and segment_ids?
Notice that segment_ids and token_type_ids are the same thing in this tutorial.
As to a single sentence
Suppose the maximum sequence length is 10 and you plan to input a single sentence to bert.
The sentence: I hate this weather, length = 4
You should add [CLS] and [SEP] to this sentence as follows:
The sentence: [CLS] I hate this weather [SEP], length = 6.
The inputs of bert can be:
tokens:      [CLS] I hate this weather [SEP] + 4 padding positions
input_ids:   the ids of the 6 tokens above, followed by four 0s for padding
input_mask:  1 1 1 1 1 1 0 0 0 0
segment_ids: 0 0 0 0 0 0 0 0 0 0
Here is a source code example:
def get_bert_input(text, tokenizer, max_len=512):
    cls_token = '[CLS]'
    sep_token = '[SEP]'
    word_piece_list = tokenizer.tokenize(text)
    # truncate so that [CLS] and [SEP] still fit into max_len
    if len(word_piece_list) > max_len - 2:
        word_piece_list = word_piece_list[:max_len - 2]
    word_piece_list.insert(0, cls_token)
    word_piece_list.append(sep_token)
    input_ids = tokenizer.convert_tokens_to_ids(word_piece_list)
    # 1 for real tokens, 0 for padding
    input_mask = [1] * len(input_ids)
    # pad input_ids and input_mask to max_len with 0
    while len(input_ids) < max_len:
        input_ids.append(0)
        input_mask.append(0)
    # single sentence, so all segment ids are 0
    segment_ids = [0] * max_len
    return input_ids, input_mask, segment_ids
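A possible way to call this function (a sketch assuming a vocab.txt file and tokenization.py from the bert source code; the expected outputs assume the sentence is split into exactly 4 word pieces):

import tokenization  # tokenization.py from the bert source code

tokenizer = tokenization.FullTokenizer(vocab_file='vocab.txt', do_lower_case=True)
input_ids, input_mask, segment_ids = get_bert_input('I hate this weather', tokenizer, max_len=10)
print(input_mask)    # expected: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(segment_ids)   # expected: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]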
As to two sentences
For example, suppose the maximum sequence length is 12 and there are two sentences:
I hate this weather. and you ?
We will get this result:
tokens:      [CLS] I hate this weather [SEP] and you [SEP] + 3 padding positions
input_mask:  1 1 1 1 1 1 1 1 1 0 0 0
segment_ids: 0 0 0 0 0 0 1 1 1 0 0 0
Here we have removed the symbols . and ?. Of course, you can also keep them.
From the above, we can see that we should add [CLS] at the beginning of the first sentence and add a [SEP] at the end of each sentence.
As to segment_ids, its value is 0 for the first sentence and 1 for the second sentence.
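Based on these rules, here is a hypothetical helper for a sentence pair (it is not from the bert source code, and truncation is omitted for brevity):

def get_bert_pair_input(text_a, text_b, tokenizer, max_len=12):
    tokens_a = tokenizer.tokenize(text_a)
    tokens_b = tokenizer.tokenize(text_b)
    # [CLS] + first sentence + [SEP], all with segment id 0
    tokens = ['[CLS]'] + tokens_a + ['[SEP]']
    segment_ids = [0] * len(tokens)
    # second sentence + [SEP], all with segment id 1
    tokens += tokens_b + ['[SEP]']
    segment_ids += [1] * (len(tokens_b) + 1)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    # pad input_ids, input_mask and segment_ids to max_len with 0
    while len(input_ids) < max_len:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    return input_ids, input_mask, segment_ids

For the two sentences above (with the . and ? removed) and max_len = 12, this returns segment_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0], which matches the result shown earlier.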
As to three sentences
The conventional bert model only supports one or two sentences. However, how should we process a sequence that contains more than two sentences?
In the paper Text Summarization with Pretrained Encoders, we can find a solution. In the example given in that paper:
segment_ids = 0 in the first sentence
segment_ids = 1 in the second sentence
segment_ids = 0 in the third sentence
We can guess:
segment_ids = 1 in the fourth sentence
In other words, segment_ids alternates between 0 and 1 from sentence to sentence.
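Following this alternating pattern, here is a hypothetical sketch for more than two sentences (it keeps a single [CLS] at the beginning and alternates the segment id for each sentence; truncation is omitted):

def get_multi_sentence_input(sentences, tokenizer, max_len=512):
    tokens = ['[CLS]']
    segment_ids = [0]
    for i, sentence in enumerate(sentences):
        sentence_tokens = tokenizer.tokenize(sentence) + ['[SEP]']
        tokens += sentence_tokens
        # even-indexed sentences get segment id 0, odd-indexed sentences get 1
        segment_ids += [i % 2] * len(sentence_tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    # pad everything to max_len with 0
    while len(input_ids) < max_len:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    return input_ids, input_mask, segment_ids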
You can find more details here:
https://github.com/yao8839836/kg-bert/blob/master/run_bert_triple_classifier.py