There are three bert inputs: input_ids, input_mask and segment_ids. In this tutorial, we will introduce how to create them for bert beginners.
How to use bert?
We often use the bert model as follows:
import tensorflow as tf
import modeling  # modeling.py from the bert source code

# create inputs
input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')

# init bert model (bert_config and init_checkpoint come from the downloaded bert checkpoint)
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

# load bert model
tvars = tf.trainable_variables()
(assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment)

# get bert output
encoder_last_layer = model.get_sequence_output()
In this code, there are three inputs: input_ids, input_mask and segment_ids. How do we create them?
From the code above, we can see that the shape of input_ids, input_mask and segment_ids is [None, None], which means batch_size * max_sequence_length.
We can also find this in the bert source code (modeling.py), where input_ids is expected to be an int32 Tensor of shape [batch_size, seq_length].
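For example, after building the graph above, we can feed a padded batch into the three placeholders with a feed_dict. Below is a minimal sketch: the token ids are made-up numbers, only to show the expected shapes (batch_size = 2, max_sequence_length = 5).

with tf.Session() as sess:
    # init_from_checkpoint has rewritten the initializers, so this
    # also loads the pretrained bert weights
    sess.run(tf.global_variables_initializer())
    feed = {
        input_ids:   [[101, 7, 8, 9, 102], [101, 7, 8, 102, 0]],
        input_mask:  [[1, 1, 1, 1, 1],     [1, 1, 1, 1, 0]],
        segment_ids: [[0, 0, 0, 0, 0],     [0, 0, 0, 0, 0]],
    }
    output = sess.run(encoder_last_layer, feed_dict=feed)
    print(output.shape)  # (2, 5, hidden_size)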
How to create input_ids, input_mask and segment_ids?
Notice that segment_ids and token_type_ids are the same thing in this tutorial.
As to a single sentence
Suppose the maximum sequence length is 10 and you plan to input a single sentence to bert.
The sentence: I hate this weather, length = 4
You should add [CLS] and [SEP] to this sentence as follows:
The sentence: [CLS] I hate this weather [SEP], length = 6.
The inputs of bert can be:
tokens:      [CLS] I hate this weather [SEP] + 4 padding positions
input_ids:   the ids of the 6 tokens above, followed by four 0s for padding
input_mask:  1 1 1 1 1 1 0 0 0 0
segment_ids: 0 0 0 0 0 0 0 0 0 0
Here is a source code example:
def get_bert_input(text, tokenizer, max_len=512):
    cls_token = '[CLS]'
    sep_token = '[SEP]'
    word_piece_list = tokenizer.tokenize(text)
    # truncate so that [CLS] and [SEP] still fit into max_len
    if len(word_piece_list) > max_len - 2:
        word_piece_list = word_piece_list[:max_len - 2]
    word_piece_list.insert(0, cls_token)
    word_piece_list.append(sep_token)
    input_ids = tokenizer.convert_tokens_to_ids(word_piece_list)
    # 1 for real tokens, 0 for padding
    input_mask = [1] * len(input_ids)
    # pad input_ids and input_mask to max_len with 0
    while len(input_ids) < max_len:
        input_ids.append(0)
        input_mask.append(0)
    # single sentence, so all segment ids are 0
    segment_ids = [0] * max_len
    return input_ids, input_mask, segment_ids
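A possible way to call this function (a sketch assuming a vocab.txt file and tokenization.py from the bert source code; the expected outputs assume the sentence is split into exactly 4 word pieces):

import tokenization  # tokenization.py from the bert source code

tokenizer = tokenization.FullTokenizer(vocab_file='vocab.txt', do_lower_case=True)
input_ids, input_mask, segment_ids = get_bert_input('I hate this weather', tokenizer, max_len=10)
print(input_mask)    # expected: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(segment_ids)   # expected: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]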
As to two sentences
For example, suppose the maximum sequence length is 12 and there are two sentences:
I hate this weather. and you ?
We will get this result:
tokens:      [CLS] I hate this weather [SEP] and you [SEP] + 3 padding positions
input_mask:  1 1 1 1 1 1 1 1 1 0 0 0
segment_ids: 0 0 0 0 0 0 1 1 1 0 0 0
Here we have removed the symbols . and ?. Of course, you can also keep them.
From the above, we can see that we should add [CLS] at the beginning of the first sentence and add a [SEP] at the end of each sentence.
As to segment_ids, its value is 0 for the first sentence and 1 for the second sentence.
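Based on these rules, here is a hypothetical helper for a sentence pair (it is not from the bert source code, and truncation is omitted for brevity):

def get_bert_pair_input(text_a, text_b, tokenizer, max_len=12):
    tokens_a = tokenizer.tokenize(text_a)
    tokens_b = tokenizer.tokenize(text_b)
    # [CLS] + first sentence + [SEP], all with segment id 0
    tokens = ['[CLS]'] + tokens_a + ['[SEP]']
    segment_ids = [0] * len(tokens)
    # second sentence + [SEP], all with segment id 1
    tokens += tokens_b + ['[SEP]']
    segment_ids += [1] * (len(tokens_b) + 1)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    # pad input_ids, input_mask and segment_ids to max_len with 0
    while len(input_ids) < max_len:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    return input_ids, input_mask, segment_ids

For the two sentences above (with the . and ? removed) and max_len = 12, this returns segment_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0], which matches the result shown earlier.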
As to three sentences
The conventional bert model only supports one or two sentences. However, how should we process a sequence that contains more than two sentences?
In the paper Text Summarization with Pretrained Encoders, we can find a solution. In the example given in that paper:
segment_ids = 0 in the first sentence
segment_ids = 1 in the second sentence
segment_ids = 0 in the third sentence
We can guess:
segment_ids = 1 in the fourth sentence
In other words, segment_ids alternates between 0 and 1 from sentence to sentence.
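Following this alternating pattern, here is a hypothetical sketch for more than two sentences (it keeps a single [CLS] at the beginning and alternates the segment id for each sentence; truncation is omitted):

def get_multi_sentence_input(sentences, tokenizer, max_len=512):
    tokens = ['[CLS]']
    segment_ids = [0]
    for i, sentence in enumerate(sentences):
        sentence_tokens = tokenizer.tokenize(sentence) + ['[SEP]']
        tokens += sentence_tokens
        # even-indexed sentences get segment id 0, odd-indexed sentences get 1
        segment_ids += [i % 2] * len(sentence_tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    # pad everything to max_len with 0
    while len(input_ids) < max_len:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    return input_ids, input_mask, segment_ids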
You can find more details here:
https://github.com/yao8839836/kg-bert/blob/master/run_bert_triple_classifier.py