When we get the output of a BERT model, we may get a BaseModelOutputWithPoolingAndCrossAttentions object. In this tutorial, we will discuss it.
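Here model is a transformers BertModel and encoded_input comes from its matching tokenizer. A minimal setup sketch, assuming the checkpoint hfl/chinese-lert-large from the config.json shown at the end of this tutorial (the two input sentences are placeholders):

import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the BERT model
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-lert-large")
model = AutoModel.from_pretrained("hfl/chinese-lert-large")
model.eval()

# A batch of two sentences (placeholders), padded to the same length
sentences = ["This is the first example sentence.", "This is the second one."]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")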
For example:
with torch.no_grad():
    model_output = model(**encoded_input)

print(model_output)
print("model_output.last_hidden_state shape = ", model_output.last_hidden_state.shape)
print("model_output.pooler_output shape = ", model_output.pooler_output.shape)
We may get:
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1958, -0.2254,  0.5879,  ...,  0.5189, -0.9704,  0.7224],
         [-0.5836, -0.6851,  0.3376,  ...,  0.7085, -0.5533,  0.2590],
         [-0.1957, -0.2260,  0.5873,  ...,  0.5188, -0.9710,  0.7216],
         ...,
         [-0.1958, -0.2260,  0.5873,  ...,  0.5188, -0.9710,  0.7216],
         [-1.0514, -0.4288,  0.8458,  ...,  1.1722, -0.6951,  0.8225],
         [-0.1958, -0.2254,  0.5879,  ...,  0.5189, -0.9704,  0.7224]],

        [[-0.5236,  0.2747,  0.7207,  ...,  0.7099, -0.6590,  0.6492],
         [-0.9260,  0.0429, -0.1059,  ...,  1.0130,  0.2954,  0.5721],
         [-0.6988,  0.3200,  0.4998,  ...,  1.3675, -0.5426,  0.1605],
         ...,
         [-0.5236,  0.2747,  0.7207,  ...,  0.7099, -0.6590,  0.6492],
         [-0.1056, -0.1332, -0.0261,  ...,  1.3496, -0.6363,  0.5059],
         [-0.0954, -0.1176, -0.0697,  ...,  1.3522, -0.6045,  0.5295]]]), pooler_output=tensor([[-0.7585, -0.1595,  0.4985,  ..., -0.2657, -0.0202,  0.3537],
        [-0.7301, -0.5412,  0.3729,  ..., -0.1573, -0.1320,  0.3476]]), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)
model_output.last_hidden_state shape =  torch.Size([2, 13, 1024])
model_output.pooler_output shape =  torch.Size([2, 1024])
We can see that model(**encoded_input) returns a BaseModelOutputWithPoolingAndCrossAttentions object.
BaseModelOutputWithPoolingAndCrossAttentions
It is defined here: https://huggingface.co/docs/transformers/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithCrossAttentions
It contains six variables.
last_hidden_state: torch.FloatTensor = None
pooler_output: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
cross_attentions: Optional[Tuple[torch.FloatTensor]] = None
last_hidden_state and pooler_output are the most important.
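The other fields (hidden_states, past_key_values, attentions, cross_attentions) are None by default. A minimal sketch of how hidden_states and attentions could be requested per call, assuming the same model and encoded_input as above:

with torch.no_grad():
    model_output = model(**encoded_input, output_hidden_states=True, output_attentions=True)

# hidden_states: tuple of (num_hidden_layers + 1) tensors,
# each of shape (batch_size, sequence_length, hidden_size)
print(len(model_output.hidden_states))

# attentions: tuple of num_hidden_layers tensors,
# each of shape (batch_size, num_attention_heads, sequence_length, sequence_length)
print(len(model_output.attentions))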
last_hidden_state: (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output: (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
In other words, pooler_output is the output of the [CLS] token after it has passed through the pooler (a linear layer followed by tanh).
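To check this, pooler_output can be reproduced from last_hidden_state by hand. A minimal sketch, assuming model is a standard transformers BertModel, whose pooler is a single linear layer followed by tanh:

with torch.no_grad():
    # Hidden state of the first token ([CLS]) of every sequence: (batch_size, hidden_size)
    cls_hidden = model_output.last_hidden_state[:, 0]
    # Apply the model's pooler: linear layer + tanh
    manual_pooler_output = torch.tanh(model.pooler.dense(cls_hidden))

print(torch.allclose(manual_pooler_output, model_output.pooler_output, atol=1e-6))  # expected: True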
In this example, we can see:
model_output.last_hidden_state shape =  torch.Size([2, 13, 1024])
model_output.pooler_output shape =  torch.Size([2, 1024])
Here batch_size = 2 and sequence_length = 13. hidden_size is defined in the model's config.json file. For example:
The content may be:
{ "_name_or_path": "hfl/chinese-lert-large", "architectures": [ "BertModel" ], "attention_probs_dropout_prob": 0.1, "classifier_dropout": null, "directionality": "bidi", "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 0, "pooler_fc_size": 1024, "pooler_num_attention_heads": 16, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "position_embedding_type": "absolute", "torch_dtype": "float32", "transformers_version": "4.26.1", "type_vocab_size": 2, "use_cache": true, "vocab_size": 21128 }