transformers.get_cosine_schedule_with_warmup() creates a learning rate schedule with a warmup period, during which the learning rate increases linearly from 0 to the initial lr set in the optimizer, followed by a decay from that initial lr to 0 following the values of the cosine function. In this tutorial, we will use an example to show you how to use this function.
Syntax
transformers.get_cosine_schedule_with_warmup() is defined as:
transformers.get_cosine_schedule_with_warmup(optimizer: torch.optim.optimizer.Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1)
Here are some important parameters.
optimizer: the PyTorch optimizer, such as Adam, AdamW, SGD, etc.
num_warmup_steps: the number of steps in the warmup phase. Note that this is a number of training steps (batches), not epochs.
num_training_steps: the total number of training steps. It is determined by the training set size, the batch size and the number of epochs.
num_cycles: the number of waves in the cosine schedule. The default is 0.5, which simply decreases the learning rate from the max value to 0 following a half-cosine.
last_epoch: the index of the last epoch when resuming training. The default is -1.
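Internally, get_cosine_schedule_with_warmup() returns a torch.optim.lr_scheduler.LambdaLR that multiplies the initial lr by a step-dependent factor. Here is a rough sketch of that factor, based on the documented behavior of the schedule (a simplified illustration, not a copy of the library source):

import math

def lr_multiplier(current_step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    # Warmup phase: grow linearly from 0 to 1.
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    # Cosine phase: decay from 1 toward 0 following the cosine curve;
    # num_cycles controls how many waves are completed.
    progress = (current_step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

The learning rate at step t is then the initial lr multiplied by lr_multiplier(t, ...).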
How to use it?
Here we will create an example to show the effect of transformers.get_cosine_schedule_with_warmup().
import transformers
import torch

if __name__ == "__main__":
    from matplotlib import pyplot as plt

    lr_list = []
    # A dummy "model": a single trainable parameter is enough to drive the optimizer.
    model = [torch.nn.Parameter(torch.randn(2, 2, requires_grad=True))]
    LR = 0.001
    max_epoch = 10
    train_dataset_len = 1000
    batch_size = 64
    train_iters_per_epoch = train_dataset_len // batch_size
    num_training_steps = max_epoch * train_iters_per_epoch
    num_warmup_steps = int(num_training_steps * 0.1)

    optimizer = torch.optim.Adam(model, lr=LR, weight_decay=2e-5)
    scheduler = transformers.get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5
    )

    for epoch in range(max_epoch):
        for i in range(train_iters_per_epoch):
            optimizer.zero_grad()
            optimizer.step()
            # Record the current learning rate, then advance the schedule once per batch.
            lr_list.append(optimizer.state_dict()['param_groups'][0]['lr'])
            scheduler.step()

    plt.plot(range(max_epoch * train_iters_per_epoch), lr_list, color='r')
    plt.show()
Here are some important variables.
max_epoch: the total number of epochs you plan to train.
train_dataset_len: the size of the training set. For example, 1,000 may mean 1,000 sentences.
batch_size = 64.
We can calculate num_warmup_steps and num_training_steps as follows:
train_iters_per_epoch = train_dataset_len // batch_size
num_training_steps = max_epoch * train_iters_per_epoch
num_warmup_steps = int(num_training_steps * 0.1)
We should notice: num_warmup_steps = int(num_training_steps * 0.1). Using about 10% of the total training steps for warmup is a commonly used ratio.
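With the values in this example, the calculation works out as follows:

train_iters_per_epoch = 1000 // 64    # 15 iterations per epoch
num_training_steps = 10 * 15          # 150 training steps in total
num_warmup_steps = int(150 * 0.1)     # 15 warmup steps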
Moreover, we should notice that scheduler.step() should be called once per batch (per training step), not once per epoch.
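As a minimal sketch (reusing the variable names from the example above), the correct placement of scheduler.step() is inside the batch loop:

for epoch in range(max_epoch):
    for i in range(train_iters_per_epoch):
        optimizer.zero_grad()
        # loss.backward() would go here in real training
        optimizer.step()
        scheduler.step()  # once per batch, so the schedule advances by training step
    # Do NOT call scheduler.step() here (once per epoch): the warmup and cosine
    # phases are defined in steps, so the schedule would advance far too slowly.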
Run this code and we will see the learning rate curve: it increases linearly from 0 to 0.001 during the 15 warmup steps, then decreases to 0 following a half-cosine over the remaining steps.
If num_cycles = 2, the learning rate will instead complete two full cosine waves after the warmup phase, repeatedly falling to 0 and rising back toward the initial lr.
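To try this, only the scheduler construction in the example needs to change:

scheduler = transformers.get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps, num_training_steps, num_cycles=2
)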
However, there are some other methods that can create a cosine warm-up scheduler. They are:
Implement Warm-up Scheduler in Pytorch – Pytorch Example
Implement Cosine Annealing with Warm up in PyTorch – PyTorch Tutorial