transformers.get_cosine_schedule_with_warmup() creates a learning rate schedule with a warmup period, during which the learning rate increases linearly from 0 to the initial lr set in the optimizer, followed by a decay from that initial lr to 0 following the values of the cosine function. In this tutorial, we will use an example to show you how to use this function.
Syntax
transformers.get_cosine_schedule_with_warmup() is defined as:
transformers.get_cosine_schedule_with_warmup(optimizer: torch.optim.optimizer.Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1)
Here are some important parameters.
optimizer: the PyTorch optimizer, such as Adam, AdamW, SGD, etc.
num_warmup_steps: the number of steps in the warmup phase. Note that this is a number of training steps (batches), not epochs.
num_training_steps: the total number of training steps. It is determined by the training set size, the batch size and the number of epochs.
num_cycles: the number of waves in the cosine schedule. The default is 0.5, which simply decreases the learning rate from the max value to 0 following a half-cosine.
last_epoch: the index of the last epoch when resuming training. The default is -1.
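Internally, get_cosine_schedule_with_warmup() returns a torch.optim.lr_scheduler.LambdaLR that multiplies the initial lr by a step-dependent factor. Here is a rough sketch of that factor, based on the documented behavior of the schedule (a simplified illustration, not a copy of the library source):

import math

def lr_multiplier(current_step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    # Warmup phase: grow linearly from 0 to 1.
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    # Cosine phase: decay from 1 toward 0 following the cosine curve;
    # num_cycles controls how many waves are completed.
    progress = (current_step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

The learning rate at step t is then the initial lr multiplied by lr_multiplier(t, ...).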
How to use it?
Here we will create an example to show the effect of transformers.get_cosine_schedule_with_warmup().
import transformers
import torch

if __name__ == "__main__":
    from matplotlib import pyplot as plt

    lr_list = []
    # A dummy "model": a single trainable parameter is enough to drive the optimizer.
    model = [torch.nn.Parameter(torch.randn(2, 2, requires_grad=True))]
    LR = 0.001
    max_epoch = 10
    train_dataset_len = 1000
    batch_size = 64
    train_iters_per_epoch = train_dataset_len // batch_size
    num_training_steps = max_epoch * train_iters_per_epoch
    num_warmup_steps = int(num_training_steps * 0.1)

    optimizer = torch.optim.Adam(model, lr=LR, weight_decay=2e-5)
    scheduler = transformers.get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5
    )

    for epoch in range(max_epoch):
        for i in range(train_iters_per_epoch):
            optimizer.zero_grad()
            optimizer.step()
            # Record the current learning rate, then advance the schedule once per batch.
            lr_list.append(optimizer.state_dict()['param_groups'][0]['lr'])
            scheduler.step()

    plt.plot(range(max_epoch * train_iters_per_epoch), lr_list, color='r')
    plt.show()
Here are some important variables.
max_epoch: the total number of epochs you plan to train.
train_dataset_len: the size of the training set. For example, 1,000 may mean 1,000 sentences.
batch_size = 64.
We can calculate num_warmup_steps and num_training_steps as follows:
train_iters_per_epoch = train_dataset_len // batch_size
num_training_steps = max_epoch * train_iters_per_epoch
num_warmup_steps = int(num_training_steps * 0.1)
We should notice: num_warmup_steps = int(num_training_steps * 0.1). Using about 10% of the total training steps for warmup is a commonly used ratio.
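With the values in this example, the calculation works out as follows:

train_iters_per_epoch = 1000 // 64    # 15 iterations per epoch
num_training_steps = 10 * 15          # 150 training steps in total
num_warmup_steps = int(150 * 0.1)     # 15 warmup steps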
Moreover, we should notice that scheduler.step() should be called once per batch (per training step), not once per epoch.
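As a minimal sketch (reusing the variable names from the example above), the correct placement of scheduler.step() is inside the batch loop:

for epoch in range(max_epoch):
    for i in range(train_iters_per_epoch):
        optimizer.zero_grad()
        # loss.backward() would go here in real training
        optimizer.step()
        scheduler.step()  # once per batch, so the schedule advances by training step
    # Do NOT call scheduler.step() here (once per epoch): the warmup and cosine
    # phases are defined in steps, so the schedule would advance far too slowly.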
Run this code and we will see the learning rate curve: it increases linearly from 0 to 0.001 during the 15 warmup steps, then decreases to 0 following a half-cosine over the remaining steps.
If num_cycles = 2, the learning rate will instead complete two full cosine waves after the warmup phase, repeatedly falling to 0 and rising back toward the initial lr.
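To try this, only the scheduler construction in the example needs to change:

scheduler = transformers.get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps, num_training_steps, num_cycles=2
)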
However, there are some other methods that can create a cosine warm-up scheduler. They are:
Implement Warm-up Scheduler in Pytorch – Pytorch Example
Implement Cosine Annealing with Warm up in PyTorch – PyTorch Tutorial