Fix RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one – PyTorch Tutorial

June 7, 2023

When using torch.nn.parallel.DistributedDataParallel(), we may get this error: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. In this tutorial, we will show you how to fix it.

The key is to make sure all forward() outputs participate in calculating the loss. DDP registers a gradient reduction hook on every parameter; if some parameters only produce outputs that never feed the loss, they receive no gradients, their reduction never finishes, and DDP raises this error on the next iteration.

How to fix this error?

There are two methods to fix it.

Method 1: use find_unused_parameters=True

For example:

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[self.local_rank], broadcast_buffers=False, find_unused_parameters=True)

This RuntimeError will then be fixed. Note that find_unused_parameters=True makes DDP traverse the autograd graph after every forward pass to detect parameters that did not contribute to the loss, which adds some overhead per iteration.
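As a fuller sketch, here is how this flag fits into a typical DDP setup launched with torchrun. MyModel is a placeholder for your own model, and the NCCL backend is an assumption for GPU training:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK for each spawned process.
dist.init_process_group(backend="nccl")  # assumes GPUs; use "gloo" on CPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)  # MyModel is a placeholder for your model
model = DDP(
    model,
    device_ids=[local_rank],
    broadcast_buffers=False,
    # Let DDP detect parameters that did not contribute to the loss
    # in this iteration instead of waiting for their gradients forever.
    find_unused_parameters=True,
)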

Method 2: remove all model forward() outputs that are not used when calculating loss

For example:

model = nn.parallel.DistributedDataParallel(model, device_ids=[self.local_rank], broadcast_buffers=False)

y_pred, y_tgt = model(x)
loss = F.cross_entropy(y_pred, y_true)  # y_true: ground-truth labels

In this example code, model forward() returns two values: y_pred and y_tgt.

However, only y_pred is used when computing the cross entropy loss; y_tgt is not used, so any parameters that helped compute it receive no gradients.

As a result, this RuntimeError will occur.

In order to fix this error, we should change the model so that forward() does not compute or return y_tgt. Note that simply dropping y_tgt from the return statement is not enough: if forward() still runs the layers that produce it, those parameters remain unused and the error persists, so the unused branch itself must be removed.
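Here is a minimal, self-contained sketch of the fix. MyModel, its layer sizes, and the variable names are hypothetical; the point is that forward() only computes and returns outputs that feed the loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    """Hypothetical model: forward() only produces outputs used in the loss."""
    def __init__(self, in_dim=128, num_classes=10):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 64)
        self.classifier = nn.Linear(64, num_classes)
        # The auxiliary head that produced y_tgt has been removed entirely.
        # If it stayed in forward() without feeding the loss, its parameters
        # would receive no gradients and DDP would raise this error again.

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.classifier(h)  # return only y_pred

model = MyModel()
x = torch.randn(4, 128)
y_true = torch.randint(0, 10, (4,))
y_pred = model(x)
loss = F.cross_entropy(y_pred, y_true)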