Fix RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one – PyTorch Tutorial

June 7, 2023

When using torch.nn.parallel.DistributedDataParallel(), we may get this error: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. In this tutorial, we will show you how to fix it.

The key is to make sure all forward() outputs participate in calculating the loss. DDP registers a gradient reduction hook on every parameter; if some parameters only produce outputs that never feed the loss, they receive no gradients, their reduction never finishes, and DDP raises this error on the next iteration.

How to fix this error?

There are two methods to fix it.

Method 1: use find_unused_parameters=True

For example:

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[self.local_rank], broadcast_buffers=False, find_unused_parameters=True)

This RuntimeError will then be fixed. Note that find_unused_parameters=True makes DDP traverse the autograd graph after every forward pass to detect parameters that did not contribute to the loss, which adds some overhead per iteration.
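As a fuller sketch, here is how this flag fits into a typical DDP setup launched with torchrun. MyModel is a placeholder for your own model, and the NCCL backend is an assumption for GPU training:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK for each spawned process.
dist.init_process_group(backend="nccl")  # assumes GPUs; use "gloo" on CPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)  # MyModel is a placeholder for your model
model = DDP(
    model,
    device_ids=[local_rank],
    broadcast_buffers=False,
    # Let DDP detect parameters that did not contribute to the loss
    # in this iteration instead of waiting for their gradients forever.
    find_unused_parameters=True,
)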

Method 2: remove all model forward() outputs that are not used when calculating loss

For example:

model = nn.parallel.DistributedDataParallel(model, device_ids=[self.local_rank], broadcast_buffers=False)

y_pred, y_tgt = model(x)
loss = F.cross_entropy(y_pred, y_true)  # y_true: ground-truth labels

In this example code, model forward() returns two values: y_pred and y_tgt.

However, only y_pred is used when computing the cross entropy loss; y_tgt is not used, so any parameters that helped compute it receive no gradients.

As a result, this RuntimeError will occur.

In order to fix this error, we should change the model so that forward() does not compute or return y_tgt. Note that simply dropping y_tgt from the return statement is not enough: if forward() still runs the layers that produce it, those parameters remain unused and the error persists, so the unused branch itself must be removed.
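Here is a minimal, self-contained sketch of the fix. MyModel, its layer sizes, and the variable names are hypothetical; the point is that forward() only computes and returns outputs that feed the loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    """Hypothetical model: forward() only produces outputs used in the loss."""
    def __init__(self, in_dim=128, num_classes=10):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 64)
        self.classifier = nn.Linear(64, num_classes)
        # The auxiliary head that produced y_tgt has been removed entirely.
        # If it stayed in forward() without feeding the loss, its parameters
        # would receive no gradients and DDP would raise this error again.

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        return self.classifier(h)  # return only y_pred

model = MyModel()
x = torch.randn(4, 128)
y_true = torch.randint(0, 10, (4,))
y_pred = model(x)
loss = F.cross_entropy(y_pred, y_true)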