Broadly there are two ways (a rough sketch of both schemes follows this list):

1. Call `loss.backward()` on every batch, but only call `optimizer.step()` and `optimizer.zero_grad()` every N batches. Is it the case that the gradients of the N batches are summed up? If so, to maintain the same learning rate per effective batch, do we have to divide the learning rate by N?
2. Accumulate the loss instead of the gradient, and call `(loss / N).backward()` every N batches. This is easy to understand, but doesn't it defeat the purpose of saving memory (because the computation graphs of all N batches have to be kept alive so their gradients can be computed at once)? Here the learning rate doesn't need adjusting to maintain the same learning rate per effective batch, but it should be multiplied by N if you want to maintain the same learning rate per example.
Which one is better, or more commonly used in packages such as pytorch-lightning? It seems that the way `optimizer.zero_grad()` works (gradients keep accumulating in `.grad` until it is called) is a perfect fit for gradient accumulation, so (1) should be the recommended approach.
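For context, the closest thing I could find in pytorch-lightning is the `accumulate_grad_batches` argument of `Trainer`; a hypothetical usage (assuming that flag is the relevant one) would be:

```python
# Assumption on my part: `accumulate_grad_batches` is the relevant Trainer flag
# for gradient accumulation.
import pytorch_lightning as pl

trainer = pl.Trainer(accumulate_grad_batches=4)
# trainer.fit(some_lightning_module)  # a LightningModule would be needed here
```

but I don't know which of the two schemes it implements internally, which is essentially what I'm asking.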