
Broadly there are two ways:

  1. Call loss.backward() on every batch, but only call optimizer.step() and optimizer.zero_grad() every N batches (sketch (1) after this list). Is it the case that the gradients of the N batches are summed up? If so, do we have to divide the learning rate by N to keep the same learning rate per effective batch?

  2. Accumulate the loss instead of the gradient, and call (loss / N).backward() every N batches (sketch (2) after this list). This is easy to understand, but does it defeat the purpose of saving memory, since the computation graphs of all N batches have to be kept around until backward() is called? The learning rate doesn't need adjusting to maintain the same learning rate per effective batch, but it should be multiplied by N if you want to maintain the same learning rate per example.

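To make the two options concrete, here is a minimal sketch of both loops as I understand them; model, loss_fn, loader, and N are placeholder names, not taken from any particular package:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder setup -- model, loss_fn, loader, and N are illustrative only.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)
N = 4  # number of mini-batches to accumulate

# (1) Accumulate gradients: backward() every batch, step()/zero_grad() every N batches.
#     .backward() adds into .grad, so the N per-batch gradients are summed; dividing the
#     learning rate by N (or calling (loss / N).backward()) turns the sum into a mean.
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    loss.backward()                    # graph for this mini-batch is freed here
    if (i + 1) % N == 0:
        optimizer.step()
        optimizer.zero_grad()

# (2) Accumulate the loss: a single backward() every N batches.
#     All N computation graphs stay alive until that backward(), so peak memory grows with N.
optimizer.zero_grad()
accumulated = 0.0
for i, (x, y) in enumerate(loader):
    accumulated = accumulated + loss_fn(model(x), y)
    if (i + 1) % N == 0:
        (accumulated / N).backward()
        optimizer.step()
        optimizer.zero_grad()
        accumulated = 0.0
```
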
Which one is better, or more commonly used in packages such as pytorch-lightning? It seems that optimizer.zero_grad() is a perfect fit for gradient accumulation, so (1) should be the recommended approach.
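
For reference, I believe pytorch-lightning exposes built-in accumulation through the accumulate_grad_batches argument of Trainer; the toy module below is just a self-contained placeholder, and I'm assuming Lightning implements approach (1) internally:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Minimal placeholder LightningModule for illustration only.
class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

train_loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

# Gradients are accumulated over 4 mini-batches before each optimizer step.
trainer = pl.Trainer(accumulate_grad_batches=4, max_epochs=1)
trainer.fit(ToyModule(), train_loader)
```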

