In this tutorial, we’ll go over the need for normalizing inputs to a neural network and then learn the techniques of batch and layer normalization.
Why Should You Normalize Inputs in a Neural Network?
- the numeric input features can take on values in very different ranges
- with unscaled inputs, the neural network takes significantly longer to train because the gradient descent algorithm takes longer to converge
- large input values can also propagate through the layers of the network, leading to the accumulation of large error gradients that make the training process unstable; this is known as the problem of exploding gradients
→ Preprocessing techniques such as normalization and standardization transform the input data so that all features are on the same scale
Normalization vs Standardization
- normalization: maps the input values to the range [0, 1] (min-max scaling)
- standardization: transforms the input values so that they follow a distribution with zero mean and unit variance (this is also often referred to as normalization)
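As a quick illustration, here is a minimal NumPy sketch of both techniques applied per feature (the feature matrix `X` below is a made-up example):

```python
import numpy as np

# hypothetical feature matrix: two features on very different scales
X = np.array([[50.0, 0.1],
              [20.0, 0.4],
              [80.0, 0.9]])

# min-max normalization: map each feature to the [0, 1] range
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# standardization: zero mean and unit variance per feature
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
```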
Need for Batch Normalization
If the inputs to a particular layer change drastically, we can again run into the problem of unstable gradients.
But why?
The mini-batch gradient descent algorithm runs one update per batch of the input dataset. Each update adjusts the weights and biases (parameters) of the neural network so that they fit the distribution seen at the input to the specific layer for the current batch.
Once the network has learned to fit the current distribution, if the distribution changes substantially for the next batch, the parameters have to be updated again to fit the new distribution. This slows down the training process.
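As a rough sketch of this, consider mini-batch gradient descent on a simple linear model. Everything below (the synthetic data, batch size, and learning rate) is hypothetical; the point is only that each update is driven by the statistics of the current batch.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b, lr = 0.0, 0.0, 0.001

for step in range(200):
    # each mini-batch is drawn from a differently scaled input distribution
    scale = rng.choice([1.0, 10.0])
    X = rng.normal(0.0, scale, size=32)
    y = 3.0 * X + 1.0

    # mean-squared-error gradients computed from this batch only
    err = (w * X + b) - y
    grad_w = 2.0 * np.mean(err * X)
    grad_b = 2.0 * np.mean(err)

    # the parameters are pulled toward whatever distribution this batch came from
    w -= lr * grad_w
    b -= lr * grad_b
```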
What is Batch Normalization?
For every neuron (activation) in a particular layer, we can force the pre-activations to have zero mean and unit standard deviation. This is achieved by subtracting the mini-batch mean of each neuron’s pre-activation and dividing by the mini-batch standard deviation.
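A minimal NumPy sketch of this normalization step over a mini-batch of pre-activations might look as follows; the array shapes are assumed, and the learnable scale (`gamma`) and shift (`beta`) that batch normalization applies after normalizing are included for completeness.

```python
import numpy as np

def batch_norm(pre_act, gamma, beta, eps=1e-5):
    # mean and variance of each neuron's pre-activation across the mini-batch
    mean = pre_act.mean(axis=0)
    var = pre_act.var(axis=0)

    # normalize to zero mean and unit standard deviation
    normalized = (pre_act - mean) / np.sqrt(var + eps)

    # scale and shift with the learnable parameters
    return gamma * normalized + beta

pre_act = np.random.randn(32, 4) * 5.0 + 3.0   # batch of 32 samples, 4 neurons
out = batch_norm(pre_act, gamma=np.ones(4), beta=np.zeros(4))
```

During training the mean and standard deviation are computed from the current mini-batch; at inference time, running estimates collected during training are typically used instead.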
Limitations of Batch Normalization
- when the batch size is small, the sample mean and sample standard deviation are not representative enough of the actual distribution and the network cannot learn anything meaningful.
- because batch normalization depends on statistics computed across the batch, it is less suited for sequence models, where sequence lengths vary and per-time-step batch statistics become unreliable