Optimizations in deep learning

Gradient descent (GD) is a simple optimization method used in machine learning. When gradient steps are taken with respect to all m examples at each step, this process is referred to as Batch Gradient Descent.
Gradient Descent is the most common approach for updating parameters and minimizing the cost. In this article, we will explore more advanced optimization methods that can speed up learning and potentially reach a better final value of the cost function. An effective optimization algorithm can significantly reduce the time needed to achieve satisfactory results, turning what might take days into just a few hours.
This article will cover several optimization methods:
Gradient Descent with Momentum
RMSProp
Adam
Gradient descent with momentum
Gradient descent with momentum is an optimization algorithm designed to enhance the convergence speed of standard gradient descent by integrating a momentum term into the update rule. This method is widely employed in machine learning to effectively minimize loss functions and facilitate the training of deep neural networks.
The fundamental principle of momentum involves the computation of an exponentially weighted average of the gradients for the purpose of updating model weights. By taking into account previous gradients, the gradient descent process achieves a smoother progression, thereby mitigating oscillations during iterations.
In a previous article we discussed how gradient descent updates the parameters (weights and biases) to minimize the cost over the m training examples. Here \(\alpha\) is the learning rate:
$$\text{repeat}\ \begin{Bmatrix} w=w-\alpha\frac{dJ(w,b)}{dw}\\ b=b-\alpha\frac{dJ(w,b)}{db} \end{Bmatrix}$$
We update the parameters \(W^{l}\) and \(b^{l}\) using gradient descent for every layer \(l=1,2,\dots,L\):
\(\begin{flalign*} &W^{l}=W^{l}-\alpha dW^{l} &\\ & b^{l}=b^{l}-\alpha db^{l} \end{flalign*}\)
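The per-layer update above can be sketched in NumPy. This is a minimal illustration, assuming parameters and gradients are stored in dicts keyed `"W1"`, `"b1"`, `"dW1"`, `"db1"`, … (a layout chosen here for illustration, not specified in the article):

```python
import numpy as np

def gradient_descent_update(params, grads, alpha=0.01):
    """One plain gradient-descent step over all layers.

    params: dict {"W1": ..., "b1": ..., ..., "WL": ..., "bL": ...}
    grads:  dict with matching keys "dW1", "db1", ...
    alpha:  learning rate
    """
    L = len(params) // 2  # two entries (W, b) per layer
    for l in range(1, L + 1):
        params[f"W{l}"] -= alpha * grads[f"dW{l}"]
        params[f"b{l}"] -= alpha * grads[f"db{l}"]
    return params
```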
The momentum update rule for layers \(l=1,\dots,L\) is modified as follows.
First we calculate the velocity for each layer, then we update the weights and biases using this velocity:
\(v_{dW^{l}}=\beta v_{dW^{l}}+ (1-\beta)dW^{l}\)
\(\begin{flalign*} &W^{l}=W^{l}-\alpha v_{dW^{l}} \end{flalign*}\)
\(v_{db^{l}}=\beta v_{db^{l}}+ (1-\beta)db^{l}\)
\(\begin{flalign*} &b^{l}=b^{l}-\alpha v_{db^{l}} \end{flalign*}\)
The larger the momentum parameter \(\beta\), the smoother the updates will be, because past gradients are taken into account to a greater extent. However, if \(\beta\) is set too high, it may overly smooth the updates, which can be detrimental. Common values for \(\beta\) typically range from 0.8 to 0.999; \(\beta=0.9\) is often a reasonable default choice.
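The velocity and parameter updates above can be sketched as follows. As before, this assumes the illustrative dict layout (`"W1"`, `"dW1"`, …) for parameters, gradients, and velocities; `v` should be initialized to zeros with the same shapes as the gradients:

```python
import numpy as np

def momentum_update(params, grads, v, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum step over all layers."""
    L = len(params) // 2
    for l in range(1, L + 1):
        # Exponentially weighted average of past gradients (the velocity)
        v[f"dW{l}"] = beta * v[f"dW{l}"] + (1 - beta) * grads[f"dW{l}"]
        v[f"db{l}"] = beta * v[f"db{l}"] + (1 - beta) * grads[f"db{l}"]
        # Step in the direction of the velocity, not the raw gradient
        params[f"W{l}"] -= alpha * v[f"dW{l}"]
        params[f"b{l}"] -= alpha * v[f"db{l}"]
    return params, v
```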
RMSProp
RMSprop, which stands for Root Mean Squared Propagation, is an optimization algorithm that utilizes a moving average of squared gradients to adjust the learning rate for each parameter individually. This approach enhances both the convergence speed and stability of the model training process, making RMSprop a valuable tool in optimizing machine learning models.
During mini-batch gradient descent, compute dW and db on the current mini-batch, then update the weights and biases as follows:
\(s_{dW^{l}}=\beta s_{dW^{l}}+ (1-\beta)(dW^{l})^{2}\)
\(s_{db^{l}}=\beta s_{db^{l}}+ (1-\beta)(db^{l})^{2}\)
\(\begin{flalign*} &W^{l}=W^{l}-\alpha\frac{dW^{l}}{\sqrt{s_{dW^{l}}+\epsilon} }\end{flalign*}\)
\(\begin{flalign*} &b^{l}=b^{l}-\alpha\frac{db^{l}}{\sqrt{s_{db^{l}}+\epsilon} }\end{flalign*}\)
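A minimal sketch of the RMSProp step, again using the illustrative dict layout; `s` holds the moving average of squared gradients and should start at zero. The small constant \(\epsilon\) guards against division by zero:

```python
import numpy as np

def rmsprop_update(params, grads, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp step over all layers."""
    L = len(params) // 2
    for l in range(1, L + 1):
        # Moving average of squared gradients (element-wise)
        s[f"dW{l}"] = beta * s[f"dW{l}"] + (1 - beta) * grads[f"dW{l}"] ** 2
        s[f"db{l}"] = beta * s[f"db{l}"] + (1 - beta) * grads[f"db{l}"] ** 2
        # Scale each parameter's step by the root of its running average
        params[f"W{l}"] -= alpha * grads[f"dW{l}"] / np.sqrt(s[f"dW{l}"] + eps)
        params[f"b{l}"] -= alpha * grads[f"db{l}"] / np.sqrt(s[f"db{l}"] + eps)
    return params, s
```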
Adam
Adam is recognized as one of the most effective optimization algorithms employed in the training of neural networks. This algorithm integrates principles from RMSProp and Momentum.
It computes an exponentially weighted average of past gradients, which is recorded in two variables
One without bias correction \(v_{dW^{l}}=\beta_1 v_{dW^{l}}+ (1-\beta_1)dW^{l}\)
One with bias correction \(v_{dW^{l}}^{corrected}=\frac {v_{dW^{l}}}{1-(\beta_1)^{t}}\)
Additionally, it calculates an exponentially weighted average of the squares of the past gradients, again stored in two variables
One without bias correction \(s_{dW^{l}}=\beta_2 s_{dW^{l}}+ (1-\beta_2)(dW^{l})^{2}\)
One with bias correction \(s_{dW^{l}}^{corrected}=\frac {s_{dW^{l}}}{1-(\beta_2)^{t}}\)
The algorithm updates the model parameters by combining these two computations; the analogous quantities \(v_{db^{l}}\) and \(s_{db^{l}}\) are maintained for the biases.
\(\begin{flalign*} &W^{l}=W^{l}-\alpha\frac{v_{dW^{l}}^{corrected}}{\sqrt{s_{dW^{l}}^{corrected}+\epsilon} } &\\ &b^{l}=b^{l}-\alpha\frac{v_{db^{l}}^{corrected}}{\sqrt{s_{db^{l}}^{corrected}+\epsilon} }\end{flalign*}\)
where:
l indexes the layers (\(l=1,\dots,L\))
\(\beta_1\) and \(\beta_2\) are hyperparameters that control the two exponentially weighted averages.
t counts the number of Adam steps (mini-batch iterations) taken
\(\alpha\) is the learning rate
\(\epsilon\) is a very small number to avoid dividing by zero




