This article is based on the MATH532 lecture by Prof. Hwang at POSTECH.

What Is the Gradient Descent Algorithm?

In machine learning, we usually have to minimize a target function (e.g., the MSE). How can we find its minimum when the function is too complicated to minimize analytically? Think back to the basic calculus you learned as a freshman. There was a notion called the "gradient," which points in the direction of steepest ascent; its magnitude gives the rate of change of the function in that direction. The gradient is therefore very useful for finding a minimum or maximum, because it tells us which way to go. To minimize, of course, we move in the direction opposite to the gradient. This can be written as

$$ w_k = w_{k-1} - \eta \cdot \nabla_w J(w_{k-1}) = w_{k-1} - \eta \cdot \nabla_w \sum_{i=1}^{N} L(w_{k-1}; x_i, y_i) $$

where $J(w) = \sum_{i=1}^{N} L(y_i, \hat{f}(w; x_i))$ and $L$ is the loss function.
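To make this concrete, here is a minimal NumPy sketch of the update rule applied to a toy least-squares problem. The function name, the quadratic loss, and the parameter values are my own illustrative choices, not from the lecture.

```python
import numpy as np

def gradient_descent(X, y, eta=0.001, n_iters=1000):
    """Minimize J(w) = sum_i (y_i - x_i @ w)^2 by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2 * X.T @ (y - X @ w)   # gradient of J summed over all N samples
        w = w - eta * grad              # w_k = w_{k-1} - eta * grad J(w_{k-1})
    return w

# toy usage: recover w_true = [2.0, -3.0] from noisy linear data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y))
```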

Limitations

(Image: a gradient descent meme, via https://www.reddit.com/r/ProgrammerHumor/comments/eez8al/gradient_descent/)

The simple gradient descent algorithm has several limitations.

  • If the starting point is chosen inappropriately, gradient descent may only approach a local minimum and never reach the global one.

  • We need to compute $\nabla_w L(w_{k-1}; x_i, y_i)$ for all $i$ to perform one update. Hence, it can be very slow and is intractable for datasets that don't fit in memory.

Stochastic Gradient Descent

One method to overcome the limitations mentioned above is stochastic gradient descent. It is literally a stochastic approximation of gradient descent optimization.

We select a mini-batch of size $B$ from the full sample of size $N$, and iteratively update the parameter $w$ so that the loss function decreases. This reduces the computational burden. There is also fluctuation in the objective function, since the $B$ data points are selected randomly; this fluctuation enables the iterates to jump to new and potentially better local minima.

$$ w_k = w_{k-1} - \eta \cdot \nabla_w \sum_{i=1}^{B} L(w_{k-1}; x_i, y_i) $$
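Below is a minimal sketch of this mini-batch update on the same kind of toy least-squares problem as in the earlier sketch; the default batch size `B` and sampling without replacement are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, eta=0.01, B=10, n_iters=1000, seed=0):
    """Mini-batch SGD for the squared-error loss: each step uses B randomly drawn samples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=B, replace=False)   # draw a mini-batch of size B
        Xb, yb = X[idx], y[idx]
        grad = -2 * Xb.T @ (yb - Xb @ w)                   # gradient over the batch only
        w = w - eta * grad
    return w
```

Called on the same toy data as the full-batch example, it reaches roughly the same solution while touching only $B$ samples per step.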

Limitations

SGD still applies a single, fixed learning rate to every parameter, and its noisy mini-batch updates can oscillate around a minimum. The methods below address these two issues.

Momentum Methods

This is the most interesting algorithm in today's class! When we drop a ball onto a complex loss surface with a suitable amount of friction, the ball is likely to pass through local minima and arrive at the global minimum. This is because, as the ball moves downward, its potential energy is transformed into kinetic energy, so it gains enough energy to pass through local minima. Momentum methods are inspired by this physical phenomenon. The name momentum comes from the fact that the negative gradient acts like a force moving a particle through parameter space (although it is not exactly the same as classical momentum).

The algorithm can be expressed with the following equations.

$$ v_k = \gamma v_{k-1} + \eta \cdot \nabla_w \sum_{i=1}^{B} L(w_{k-1}; x_i, y_i) $$
$$ w_k = w_{k-1} - v_k = w_{k-1} - \gamma v_{k-1} - \eta \cdot \nabla_w \sum_{i=1}^{B} L(w_{k-1}; x_i, y_i) $$

Here, $\gamma$ plays the role of friction (although not exactly the same). Momentum helps accelerate SGD (stochastic gradient descent) in the relevant direction and dampens oscillations. It can also help escape local minima.
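A minimal sketch of the momentum update, assuming the same toy least-squares loss as in the earlier sketches; the default values of `eta` and `gamma` are illustrative.

```python
import numpy as np

def momentum_sgd(X, y, eta=0.01, gamma=0.9, B=10, n_iters=1000, seed=0):
    """Mini-batch SGD with momentum: the velocity v accumulates past gradients with decay gamma."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=B, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = -2 * Xb.T @ (yb - Xb @ w)
        v = gamma * v + eta * grad      # v_k = gamma * v_{k-1} + eta * grad
        w = w - v                       # w_k = w_{k-1} - v_k
    return w
```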

Nesterov Accelerated Gradient

The momentum method computes the gradient at the current position $w_{k-1}$. Nesterov Accelerated Gradient, on the other hand, computes it at the approximated future position $w_{k-1} - \gamma v_{k-1}$.

$$ v_k = \gamma v_{k-1} + \eta \cdot \nabla_w \sum_{i=1}^{B} L(w_{k-1} - \gamma v_{k-1}; x_i, y_i) $$
$$ w_k = w_{k-1} - v_k = w_{k-1} - \gamma v_{k-1} - \eta \cdot \nabla_w \sum_{i=1}^{B} L(w_{k-1} - \gamma v_{k-1}; x_i, y_i) $$
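The only change from the momentum sketch is where the gradient is evaluated: at the look-ahead point $w_{k-1} - \gamma v_{k-1}$ rather than at $w_{k-1}$. Again, this is a sketch on the toy least-squares loss, not the lecture's code.

```python
import numpy as np

def nesterov_sgd(X, y, eta=0.01, gamma=0.9, B=10, n_iters=1000, seed=0):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point w - gamma * v."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=B, replace=False)
        Xb, yb = X[idx], y[idx]
        w_ahead = w - gamma * v                      # approximated future position
        grad = -2 * Xb.T @ (yb - Xb @ w_ahead)       # gradient at the look-ahead point
        v = gamma * v + eta * grad
        w = w - v
    return w
```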

Adaptive Methods

AdaGrad

As mentioned earlier, in order to arrive at the global minimum rather than a local minimum, we need an appropriate learning rate $\eta$. If $\eta$ is too small, convergence is too slow; if $\eta$ is too large, the iterates fluctuate too much to converge. A constant $\eta$ is also not suitable for a complex loss function.

AdaGrad (Adaptive Gradient) is an algorithm that adjusts the learning rate automatically: it makes larger updates for infrequently updated parameters and smaller updates for frequently updated ones. The algorithm can be written, for each parameter $i$, as
$$ g_{k-1,i} := \nabla_{w_i} J(w_{k-1}) $$

$$ G_{k-1,i} = \sum_{t=0}^{k-1} |g_{t,i}|^2 $$

$$ w_k = w_{k-1} - \frac{\eta}{\sqrt{G_{k-1,i} + \epsilon}}\, g_{k-1,i} $$

As the accumulated squared gradient gets bigger, the effective learning rate gets smaller.
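A minimal sketch of AdaGrad on the same toy least-squares loss; `eps` and the default learning rate are illustrative choices.

```python
import numpy as np

def adagrad(X, y, eta=0.1, eps=1e-8, B=10, n_iters=1000, seed=0):
    """AdaGrad: each coordinate's step is scaled by its accumulated squared gradients G."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    G = np.zeros_like(w)                              # running sum of squared gradients, per coordinate
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=B, replace=False)
        Xb, yb = X[idx], y[idx]
        g = -2 * Xb.T @ (yb - Xb @ w)
        G += g ** 2                                   # G_{k-1,i} = sum_t g_{t,i}^2
        w = w - eta / np.sqrt(G + eps) * g            # per-coordinate step eta / sqrt(G + eps)
    return w
```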

RMSProp

AdaGrad has a weakness because of the denominator $G$: since $G$ increases monotonically, the learning rate keeps shrinking and eventually becomes so small that the algorithm can no longer learn. If we replace the sum of the squared gradients with an average, this problem is solved. RMSProp averages the previous estimate and the current $g^2$ with weights $\gamma$ and $1-\gamma$, respectively.

$$ g_{k-1,i} := \nabla_{w_i} J(w_{k-1}) $$

$$ E[g^2]_{k-1} = \gamma E[g^2]_{k-2} + (1-\gamma)\, g^2_{k-1} $$

where $$ E[g^2]_{k-1} = \sum_{i=0}^{k-1} (1-\gamma)\,\gamma^i\, g^2_{k-1-i} $$

and $$ w_k = w_{k-1} - \frac{\eta}{\sqrt{E[g^2]_{k-1} + \epsilon}}\, g_{k-1,i} $$
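A minimal RMSProp sketch under the same assumptions as the previous examples; compared with the AdaGrad sketch, only the accumulator changes, from a running sum to an exponential moving average.

```python
import numpy as np

def rmsprop(X, y, eta=0.01, gamma=0.9, eps=1e-8, B=10, n_iters=1000, seed=0):
    """RMSProp: exponential moving average of squared gradients instead of AdaGrad's running sum."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    Eg2 = np.zeros_like(w)                            # E[g^2], per coordinate
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=B, replace=False)
        Xb, yb = X[idx], y[idx]
        g = -2 * Xb.T @ (yb - Xb @ w)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2      # average with weights gamma and 1 - gamma
        w = w - eta / np.sqrt(Eg2 + eps) * g
    return w
```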

AdaDelta

AdaDelta is another extension of AdaGrad. It modifies not only the denominator but also the numerator.

$$ g_{k-1,i} := \nabla_{w_i} J(w_{k-1}) $$

$$ E[\Delta w^2]_{k-1} = \gamma_1 E[\Delta w^2]_{k-2} + (1-\gamma_1)\, \Delta w^2_{k-1} $$

$$ E[g^2]_{k-1} = \gamma_2 E[g^2]_{k-2} + (1-\gamma_2)\, g^2_{k-1} $$

and $$ w_k = w_{k-1} - \frac{\sqrt{E[\Delta w^2]_{k-1} + \epsilon}}{\sqrt{E[g^2]_{k-1} + \epsilon}}\, g_{k-1,i} $$
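A minimal AdaDelta sketch under the same toy setup; for simplicity it uses a single decay rate `gamma` for both running averages, whereas the equations above use $\gamma_1$ and $\gamma_2$, and the defaults are illustrative.

```python
import numpy as np

def adadelta(X, y, gamma=0.9, eps=1e-6, B=10, n_iters=2000, seed=0):
    """AdaDelta: step size is the ratio of the RMS of past updates to the RMS of past gradients."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    Eg2 = np.zeros_like(w)       # E[g^2]: running average of squared gradients
    Edw2 = np.zeros_like(w)      # E[dw^2]: running average of squared updates
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=B, replace=False)
        Xb, yb = X[idx], y[idx]
        g = -2 * Xb.T @ (yb - Xb @ w)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        dw = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g   # update scaled by the RMS ratio
        Edw2 = gamma * Edw2 + (1 - gamma) * dw ** 2
        w = w + dw
    return w
```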

Adam

Adam is an algorithm that combines RMSProp and momentum. It was proposed by Kingma & Ba. It not only keeps an average of past squared gradients, like AdaDelta and RMSProp, but also keeps an average of past gradients. Since the initial averages are biased towards zero, Adam corrects them to be unbiased.

$$ g_{k-1} := \nabla_w J(w_{k-1}) $$

$$ m_{k-1} = \gamma_1 m_{k-2} + (1-\gamma_1)\, g_{k-1} $$

$$ v_{k-1} = \gamma_2 v_{k-2} + (1-\gamma_2)\, g^2_{k-1} $$

$$ \hat{m}_{k-1} = \frac{m_{k-1}}{1-\gamma_1^{k-1}} $$

$$ \hat{v}_{k-1} = \frac{v_{k-1}}{1-\gamma_2^{k-1}} $$

and $$ w_k = w_{k-1} - \frac{\eta}{\sqrt{\hat{v}_{k-1}} + \epsilon}\, \hat{m}_{k-1} $$
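A minimal Adam sketch under the same toy setup; the defaults `gamma1=0.9`, `gamma2=0.999`, and `eps=1e-8` follow common practice rather than anything stated above.

```python
import numpy as np

def adam(X, y, eta=0.01, gamma1=0.9, gamma2=0.999, eps=1e-8, B=10, n_iters=1000, seed=0):
    """Adam: bias-corrected moving averages of gradients (m) and squared gradients (v)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for k in range(1, n_iters + 1):
        idx = rng.choice(len(y), size=B, replace=False)
        Xb, yb = X[idx], y[idx]
        g = -2 * Xb.T @ (yb - Xb @ w)
        m = gamma1 * m + (1 - gamma1) * g             # first moment estimate
        v = gamma2 * v + (1 - gamma2) * g ** 2        # second moment estimate
        m_hat = m / (1 - gamma1 ** k)                 # bias correction
        v_hat = v / (1 - gamma2 ** k)
        w = w - eta / (np.sqrt(v_hat) + eps) * m_hat
    return w
```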

Code

You can find the gradient descent code I wrote from scratch in the repository below.

https://github.com/MunjungKim/Gradient-Descent-Algorithm
