# Deep Learning Optimizers

This repository contains implementations and explanations of various optimization algorithms used in deep learning. Each optimizer is presented with its mathematical update equations and a small code example using Keras.

## Table of Contents
- [Introduction](#introduction)
- [Optimizers](#optimizers)
  - [Gradient Descent](#gradient-descent)
  - [Stochastic Gradient Descent (SGD)](#stochastic-gradient-descent-sgd)
  - [Momentum](#momentum)
  - [AdaGrad](#adagrad)
  - [RMSprop](#rmsprop)
  - [Adam](#adam)
- [Usage](#usage)

## Introduction

Optimizers are algorithms or methods used to change the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss. They minimize (or maximize) an objective function by iteratively adjusting the weights of the network.

## Optimizers

### Gradient Descent

Gradient Descent is the most basic yet most widely used optimization algorithm. It is an iterative algorithm that finds a minimum of a function by repeatedly stepping in the direction of the negative gradient.

**Mathematical Equation:**

$$ \theta = \theta - \eta \nabla J(\theta) $$

**Keras Code:**

```python
from keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
```
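
To make the update rule concrete, here is a minimal NumPy sketch of plain gradient descent on a toy quadratic loss (the loss function, variable names, and step count are illustrative choices, not part of Keras):

```python
import numpy as np

# Toy loss J(theta) = (theta - 3)^2 with gradient dJ/dtheta = 2 * (theta - 3).
def grad_J(theta):
    return 2.0 * (theta - 3.0)

eta = 0.1      # learning rate
theta = 0.0    # initial parameter value

for _ in range(100):
    theta = theta - eta * grad_J(theta)   # theta <- theta - eta * grad J(theta)

print(theta)   # converges towards the minimum at theta = 3
```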

### Stochastic Gradient Descent (SGD)

SGD updates the weights after each individual training example (or mini-batch), rather than computing the gradient over the entire training set before every update, as batch gradient descent does.

**Mathematical Equation:**

$$ \theta = \theta - \eta \nabla J(\theta; x^{(i)}; y^{(i)}) $$

**Keras Code:**

```python
from keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
```
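
A minimal NumPy sketch of per-example updates for a linear model with squared-error loss (the data, shapes, and step counts are illustrative assumptions, not part of Keras):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + noise.
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

eta = 0.01
w = np.zeros(3)

for epoch in range(20):
    for i in rng.permutation(len(X)):   # visit examples in random order
        err = X[i] @ w - y[i]           # prediction error for one example
        grad = err * X[i]               # gradient of 0.5 * err^2 w.r.t. w
        w = w - eta * grad              # update after every single example

print(w)   # close to w_true = [2.0, -1.0, 0.5]
```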

### Momentum

Momentum helps accelerate the gradient updates in the relevant direction and dampens oscillations, which leads to faster convergence. It does so by adding a fraction $\gamma$ of the previous update vector $v_{t-1}$ to the current update.

**Mathematical Equation:**

$$ v_t = \gamma v_{t-1} + \eta \nabla J(\theta) $$
$$ \theta = \theta - v_t $$

**Keras Code:**

```python
from keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9), loss='mse')
```
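
A minimal NumPy sketch of the momentum update rule on a toy quadratic loss (illustrative only; not how Keras implements the optimizer internally):

```python
import numpy as np

# Toy loss J(theta) = 0.5 * ||theta - target||^2 with gradient theta - target.
target = np.array([1.0, -2.0, 3.0])
def grad_J(theta):
    return theta - target

eta, gamma = 0.01, 0.9        # learning rate and momentum coefficient
theta = np.zeros_like(target)
v = np.zeros_like(target)     # velocity, initialised to zero

for _ in range(500):
    g = grad_J(theta)
    v = gamma * v + eta * g   # v_t = gamma * v_{t-1} + eta * g_t
    theta = theta - v         # theta <- theta - v_t

print(theta)   # approaches the minimiser `target`
```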

### AdaGrad

AdaGrad adapts the learning rate individually for each parameter, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones.

**Mathematical Equation:**

$$\theta = \theta - \frac{\eta}{\sqrt{G_{ii} + \epsilon}} \nabla J(\theta)$$

where $G_{ii}$ is the sum of the squares of the past gradients with respect to parameter $i$, and $\epsilon$ is a small constant that avoids division by zero.

**Keras Code:**

```python
from keras.optimizers import Adagrad

model.compile(optimizer=Adagrad(learning_rate=0.01), loss='mse')
```
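
A minimal NumPy sketch of the per-parameter AdaGrad update on a toy quadratic loss (illustrative only; not Keras's internal implementation):

```python
import numpy as np

target = np.array([1.0, -2.0, 3.0])
def grad_J(theta):              # gradient of 0.5 * ||theta - target||^2
    return theta - target

eta, eps = 0.5, 1e-8
theta = np.zeros_like(target)
G = np.zeros_like(target)       # running sum of squared gradients, one entry per parameter

for _ in range(1000):
    g = grad_J(theta)
    G = G + g ** 2                                 # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * g     # per-parameter scaled step

print(theta)   # approaches `target`; note the effective step size only ever shrinks
```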

### RMSprop

RMSprop modifies AdaGrad to perform better in non-convex settings by replacing the accumulated sum of squared gradients with an exponentially weighted moving average, so the effective learning rate does not keep shrinking throughout training.

**Mathematical Equation:**

$$ E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2 $$
$$ \theta = \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t $$

**Keras Code:**

```python
from keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=0.001), loss='mse')
```
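
A minimal NumPy sketch of the RMSprop update on a toy quadratic loss (illustrative only; the learning rate here is tuned for the toy problem, not the Keras default):

```python
import numpy as np

target = np.array([1.0, -2.0, 3.0])
def grad_J(theta):              # gradient of 0.5 * ||theta - target||^2
    return theta - target

eta, gamma, eps = 0.01, 0.9, 1e-8
theta = np.zeros_like(target)
Eg2 = np.zeros_like(target)     # exponentially weighted average of squared gradients

for _ in range(2000):
    g = grad_J(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2        # E[g^2]_t
    theta = theta - eta / np.sqrt(Eg2 + eps) * g    # scaled step

print(theta)   # approaches `target` (up to small oscillations around the minimum)
```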

### Adam

Adam combines the advantages of two other extensions of SGD, AdaGrad and RMSprop: it keeps exponentially decaying averages of both past gradients (the first moment) and past squared gradients (the second moment), and corrects each for its initialization bias.

**Mathematical Equation:**

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$
$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$
$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$
$$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
$$ \theta = \theta - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$

**Keras Code:**

```python
from keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
```
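
A minimal NumPy sketch of the Adam update, including the bias correction, on a toy quadratic loss (illustrative only; not Keras's internal implementation):

```python
import numpy as np

target = np.array([1.0, -2.0, 3.0])
def grad_J(theta):              # gradient of 0.5 * ||theta - target||^2
    return theta - target

eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
theta = np.zeros_like(target)
m = np.zeros_like(target)       # first moment estimate (mean of gradients)
v = np.zeros_like(target)       # second moment estimate (mean of squared gradients)

for t in range(1, 2001):        # t starts at 1 for the bias correction
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)        # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)   # approaches `target` (up to small oscillations on the order of the learning rate)
```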

## Usage

To use one of these optimizers, pass it as the `optimizer` argument of `model.compile`. For example:

```python
from keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))
```
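
For a fully self-contained variant, the following sketch builds a small classifier on synthetic data; the architecture, input shape, and random labels are illustrative assumptions, not a recommendation:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input
from keras.optimizers import Adam

# Synthetic 3-class data, purely for illustration.
X_train = np.random.rand(1000, 20).astype('float32')
y_train = np.eye(3)[np.random.randint(0, 3, size=1000)]   # one-hot labels

model = Sequential([
    Input(shape=(20,)),
    Dense(32, activation='relu'),
    Dense(3, activation='softmax'),
])

model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
```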