From 03fb6efedfee9a456107cf9972e140f242e32e1c Mon Sep 17 00:00:00 2001
From: shantnu <98252196+Shantnu-singh@users.noreply.github.com>
Date: Mon, 29 Jul 2024 11:17:32 +0530
Subject: [PATCH 1/2] Added optimizers in ANN

---
 .../Introduction.md | 133 ++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 docs/Deep Learning/Optimizers in Deep Learning/Introduction.md

diff --git a/docs/Deep Learning/Optimizers in Deep Learning/Introduction.md b/docs/Deep Learning/Optimizers in Deep Learning/Introduction.md
new file mode 100644
index 000000000..cc6b8e268
--- /dev/null
+++ b/docs/Deep Learning/Optimizers in Deep Learning/Introduction.md
@@ -0,0 +1,133 @@
# Deep Learning Optimizers

This page explains various optimization algorithms used in deep learning. Each optimizer is described with its mathematical update rule and a small code example using Keras.

## Table of Contents
- [Introduction](#introduction)
- [Optimizers](#optimizers)
  - [Gradient Descent](#gradient-descent)
  - [Stochastic Gradient Descent (SGD)](#stochastic-gradient-descent-sgd)
  - [Momentum](#momentum)
  - [AdaGrad](#adagrad)
  - [RMSprop](#rmsprop)
  - [Adam](#adam)
- [Usage](#usage)

## Introduction

Optimizers are algorithms that adjust the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss. They minimize (or maximize) an objective function by iteratively updating the network's weights.

## Optimizers

### Gradient Descent

Gradient Descent is the most basic and most widely used optimization algorithm. It iteratively moves the parameters in the direction of the negative gradient of the loss computed over the full training set.

**Mathematical Equation:**

$$ \theta = \theta - \eta \nabla J(\theta) $$

**Keras Code:**

```python
from keras.optimizers import SGD

# Keras exposes gradient descent through the SGD class; with the batch size
# set to the full dataset this is plain (batch) gradient descent.
model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
```

### Stochastic Gradient Descent (SGD)

SGD updates the weights after each individual training example rather than after a full pass over the dataset, which makes the updates noisier but far more frequent.

**Mathematical Equation:**

$$ \theta = \theta - \eta \nabla J(\theta; x^{(i)}; y^{(i)}) $$

**Keras Code:**

```python
from keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
```

### Momentum

Momentum accelerates gradient descent by accumulating a velocity vector in directions of consistent gradient, which damps oscillations and leads to faster convergence.

**Mathematical Equation:**

$$ v_t = \gamma v_{t-1} + \eta \nabla J(\theta) $$
$$ \theta = \theta - v_t $$

**Keras Code:**

```python
from keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9), loss='mse')
```

### AdaGrad

AdaGrad adapts the learning rate to each parameter, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. Here $G_{ii}$ is the sum of the squares of the past gradients for parameter $i$.

**Mathematical Equation:**

$$\theta = \theta - \frac{\eta}{\sqrt{G_{ii} + \epsilon}} \nabla J(\theta)$$

**Keras Code:**

```python
from keras.optimizers import Adagrad

model.compile(optimizer=Adagrad(learning_rate=0.01), loss='mse')
```
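To make AdaGrad's per-parameter scaling concrete, here is a minimal NumPy sketch of the update rule above; it is illustrative only, and the function and variable names are chosen for this example rather than taken from Keras.

```python
import numpy as np

def adagrad_step(theta, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale the step."""
    grad_sq_sum = grad_sq_sum + grad ** 2                    # diagonal of G: running sum of g^2 per parameter
    theta = theta - lr * grad / np.sqrt(grad_sq_sum + eps)   # parameters with large past gradients take smaller steps
    return theta, grad_sq_sum

# Toy usage: minimise J(theta) = theta_1^2 + 10 * theta_2^2, whose gradient is (2*theta_1, 20*theta_2)
theta = np.array([1.0, 1.0])
acc = np.zeros_like(theta)
for _ in range(200):
    grad = np.array([2.0 * theta[0], 20.0 * theta[1]])
    theta, acc = adagrad_step(theta, grad, acc, lr=0.5)
print(theta)  # both coordinates shrink toward 0 despite very different gradient scales
```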
### RMSprop

RMSprop modifies AdaGrad to perform better in the non-convex setting by replacing the accumulated sum of squared gradients with an exponentially weighted moving average.

**Mathematical Equation:**

$$ E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2 $$
$$ \theta = \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla J(\theta) $$

**Keras Code:**

```python
from keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=0.001), loss='mse')
```

### Adam

Adam combines the advantages of two other extensions of SGD, AdaGrad and RMSprop, by keeping exponentially decaying averages of both past gradients (the first moment) and past squared gradients (the second moment).

**Mathematical Equation:**

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$
$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$
$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$
$$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
$$ \theta = \theta - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$

**Keras Code:**

```python
from keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
```

## Usage

To use these optimizers, include the relevant Keras snippet in your model compilation step. For example:

```python
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))
```

From 6dfae53accec7c522edd0a67d358e1ff7ec18eb7 Mon Sep 17 00:00:00 2001
From: shantnu <98252196+Shantnu-singh@users.noreply.github.com>
Date: Mon, 29 Jul 2024 11:23:19 +0530
Subject: [PATCH 2/2] Update Introduction.md for equation rendering

---
 .../Optimizers in Deep Learning/Introduction.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/docs/Deep Learning/Optimizers in Deep Learning/Introduction.md b/docs/Deep Learning/Optimizers in Deep Learning/Introduction.md
index cc6b8e268..57e1a49ce 100644
--- a/docs/Deep Learning/Optimizers in Deep Learning/Introduction.md
+++ b/docs/Deep Learning/Optimizers in Deep Learning/Introduction.md
@@ -42,7 +42,7 @@ SGD updates the weights for each training example, rather than at the end of eac
 
 **Mathematical Equation:**
 
-$$ \theta = \theta - \eta \nabla J(\theta; x^{(i)}; y^{(i)}) $$
+$$\theta = \theta - \eta \nabla J(\theta; x^{(i)}; y^{(i)})$$
 
 **Keras Code:**
 
@@ -75,7 +75,7 @@ AdaGrad adapts the learning rate to the parameters, performing larger updates fo
 
 **Mathematical Equation:**
 
-$$\theta = \theta - \frac{\eta}{\sqrt{G_{ii} + \epsilon}} \nabla J(\theta)$$
+$$ \theta = \theta - \frac{\eta}{\sqrt{G_{ii} + \epsilon}} \nabla J(\theta) $$
 
 **Keras Code:**
 
@@ -91,8 +91,7 @@ RMSprop modifies AdaGrad to perform better in the non-convex setting by changing
 
 **Mathematical Equation:**
 
-$$ E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2 $$
-$$ \theta = \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla J(\theta) $$
+$$\theta = \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla J(\theta)$$
 
 **Keras Code:**
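As a worked complement to the Adam equations in the file added above, here is a minimal NumPy sketch of a single Adam step, including the bias-correction terms; it is illustrative only, and the function and variable names are chosen for this example rather than taken from Keras.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta, given gradient grad at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction: m and v start at zero, so early estimates are too small
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([1.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)  # both entries move toward 0
```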