Optimizer¶
Training a neural network is in essence an optimization problem. After forward computation and back propagation, an Optimizer uses the back-propagated gradients to update the parameters of the network.
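A minimal sketch of the typical usage pattern, assuming the paddle.optimizer API; the model, data, and hyperparameter values here are purely illustrative:

```python
import paddle

# a tiny linear model used only for illustration
model = paddle.nn.Linear(10, 1)
opt = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())

x = paddle.randn([8, 10])
y = paddle.randn([8, 1])

loss = paddle.nn.functional.mse_loss(model(x), y)  # forward computation
loss.backward()                                    # back propagation
opt.step()                                         # update parameters with the gradients
opt.clear_grad()                                   # reset gradients before the next iteration
```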
1. Adam¶
Adam is an optimizer that adaptively adjusts the learning rate. It suits most non-convex optimization problems, large data sets, and high-dimensional scenarios, and is the most commonly used optimization algorithm.
API Reference: Adam
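A constructor sketch for illustration; the values shown are the usual defaults and are not prescriptive:

```python
import paddle

model = paddle.nn.Linear(10, 1)
# beta1 / beta2 control the exponential decay of the first and second moment estimates
opt = paddle.optimizer.Adam(learning_rate=0.001,
                            beta1=0.9,
                            beta2=0.999,
                            parameters=model.parameters())
```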
2. SGD¶
SGD is a subclass of Optimizer that implements Stochastic Gradient Descent, a variant of Gradient Descent. When training on a large number of samples, SGD is usually chosen to make the loss function converge more quickly.
API Reference: SGD
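For example, a plain SGD optimizer with a fixed learning rate (the value is illustrative):

```python
import paddle

model = paddle.nn.Linear(10, 1)
# vanilla stochastic gradient descent: parameters move against the gradient by a fixed step
opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
```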
3. Momentum¶
The Momentum optimizer adds momentum on top of SGD, reducing the noise inherent in stochastic gradient descent. You can set use_nesterov to False or True, corresponding to the classic Momentum algorithm (Section 4.1 in the thesis) and the Nesterov accelerated gradient algorithm (Section 4.2 in the thesis), respectively.
API Reference: Momentum
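A short sketch showing the use_nesterov switch (hyperparameter values are illustrative):

```python
import paddle

model = paddle.nn.Linear(10, 1)
# use_nesterov=False -> classic Momentum; use_nesterov=True -> Nesterov accelerated gradient
opt = paddle.optimizer.Momentum(learning_rate=0.01,
                                momentum=0.9,
                                use_nesterov=True,
                                parameters=model.parameters())
```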
4. AdamW¶
The AdamW optimizer is an improved version of Adam that decouples weight decay (regularization) from the gradient update, addressing the ineffectiveness of L2 regularization in Adam.
API Reference: AdamW
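A sketch of the decoupled weight decay setting (the coefficient value is illustrative):

```python
import paddle

model = paddle.nn.Linear(10, 1)
# weight_decay is applied directly to the parameters, decoupled from the gradient-based update
opt = paddle.optimizer.AdamW(learning_rate=0.001,
                             weight_decay=0.01,
                             parameters=model.parameters())
```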
5. Adagrad¶
The Adagrad optimizer adaptively assigns a different learning rate to each parameter, addressing the problem that different parameters are updated by unevenly distributed numbers of samples.
API Reference: Adagrad
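For illustration, a minimal Adagrad setup; epsilon is the usual numerical-stability term, and the values are illustrative:

```python
import paddle

model = paddle.nn.Linear(10, 1)
# each parameter's effective learning rate shrinks with its own accumulated squared gradients
opt = paddle.optimizer.Adagrad(learning_rate=0.01,
                               epsilon=1e-6,
                               parameters=model.parameters())
```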
6. RMSProp¶
The RMSProp optimizer also adapts the learning rate. It mainly addresses the problem of the learning rate decaying too sharply in the middle and late stages of training when Adagrad is used.
API Reference: RMSProp
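A minimal sketch; rho is the decay rate of the moving average of squared gradients, and the values are illustrative:

```python
import paddle

model = paddle.nn.Linear(10, 1)
# squared gradients are averaged with decay rate rho, so the learning rate does not collapse
opt = paddle.optimizer.RMSProp(learning_rate=0.01,
                               rho=0.95,
                               epsilon=1e-6,
                               parameters=model.parameters())
```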
7. Adamax¶
Adamax is a variant of the Adam algorithm that gives a simpler bound on the learning rate, in particular its upper bound.
API Reference: Adamax
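A minimal sketch with the usual Adam-style moment coefficients (values are illustrative):

```python
import paddle

model = paddle.nn.Linear(10, 1)
# Adamax bounds the update with an infinity norm instead of Adam's second-moment estimate
opt = paddle.optimizer.Adamax(learning_rate=0.001,
                              beta1=0.9,
                              beta2=0.999,
                              parameters=model.parameters())
```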
8. Lamb¶
The Lamb optimizer aims to increase the batch size during training without loss of accuracy, supporting adaptive element-wise updates and accurate layer-wise correction.
API Reference: Lamb
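A hedged sketch; lamb_weight_decay below is assumed to be the layer-wise weight decay coefficient, and the values are illustrative:

```python
import paddle

model = paddle.nn.Linear(10, 1)
# LAMB rescales each layer's update by the ratio of the parameter norm to the update norm
opt = paddle.optimizer.Lamb(learning_rate=0.002,
                            lamb_weight_decay=0.01,
                            parameters=model.parameters())
```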
9. NAdam¶
The NAdam optimizer is based on Adam, combining the advantages of Nesterov momentum and Adam's adaptive learning rate.
API Reference: NAdam
10. RAdam¶
The RAdam optimizer improves upon Adam by introducing an adaptive learning-rate warm-up strategy, which stabilizes the early stage of training.
API Reference: RAdam
11. ASGD¶
The ASGD optimizer is a variant of SGD that trades space for time; it is a stochastic optimization method based on trajectory averaging. On top of SGD, ASGD maintains an average of the historical parameter values, so that the variance of the noise along the descent direction keeps shrinking and the algorithm eventually converges to the optimum at a linear rate.
API Reference: ASGD
12. Rprop¶
The Rprop optimizer recognizes that gradient magnitudes can differ greatly across weight parameters, which makes a single global step size hard to choose. It therefore accelerates optimization by dynamically adjusting a per-parameter step size based only on the sign of each parameter's gradient.
API Reference: Rprop
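NAdam, RAdam, ASGD, and Rprop are constructed like the other optimizers above. The sketch below assumes they accept the common learning_rate and parameters arguments, with all other options left at their defaults; the values are illustrative:

```python
import paddle

model = paddle.nn.Linear(10, 1)

# only the update rule differs; construction follows the shared optimizer interface
nadam = paddle.optimizer.NAdam(learning_rate=0.001, parameters=model.parameters())
radam = paddle.optimizer.RAdam(learning_rate=0.001, parameters=model.parameters())
asgd = paddle.optimizer.ASGD(learning_rate=0.001, parameters=model.parameters())
rprop = paddle.optimizer.Rprop(learning_rate=0.001, parameters=model.parameters())
```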