# fluid.optimizer¶

## AdagradOptimizer¶

class paddle.fluid.optimizer.AdagradOptimizer(learning_rate, epsilon=1e-06, regularization=None, name=None)

Adaptive Gradient (Adagrad) optimizer. The update is done as follows:

\begin{align}\begin{aligned}moment\_out &= moment + grad * grad\\param\_out &= param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align}

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have the epsilon attribute. It is added in our implementation, as also proposed at http://cs231n.github.io/neural-networks-3/#ada, for numerical stability to avoid division by zero.

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• epsilon (float) – a small float value for numerical stability.
• regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
• name – An optional name prefix.

Examples

optimizer.minimize(cost)
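The Adagrad update above can be sketched in NumPy. This is an illustrative re-implementation, not the fluid kernel; the function name and default hyperparameters are ours:

```python
import numpy as np

def adagrad_step(param, moment, grad, learning_rate=0.1, epsilon=1e-6):
    # Accumulate squared gradients, then scale each coordinate's step by
    # the inverse square root of its accumulator.
    moment_out = moment + grad * grad
    param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)
    return param_out, moment_out

param = np.array([1.0, 2.0])
moment = np.zeros(2)
grad = np.array([0.5, -0.5])
param, moment = adagrad_step(param, moment, grad)
```

On the first step the accumulator is just grad squared, so each coordinate moves by roughly learning_rate in the direction opposite its gradient sign.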

## AdamaxOptimizer¶

class paddle.fluid.optimizer.AdamaxOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None)

We implement the Adamax optimizer from Section 7 of the Adam paper (https://arxiv.org/abs/1412.6980). Adamax is a variant of the Adam algorithm based on the infinity norm.

\begin{align}\begin{aligned}t & = t + 1\\moment\_out & = {\beta}_1 * moment + (1 - {\beta}_1) * grad\\inf\_norm\_out & = max({\beta}_2 * inf\_norm + \epsilon, |grad|)\\learning\_rate & = \frac{learning\_rate}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}\end{aligned}\end{align}

The original paper does not have an epsilon attribute. It is added here for numerical stability to prevent division by zero.

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• beta1 (float) – The exponential decay rate for the 1st moment estimates.
• beta2 (float) – The exponential decay rate for the 2nd moment estimates.
• epsilon (float) – a small float value for numerical stability.
• regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
• name – An optional name prefix.

Examples

optimizer.minimize(cost)

Notes

Currently, AdamaxOptimizer doesn’t support sparse parameter optimization.
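The Adamax equations above can be sketched in NumPy (illustrative only; the function name and defaults are ours, and `t` starts at 1 as in the paper):

```python
import numpy as np

def adamax_step(param, moment, inf_norm, grad, t,
                learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Exponentially weighted first moment and infinity-norm accumulator.
    moment_out = beta1 * moment + (1 - beta1) * grad
    inf_norm_out = np.maximum(beta2 * inf_norm + epsilon, np.abs(grad))
    # Bias correction applies only to the first moment in Adamax.
    lr_t = learning_rate / (1 - beta1 ** t)
    param_out = param - lr_t * moment_out / inf_norm_out
    return param_out, moment_out, inf_norm_out

param = np.array([1.0])
moment, inf_norm = np.zeros(1), np.zeros(1)
param, moment, inf_norm = adamax_step(param, moment, inf_norm,
                                      np.array([2.0]), t=1)
```

Because the step is normalized by the infinity-norm accumulator, the very first update has magnitude learning_rate regardless of the gradient's scale.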

## AdamOptimizer¶

class paddle.fluid.optimizer.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None)

This implements the Adam optimizer from Section 2 of the Adam paper (https://arxiv.org/abs/1412.6980). Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.

\begin{align}\begin{aligned}t & = t + 1\\moment\_1\_out & = {\beta}_1 * moment\_1 + (1 - {\beta}_1) * grad\\moment\_2\_out & = {\beta}_2 * moment\_2 + (1 - {\beta}_2) * grad * grad\\learning\_rate & = learning\_rate * \frac{\sqrt{1 - {\beta}_2^t}}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_1\_out}{\sqrt{moment\_2\_out} + \epsilon}\end{aligned}\end{align}
Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• beta1 (float) – The exponential decay rate for the 1st moment estimates.
• beta2 (float) – The exponential decay rate for the 2nd moment estimates.
• epsilon (float) – a small float value for numerical stability.
• regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
• name – An optional name prefix.

Examples

optimizer.minimize(cost)
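The Adam update can be sketched in NumPy (illustrative only; the function name and defaults are ours, with `t` starting at 1):

```python
import numpy as np

def adam_step(param, m1, m2, grad, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    # First and second moment estimates with a bias-corrected step size.
    m1_out = beta1 * m1 + (1 - beta1) * grad
    m2_out = beta2 * m2 + (1 - beta2) * grad * grad
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    param_out = param - lr_t * m1_out / (np.sqrt(m2_out) + epsilon)
    return param_out, m1_out, m2_out

param = np.array([1.0])
m1, m2 = np.zeros(1), np.zeros(1)
param, m1, m2 = adam_step(param, m1, m2, np.array([4.0]), t=1)
```

After the bias correction, the first step has magnitude close to learning_rate, independent of the raw gradient scale.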

## DecayedAdagradOptimizer¶

class paddle.fluid.optimizer.DecayedAdagradOptimizer(learning_rate, decay=0.95, epsilon=1e-06, regularization=None, name=None)

Decayed Adagrad optimizer, a variant of Adagrad (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) with an exponentially decayed accumulator. The update is done as follows:

\begin{align}\begin{aligned}moment\_out & = decay * moment + (1 - decay) * grad * grad\\param\_out & = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align}

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added here for numerical stability to avoid division by zero.

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• decay (float) – decay rate.
• epsilon (float) – a small float value for numerical stability.
• regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
• name – An optional name prefix.

Examples

optimizer.minimize(cost)
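The decayed-accumulator update can be sketched in NumPy (illustrative only; the function name and defaults are ours):

```python
import numpy as np

def decayed_adagrad_step(param, moment, grad,
                         learning_rate=0.01, decay=0.95, epsilon=1e-6):
    # Exponentially decayed accumulation of squared gradients, so old
    # gradients are gradually forgotten, unlike plain Adagrad.
    moment_out = decay * moment + (1 - decay) * grad * grad
    param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)
    return param_out, moment_out

param, moment = np.array([1.0]), np.zeros(1)
param, moment = decayed_adagrad_step(param, moment, np.array([2.0]))
```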

Notes

Currently, DecayedAdagradOptimizer doesn’t support sparse parameter optimization.

## Ftrl¶

alias of FtrlOptimizer

## FtrlOptimizer¶

class paddle.fluid.optimizer.FtrlOptimizer(learning_rate, l1=0.0, l2=0.0, lr_power=-0.5, regularization=None, name=None)

\begin{align}\begin{aligned}&new\_accum = squared\_accum + grad^2\\&if (lr\_power == -0.5):\\&\quad linear\_accum += grad - \frac{\sqrt{new\_accum} - \sqrt{squared\_accum}}{learning\_rate} * param\\&else:\\&\quad linear\_accum += grad - \frac{new\_accum^{-lr\_power} - squared\_accum^{-lr\_power}}{learning\_rate} * param\\ &x = l1 * sign(linear\_accum) - linear\_accum\\&if (lr\_power == -0.5):\\&\quad y = \frac{\sqrt{new\_accum}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&else:\\&\quad y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&squared\_accum += grad^2\end{aligned}\end{align}
Parameters:
• learning_rate (float|Variable) – global learning rate.
• l1 (float) – L1 regularization strength.
• l2 (float) – L2 regularization strength.
• lr_power (float) – learning-rate power.
• regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
• name – An optional name prefix.
Raises:

ValueError – If learning_rate, l1, l2 or lr_power is None.

Examples

optimizer = fluid.optimizer.Ftrl(0.0001)

Notes

Currently, FtrlOptimizer doesn’t support sparse parameter optimization.
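The lr_power == -0.5 branch of the update can be sketched in NumPy. This is illustrative only; the function name and defaults are ours, and it follows the standard FTRL-Proximal form in which the accumulator difference multiplies param:

```python
import numpy as np

def ftrl_step(param, squared_accum, linear_accum, grad,
              learning_rate=0.1, l1=0.0, l2=0.0):
    # lr_power == -0.5 branch of FTRL-Proximal.
    new_accum = squared_accum + grad ** 2
    sigma = (np.sqrt(new_accum) - np.sqrt(squared_accum)) / learning_rate
    linear_accum = linear_accum + grad - sigma * param
    # Soft-threshold the linear accumulator: coordinates whose accumulated
    # signal does not exceed l1 are set exactly to zero.
    x = l1 * np.sign(linear_accum) - linear_accum
    y = np.sqrt(new_accum) / learning_rate + 2.0 * l2
    param = np.where(np.abs(linear_accum) > l1, x / y, 0.0)
    return param, new_accum, linear_accum

param = np.zeros(1)
squared_accum, linear_accum = np.zeros(1), np.zeros(1)
param, squared_accum, linear_accum = ftrl_step(
    param, squared_accum, linear_accum, np.array([1.0]))
```

With l1 > 0 the thresholding step is what gives FTRL sparse solutions.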

## ModelAverage¶

class paddle.fluid.optimizer.ModelAverage(average_window_rate, min_average_window=10000, max_average_window=10000, regularization=None, name=None)

Accumulate the average of parameters within a sliding window. The averaged results are saved in temporary variables, which can be applied to the parameter variables of the current model by calling the apply() method. The restore() method is used to restore the parameter values of the current model.

The size of average window is determined by average_window_rate, min_average_window, max_average_window and current update times.

Parameters:
• average_window_rate – The rate of average window.
• min_average_window – The minimum size of average window.
• max_average_window – The maximum size of average window.
• regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
• name – An optional name prefix.

Examples

optimizer = fluid.optimizer.Momentum()
optimizer.minimize(cost)
model_average = fluid.optimizer.ModelAverage(0.15,
                                             min_average_window=10000,
                                             max_average_window=20000)
for pass_id in range(args.pass_num):
    exe.run(fluid.default_main_program()...)

    with model_average.apply(exe):
        exe.run(inference_program...)
apply(*args, **kwds)

Apply average values to the parameters of the current model.

restore(executor)

Restore the parameter values of the current model.
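The windowed averaging idea can be illustrated with a NumPy sketch (a hypothetical simplification, not the fluid internals, which maintain running accumulators rather than a snapshot list):

```python
import numpy as np

def window_average(snapshots, window):
    # Average the most recent `window` parameter snapshots.
    recent = snapshots[-window:]
    return np.mean(recent, axis=0)

# Pretend parameter values recorded after training steps 1..5.
snapshots = [np.array([float(i)]) for i in range(1, 6)]
avg = window_average(snapshots, window=3)  # averages snapshots 3, 4, 5
```

Averaging over a window smooths out the noise of the final SGD iterates, which is why the averaged weights are applied only for inference and then restored.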

## Momentum¶

alias of MomentumOptimizer

## MomentumOptimizer¶

class paddle.fluid.optimizer.MomentumOptimizer(learning_rate, momentum, use_nesterov=False, regularization=None, name=None)

Simple Momentum optimizer with velocity state

This optimizer has a flag for Nesterov momentum.

The update equations are as follows:

\begin{align}\begin{aligned}& velocity = mu * velocity + gradient\\& if (use\_nesterov):\\&\quad param = param - (gradient + mu * velocity) * learning\_rate\\& else:\\&\quad param = param - learning\_rate * velocity\end{aligned}\end{align}
Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• momentum (float) – momentum factor.
• use_nesterov (bool) – enables Nesterov momentum.
• regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
• name – An optional name prefix.

Examples

optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
optimizer.minimize(cost)
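The momentum update, including the Nesterov branch, can be sketched in NumPy (illustrative only; the function name and defaults are ours):

```python
import numpy as np

def momentum_step(param, velocity, grad, learning_rate=0.1, mu=0.9,
                  use_nesterov=False):
    # Velocity accumulates gradients; the Nesterov variant looks one step
    # ahead along the updated velocity.
    velocity_out = mu * velocity + grad
    if use_nesterov:
        param_out = param - (grad + mu * velocity_out) * learning_rate
    else:
        param_out = param - learning_rate * velocity_out
    return param_out, velocity_out

param, velocity = np.array([1.0]), np.zeros(1)
param, velocity = momentum_step(param, velocity, np.array([1.0]))
```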

## RMSPropOptimizer¶

class paddle.fluid.optimizer.RMSPropOptimizer(learning_rate, rho=0.95, epsilon=1e-06, momentum=0.0, centered=False, regularization=None, name=None)

Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning rate method, proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

The original equation is as follows:

\begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\w & = w - \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{aligned}\end{align}

The first equation calculates the moving average of the squared gradient for each weight; the gradient is then divided by $\sqrt{r(w,t)}$.

In some cases, adding a momentum term $\beta$ is beneficial. In our implementation, Nesterov momentum is used:

\begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align}

if centered is True:

\begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\g(w, t) & = \rho g(w, t-1) + (1 - \rho)\nabla Q_{i}(w)\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align}

where $\rho$ is a hyperparameter with typical values such as 0.9 or 0.95, $\beta$ is the momentum term, and $\epsilon$ is a smoothing term to avoid division by zero, usually set somewhere in the range from 1e-4 to 1e-8.

Parameters:
• learning_rate (float) – global learning rate.
• rho (float) – $\rho$ in the equations, 0.95 by default.
• epsilon (float) – $\epsilon$ in the equations, a smoothing term to avoid division by zero, 1e-6 by default.
• momentum (float) – $\beta$ in the equations, the momentum term, 0.0 by default.
• centered (bool) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
• regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
• name – An optional name prefix.
Raises:

ValueError – If learning_rate, rho, epsilon, momentum are None.

Examples

optimizer = fluid.optimizer.RMSProp(0.0001)
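The momentum form of the update, with the optional centered variant, can be sketched in NumPy (illustrative only; the function name and defaults are ours):

```python
import numpy as np

def rmsprop_step(param, r, v, grad, eta=0.01, rho=0.95, beta=0.0,
                 epsilon=1e-6, centered=False, g=None):
    # Moving average of squared gradients; the centered variant subtracts
    # the squared moving average of the gradient itself, estimating variance.
    r = rho * r + (1 - rho) * grad ** 2
    if centered:
        g = rho * g + (1 - rho) * grad
        denom = np.sqrt(r - g ** 2 + epsilon)
    else:
        denom = np.sqrt(r + epsilon)
    v = beta * v + eta / denom * grad
    param = param - v
    return param, r, v, g

param = np.array([1.0])
r, v = np.zeros(1), np.zeros(1)
param, r, v, _ = rmsprop_step(param, r, v, np.array([1.0]))
```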


## SGD¶

alias of SGDOptimizer

## SGDOptimizer¶

class paddle.fluid.optimizer.SGDOptimizer(learning_rate, regularization=None, name=None)

Optimizer of the stochastic gradient descent algorithm:

$param\_out = param - learning\_rate * grad$
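The plain SGD rule above can be sketched in NumPy (illustrative only; the function name and default learning rate are ours):

```python
import numpy as np

def sgd_step(param, grad, learning_rate=0.01):
    # Plain stochastic gradient descent: step opposite the gradient.
    return param - learning_rate * grad

param = sgd_step(np.array([1.0, -2.0]), np.array([0.5, 0.5]))
```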