fluid.optimizer

Adadelta

paddle.fluid.optimizer.Adadelta

alias of paddle.fluid.optimizer.AdadeltaOptimizer

Adagrad

paddle.fluid.optimizer.Adagrad

alias of paddle.fluid.optimizer.AdagradOptimizer

AdagradOptimizer

class paddle.fluid.optimizer.AdagradOptimizer(learning_rate, epsilon=1e-06, regularization=None, name=None, initial_accumulator_value=0.0)[source]

The Adaptive Gradient optimizer (Adagrad for short) can adaptively assign different learning rates to individual parameters.

The update rule for parameter param_out with gradient grad is:

\[ \begin{align}\begin{aligned}moment\_out &= moment + grad * grad\\param\_out &= param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align} \]

Related paper: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

The original paper does not have the epsilon attribute. It is added here in our implementation, as also proposed in Per-parameter adaptive learning rate methods, for numerical stability to avoid the division by zero error.
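
To make the update rule concrete, here is a minimal NumPy sketch of a single Adagrad step following the formula above (illustrative only; the arrays and values are made up for this example):

import numpy as np

learning_rate, epsilon = 0.2, 1e-06
param = np.array([1.0, 2.0], dtype=np.float32)
moment = np.zeros_like(param)                  # accumulated squared gradients
grad = np.array([0.5, -0.1], dtype=np.float32)

# moment_out = moment + grad * grad
moment_out = moment + grad * grad
# param_out = param - learning_rate * grad / (sqrt(moment_out) + epsilon)
param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)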

Parameters
  • learning_rate (float|Variable) – The learning rate used to update Parameter. It can be a float value or a Variable with a float type.

  • epsilon (float, optional) – A small float value for numerical stability. The default value is 1e-06.

  • regularization (WeightDecayRegularizer, optional) – A Regularizer, such as L2DecayRegularizer. The default value is None.

  • name (str, optional) – Normally there is no need for the user to set this property. For more information, please refer to api_guide_Name. The default value is None.

  • initial_accumulator_value (float, optional) – Initial value for moment accumulator. The default value is 0.0.

Examples

import numpy as np
import paddle.fluid as fluid

np_inp = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
inp = fluid.data(name="inp", shape=[2, 2])
out = fluid.layers.fc(inp, size=3)
out = fluid.layers.reduce_sum(out)
optimizer = fluid.optimizer.AdagradOptimizer(learning_rate=0.2)
optimizer.minimize(out)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.run(
    feed={"inp": np_inp},
    fetch_list=[out.name])
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, doing auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – Gradient clipping strategy. Static graph mode does not need this argument; currently it only supports gradient clipping in dygraph mode. This argument may be adjusted in the future. The default value is None.

Returns

tuple (optimize_ops, params_grads), A list of operators appended by minimize and a list of (param, grad) variable pairs, param is Parameter, grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed.

Parameters

state_dict (dict) – Dict containing all the Variables needed by the optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get state dict information from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (via the minimize function), the state_dict is empty.

Args: None

Returns

A dict containing all the Variables used by the optimizer

Return type

state_dict (dict)

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

Adam

paddle.fluid.optimizer.Adam

alias of paddle.fluid.optimizer.AdamOptimizer

Adamax

paddle.fluid.optimizer.Adamax

alias of paddle.fluid.optimizer.AdamaxOptimizer

AdamaxOptimizer

class paddle.fluid.optimizer.AdamaxOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None)[source]

The Adamax optimizer is implemented based on the Adamax optimization in Section 7 of the Adam paper. The Adamax algorithm is a variant of the Adam algorithm based on the infinity norm, which makes the learning rate update algorithm more stable and simpler.

The update rule for parameter param_out with gradient grad is:

\[ \begin{align}\begin{aligned}t & = t + 1\\moment\_out & = {\beta}_1 * moment + (1 - {\beta}_1) * grad\\inf\_norm\_out & = max({\beta}_2 * inf\_norm + \epsilon, |grad|)\\learning\_rate & = \frac{learning\_rate}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}\end{aligned}\end{align} \]

Related paper: Adam: A Method for Stochastic Optimization

The original paper does not have an epsilon attribute, it is added here for numerical stability to prevent the division by 0 error.
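
As an illustration of the formulas above, here is a minimal NumPy sketch of one Adamax step (the values are made up; this is not the library implementation):

import numpy as np

learning_rate, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-08
param = np.array([1.0, -2.0], dtype=np.float32)
moment = np.zeros_like(param)        # 1st moment estimate
inf_norm = np.zeros_like(param)      # infinity norm accumulator
grad = np.array([0.3, 0.1], dtype=np.float32)
t = 1                                # time step

moment_out = beta1 * moment + (1 - beta1) * grad
inf_norm_out = np.maximum(beta2 * inf_norm + epsilon, np.abs(grad))
lr_t = learning_rate / (1 - beta1 ** t)
param_out = param - lr_t * moment_out / inf_norm_out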

Parameters
  • learning_rate (float|Variable, optional) – The learning rate used to update Parameter. It can be a float value or a Variable with a float type. The default value is 0.001.

  • beta1 (float, optional) – The exponential decay rate for the 1st moment estimates. The default value is 0.9.

  • beta2 (float, optional) – The exponential decay rate for the 2nd moment estimates. The default value is 0.999.

  • epsilon (float, optional) – A small float value for numerical stability. The default value is 1e-08.

  • regularization (WeightDecayRegularizer, optional) – A Regularizer, such as L2DecayRegularizer. The default value is None.

  • name (str, optional) – Normally there is no need for the user to set this property. For more information, please refer to api_guide_Name. The default value is None.

Notes:

Currently, AdamaxOptimizer doesn’t support sparse parameter optimization.

Examples

import paddle.fluid as fluid
import numpy

# First create the Executor.
place = fluid.CPUPlace() # fluid.CUDAPlace(0)
exe = fluid.Executor(place)

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    data = fluid.data(name='X', shape=[None, 1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    adam = fluid.optimizer.AdamaxOptimizer(learning_rate=0.2)
    adam.minimize(loss)

# Run the startup program once and only once.
exe.run(startup_program)

x = numpy.random.random(size=(10, 1)).astype('float32')
outs = exe.run(program=train_program,
               feed={'X': x},
               fetch_list=[loss.name])
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, doing auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – Gradient clipping strategy. Static graph mode does not need this argument; currently it only supports gradient clipping in dygraph mode. This argument may be adjusted in the future. The default value is None.

Returns

tuple (optimize_ops, params_grads), A list of operators appended by minimize and a list of (param, grad) variable pairs, param is Parameter, grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed.

Parameters

state_dict (dict) – Dict containing all the Variables needed by the optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get state dict information from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (via the minimize function), the state_dict is empty.

Args: None

Returns

A dict containing all the Variables used by the optimizer

Return type

state_dict (dict)

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

AdamOptimizer

class paddle.fluid.optimizer.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None, lazy_mode=False)[source]

The Adam optimizer uses the optimization described at the end of Section 2 of the Adam paper; it can dynamically adjust the learning rate of each parameter using the 1st moment estimates and the 2nd moment estimates of the gradient.

The update rule for parameter param_out with gradient grad is:

\[ \begin{align}\begin{aligned}t & = t + 1\\moment\_1\_out & = {\beta}_1 * moment\_1 + (1 - {\beta}_1) * grad\\moment\_2\_out & = {\beta}_2 * moment\_2 + (1 - {\beta}_2) * grad * grad\\learning\_rate & = learning\_rate * \ \frac{\sqrt{1 - {\beta}_2^t}}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_1}{\sqrt{moment\_2} + \epsilon}\end{aligned}\end{align} \]

Related paper: Adam: A Method for Stochastic Optimization
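
A minimal NumPy sketch of one Adam step following the update rule above (illustrative only; the final line uses the updated moments):

import numpy as np

learning_rate, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-08
param = np.array([0.5, 1.5], dtype=np.float32)
moment_1 = np.zeros_like(param)      # 1st moment estimate
moment_2 = np.zeros_like(param)      # 2nd moment estimate
grad = np.array([0.2, -0.4], dtype=np.float32)
t = 1                                # time step

moment_1_out = beta1 * moment_1 + (1 - beta1) * grad
moment_2_out = beta2 * moment_2 + (1 - beta2) * grad * grad
lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
param_out = param - lr_t * moment_1_out / (np.sqrt(moment_2_out) + epsilon)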

Parameters
  • learning_rate (float|Variable, optional) – The learning rate used to update Parameter. It can be a float value or a Variable with a float type. The default value is 0.001.

  • beta1 (float, optional) – The exponential decay rate for the 1st moment estimates. The default value is 0.9.

  • beta2 (float, optional) – The exponential decay rate for the 2nd moment estimates. The default value is 0.999.

  • epsilon (float, optional) – A small float value for numerical stability. The default value is 1e-08.

  • regularization (WeightDecayRegularizer, optional) – A Regularizer, such as L2DecayRegularizer. The default value is None.

  • name (str, optional) – Normally there is no need for the user to set this property. For more information, please refer to api_guide_Name. The default value is None.

  • lazy_mode (bool, optional) – The official Adam algorithm has two moving-average accumulators, which are updated at every step. Every element of the two moving averages is updated in both dense mode and sparse mode. If the size of the parameter is very large, then the update may be very slow. Lazy mode only updates the elements that have gradients in the current mini-batch, so it is much faster. But this mode has different semantics from the original Adam algorithm and may lead to different results. The default value is False.

Examples

import paddle
import paddle.fluid as fluid

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.data(name='x', shape=[None, 13], dtype='float32')
    y = fluid.data(name='y', shape=[None, 1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    adam_optimizer = fluid.optimizer.AdamOptimizer(0.01)
    adam_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, doing auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – Gradient clipping strategy. Static graph mode does not need this argument; currently it only supports gradient clipping in dygraph mode. This argument may be adjusted in the future. The default value is None.

Returns

tuple (optimize_ops, params_grads), A list of operators appended by minimize and a list of (param, grad) variable pairs, param is Parameter, grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed.

Parameters

state_dict (dict) – Dict containing all the Variables needed by the optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get state dict information from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (via the minimize function), the state_dict is empty.

Args: None

Returns

A dict containing all the Variables used by the optimizer

Return type

state_dict (dict)

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

DecayedAdagrad

paddle.fluid.optimizer.DecayedAdagrad

alias of paddle.fluid.optimizer.DecayedAdagradOptimizer

DecayedAdagradOptimizer

class paddle.fluid.optimizer.DecayedAdagradOptimizer(learning_rate, decay=0.95, epsilon=1e-06, regularization=None, name=None)[source]

The Decayed Adagrad optimizer can be seen as an Adagrad algorithm that introduces the decay rate to solve the problem of a sharp drop in the learning rate during model training when using the AdagradOptimizer.

The update rule for parameter param_out with gradient grad is:

\[ \begin{align}\begin{aligned}moment\_out & = decay * moment + (1 - decay) * grad * grad\\param\_out & = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align} \]

Related paper: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

The original paper does not have an epsilon attribute. It is added here for numerical stability to avoid the division by zero error.
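
A minimal NumPy sketch of one Decayed Adagrad step following the update rule above (illustrative values only):

import numpy as np

learning_rate, decay, epsilon = 0.2, 0.95, 1e-06
param = np.array([1.0, 2.0], dtype=np.float32)
moment = np.zeros_like(param)                  # decayed accumulator of squared gradients
grad = np.array([0.5, -0.1], dtype=np.float32)

moment_out = decay * moment + (1 - decay) * grad * grad
param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)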

Parameters
  • learning_rate (float|Variable) – The learning rate used to update Parameter. It can be a float value or a Variable with a float type.

  • decay (float, optional) – The decay rate. The default value is 0.95.

  • epsilon (float, optional) – A small float value for numerical stability. The default value is 1e-06.

  • regularization (WeightDecayRegularizer, optional) – A Regularizer, such as L2DecayRegularizer. The default value is None.

  • name (str, optional) – Normally there is no need for the user to set this property. For more information, please refer to api_guide_Name. The default value is None.

Notes:

Currently, DecayedAdagradOptimizer doesn’t support sparse parameter optimization.

Examples

import paddle.fluid as fluid

x = fluid.data( name='x', shape=[None, 10], dtype='float32' )
trans = fluid.layers.fc( x, 100 )
cost = fluid.layers.reduce_mean( trans )
optimizer = fluid.optimizer.DecayedAdagradOptimizer(learning_rate=0.2)
optimizer.minimize(cost)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, doing auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – Gradient clipping strategy. Static graph mode does not need this argument; currently it only supports gradient clipping in dygraph mode. This argument may be adjusted in the future. The default value is None.

Returns

tuple (optimize_ops, params_grads), A list of operators appended by minimize and a list of (param, grad) variable pairs, param is Parameter, grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed.

Parameters

state_dict (dict) – Dict containing all the Variables needed by the optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get state dict information from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (via the minimize function), the state_dict is empty.

Args: None

Returns

A dict containing all the Variables used by the optimizer

Return type

state_dict (dict)

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

DGCMomentumOptimizer

class paddle.fluid.optimizer.DGCMomentumOptimizer(learning_rate, momentum, rampup_begin_step, rampup_step=1, sparsity=[0.999], use_nesterov=False, local_grad_clip_norm=None, num_trainers=None, regularization=None, name=None)[source]

DGC (Deep Gradient Compression) Momentum Optimizer. Original paper is https://arxiv.org/abs/1712.01887

DGC reduces the communication bandwidth by sending only the important gradients (sparse update): only gradients larger than a threshold are transmitted.

To avoid losing information, DGC accumulates the rest of the gradients locally.

Eventually, these gradients become large enough to be transmitted.

Thus, DGC sends the large gradients immediately but eventually sends all of the gradients over time.

To ensure no loss of accuracy, DGC employs momentum correction and local gradient clipping on top of the gradient sparsification to maintain model performance.

DGC also uses momentum factor masking and warmup training to overcome the staleness problem caused by reduced communication.

This optimizer will do two things:

  1. Compress the gradient by getting the top-k important values from the gradient tensor and using them for allreduce to reduce network bandwidth.

  2. Call momentum to optimize the cost.

Parameters
  • learning_rate (float|Variable) – The learning rate used to update parameters. It can be a float value or a Variable with one float value as a data element.

  • momentum (float) – Momentum factor.

  • rampup_begin_step (int) – The beginning step from which gradient compression is implemented.

  • rampup_step (int) – Time steps used in sparsity warm-up periods. Default is 1. For example, if the sparsity is [0.75, 0.9375, 0.984375, 0.996, 0.999], and the rampup_step is 100, it will use 0.75 at steps 0~19, 0.9375 at steps 20~39, and so on. When the end of the sparsity array is reached, it will use 0.999 from then on (see the sketch after this parameter list).

  • sparsity (list[float]) – Keep the top important elements of the gradient tensor; the ratio kept is (1 - current sparsity). Default is [0.999]. For example, if the sparsity is [0.99, 0.999], the top [1%, 0.1%] important elements will be transmitted.

  • use_nesterov (bool) – Enables Nesterov momentum. True means use Nesterov. Default is False.

  • local_grad_clip_norm (float, optional) – Local gradient clip norm value. Optional, default is None, which means no clipping is needed.

  • num_trainers (int, optional) – The number of training nodes. Optional, default is None.

  • regularization (WeightDecayRegularizer, optional) – A Regularizer, such as L2DecayRegularizer. Optional, default is None.

  • name (str, optional) – This parameter is used by developers to print debugging information. For details, please refer to api_guide_Name. Default is None.
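
The warm-up schedule described for rampup_step and sparsity can be illustrated with a small helper. This is a sketch of the documented behaviour only; the helper is hypothetical and not part of the API, and the handling of steps before rampup_begin_step is an assumption:

def current_sparsity(step, rampup_begin_step, rampup_step, sparsity):
    # Assumption: no compression before the warm-up begins.
    if step < rampup_begin_step:
        return 0.0
    progress = step - rampup_begin_step
    # After the warm-up, keep using the last sparsity value.
    if progress >= rampup_step:
        return sparsity[-1]
    # Each sparsity value covers an equal slice of the warm-up period.
    interval = rampup_step // len(sparsity)
    return sparsity[min(progress // interval, len(sparsity) - 1)]

# With sparsity=[0.75, 0.9375, 0.984375, 0.996, 0.999] and rampup_step=100,
# warm-up steps 0~19 use 0.75, steps 20~39 use 0.9375, and so on.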

Examples

import paddle.fluid as fluid
optimizer = fluid.optimizer.DGCMomentumOptimizer(
            learning_rate=0.0001,
            momentum=0.9,
            rampup_step=1000,
            rampup_begin_step=1252,
            sparsity=[0.999, 0.999])
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, doing auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – Gradient clipping strategy. Static graph mode does not need this argument; currently it only supports gradient clipping in dygraph mode. This argument may be adjusted in the future. The default value is None.

Returns

tuple (optimize_ops, params_grads), A list of operators appended by minimize and a list of (param, grad) variable pairs, param is Parameter, grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed.

Parameters

state_dict (dict) – Dict containing all the Variables needed by the optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get state dict information from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (via the minimize function), the state_dict is empty.

Args: None

Returns

A dict containing all the Variables used by the optimizer

Return type

state_dict (dict)

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

ExponentialMovingAverage

class paddle.fluid.optimizer.ExponentialMovingAverage(decay=0.999, thres_steps=None, name=None)[source]

Compute the moving average of parameters with exponential decay. Given a parameter \(\theta\), its exponential moving average (EMA) will be

\[ \begin{align}\begin{aligned}\text{EMA}_0 & = 0\\\text{EMA}_t & = \text{decay} * \text{EMA}_{t-1} + (1 - \text{decay}) * \theta_t\end{aligned}\end{align} \]

The average results calculated by the update() method will be saved in temporary variables which are created and maintained by the object, and can be applied to the parameters of the current model by calling the apply() method. The restore() method is used to restore the parameters.

Bias correction. All EMAs are initialized to \(0\) and hence are zero biased, which can be corrected by dividing by a factor \((1 - \text{decay}^t)\), i.e., the actual EMAs applied to parameters when calling the apply() method would be

\[\widehat{\text{EMA}}_t = \frac{\text{EMA}_t}{1 - \text{decay}^t}\]

Decay rate scheduling. A large decay rate very close to 1 would cause the averages to move very slowly. A better strategy is to set a relatively small decay rate in the very beginning. The argument thres_steps allows users to pass a Variable to schedule the decay rate; in this case, the actual decay rate becomes

\[\min(\text{decay}, \frac{1 + \text{thres_steps}}{10 + \text{thres_steps}})\]

Usually thres_steps can be the global training steps.
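
A quick sketch of that scheduled decay rate in plain Python, to show how the effective decay grows from a small value toward decay as training progresses (the step values are chosen just for illustration):

decay = 0.999
for thres_steps in [0, 10, 100, 10000]:
    scheduled = min(decay, (1.0 + thres_steps) / (10.0 + thres_steps))
    print(thres_steps, scheduled)
# 0     -> 0.1
# 10    -> 0.55
# 100   -> ~0.918
# 10000 -> 0.999 (capped by decay)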

Parameters
  • decay (float, optional) – The exponential decay rate, usually close to 1, such as 0.999, 0.9999, … . Default 0.999.

  • thres_steps (Variable|None) – If not None, schedule the decay rate. Default None.

  • name (str|None) – For detailed information, please refer to api_guide_Name. Usually name does not need to be set and is None by default.

Examples

import numpy
import paddle
import paddle.fluid as fluid

data = fluid.data(name='x', shape=[-1, 5], dtype='float32')
hidden = fluid.layers.fc(input=data, size=10)
cost = fluid.layers.mean(hidden)

test_program = fluid.default_main_program().clone(for_test=True)

optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(cost)

global_steps = fluid.layers.learning_rate_scheduler._decay_step_counter()
ema = fluid.optimizer.ExponentialMovingAverage(0.999, thres_steps=global_steps)
ema.update()

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

for pass_id in range(3):
    for batch_id in range(6):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=fluid.default_main_program(),
            feed={'x': data},
            fetch_list=[cost.name])

    # usage 1
    with ema.apply(exe):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=test_program,
                feed={'x': data},
                fetch_list=[hidden.name])


    # usage 2
    with ema.apply(exe, need_restore=False):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=test_program,
                feed={'x': data},
                fetch_list=[hidden.name])
    ema.restore(exe)
update()

Update the Exponential Moving Average. This method should only be called in the train program.

apply(executor, need_restore=True)

Apply moving average to parameters for evaluation.

Parameters
  • executor (Executor) – The Executor to execute applying.

  • need_restore (bool, optional) – Whether to restore parameters after applying. Default True.

restore(executor)

Restore parameters.

Parameters

executor (Executor) – The Executor to execute restoring.

Ftrl

paddle.fluid.optimizer.Ftrl

alias of paddle.fluid.optimizer.FtrlOptimizer

FtrlOptimizer

class paddle.fluid.optimizer.FtrlOptimizer(learning_rate, l1=0.0, l2=0.0, lr_power=-0.5, regularization=None, name=None)[source]

FTRL (Follow The Regularized Leader) Optimizer.

The paper that proposed Follow The Regularized Leader (FTRL): (https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf)

\[ \begin{align}\begin{aligned}&new\_accum = squared\_accum + grad^2\\&if (lr\_power == -0.5):\\&\quad linear\_accum += grad - \frac{\sqrt{new\_accum} - \sqrt{squared\_accum}}{learning\_rate * param}\\&else:\\&\quad linear\_accum += grad - \frac{new\_accum^{-lr\_power} - accum^{-lr\_power}}{learning\_rate * param}\\ &x = l1 * sign(linear\_accum) - linear\_accum\\&if (lr\_power == -0.5):\\&\quad y = \frac{\sqrt{new\_accum}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&else:\\&\quad y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&squared\_accum += grad^2\end{aligned}\end{align} \]
Parameters
  • learning_rate (float|Variable) – Global learning rate.

  • l1 (float) – L1 regularization strength, default is 0.0.

  • l2 (float) – L2 regularization strength, default is 0.0.

  • lr_power (float) – Learning Rate Power, default is -0.5.

  • regularization – A Regularizer, such as L2DecayRegularizer. Optional, default is None.

  • name (str, optional) – This parameter is used by developers to print debugging information. For details, please refer to api_guide_Name. Default is None.

Raises

ValueError – If learning_rate, rho, epsilon, momentum are None.

Examples

import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    ftrl_optimizer = fluid.optimizer.Ftrl(learning_rate=0.1)
    ftrl_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)

Note

Currently, FtrlOptimizer doesn’t support sparse parameter optimization.

apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, doing auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – Gradient clipping strategy. Static graph mode does not need this argument; currently it only supports gradient clipping in dygraph mode. This argument may be adjusted in the future. The default value is None.

Returns

tuple (optimize_ops, params_grads), A list of operators appended by minimize and a list of (param, grad) variable pairs, param is Parameter, grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed.

Parameters

state_dict (dict) – Dict containing all the Variables needed by the optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get state dict information from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (via the minimize function), the state_dict is empty.

Args: None

Returns

A dict containing all the Variables used by the optimizer

Return type

state_dict (dict)

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

LambOptimizer

class paddle.fluid.optimizer.LambOptimizer(learning_rate=0.001, lamb_weight_decay=0.01, beta1=0.9, beta2=0.999, epsilon=1e-06, regularization=None, exclude_from_weight_decay_fn=None, name=None)[source]

LAMB (Layer-wise Adaptive Moments optimizer for Batching training) Optimizer.

The LAMB optimizer is designed to scale up the batch size of training without losing accuracy; it supports adaptive element-wise updating and accurate layer-wise correction. For more information, please refer to Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

The updating of parameters follows:

\[ \begin{align}\begin{aligned}m_t &= \beta_1 m_{t - 1}+ (1 - \beta_1)g_t\\v_t &= \beta_2 v_{t - 1} + (1 - \beta_2)g_t^2\\r_t &= \frac{m_t}{\sqrt{v_t}+\epsilon}\\w_t &= w_{t-1} -\eta_t \frac{\left \| w_{t-1}\right \|}{\left \| r_t + \lambda w_{t-1}\right \|} (r_t + \lambda w_{t-1})\end{aligned}\end{align} \]

where \(m\) is the 1st moment, and \(v\) the 2nd moment, \(\eta\) the learning rate, \(\lambda\) the LAMB weight decay rate.
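
A minimal NumPy sketch of one LAMB step following the equations above (illustrative values only; lamb_weight_decay plays the role of the weight decay rate and eta the learning rate):

import numpy as np

eta, beta1, beta2, epsilon, lamb_weight_decay = 0.001, 0.9, 0.999, 1e-06, 0.01
w = np.array([0.5, -1.0], dtype=np.float32)    # parameter
m = np.zeros_like(w)                           # 1st moment
v = np.zeros_like(w)                           # 2nd moment
g = np.array([0.2, 0.4], dtype=np.float32)     # gradient

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g * g
r = m / (np.sqrt(v) + epsilon)
update = r + lamb_weight_decay * w
w = w - eta * (np.linalg.norm(w) / np.linalg.norm(update)) * update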

Parameters
  • learning_rate (float|Variable, optional) – the learning rate used to update parameters. Can be a float value or a Variable with data type float32. Default 0.001.

  • lamb_weight_decay (float, optional) – The LAMB weight decay rate. Default 0.01.

  • beta1 (float, optional) – The exponential decay rate for the 1st moment estimates. Default 0.9.

  • beta2 (float, optional) – The exponential decay rate for the 2nd moment estimates. Default 0.999.

  • epsilon (float, optional) – A small float value for numerical stability. Default 1e-6.

  • regularization (Regularizer|None) – A Regularizer, such as fluid.regularizer.L1DecayRegularizer. Default None.

  • exclude_from_weight_decay_fn (function|None) – Exclude a parameter from weight decay when exclude_from_weight_decay_fn(parameter) returns true. Default None.

  • name (str|None) – For detailed information, please refer to api_guide_Name. Usually name does not need to be set and is None by default.

Examples

import paddle.fluid as fluid

data = fluid.data(name='x', shape=[-1, 5], dtype='float32')
hidden = fluid.layers.fc(input=data, size=10)
cost = fluid.layers.mean(hidden)

def exclude_fn(param):
    return param.name.endswith('.b_0')

optimizer = fluid.optimizer.Lamb(learning_rate=0.002,
                                 exclude_from_weight_decay_fn=exclude_fn)
optimizer.minimize(cost)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, doing auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – Gradient clipping strategy. Static graph mode does not need this argument; currently it only supports gradient clipping in dygraph mode. This argument may be adjusted in the future. The default value is None.

Returns

tuple (optimize_ops, params_grads), A list of operators appended by minimize and a list of (param, grad) variable pairs, param is Parameter, grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed.

Parameters

state_dict (dict) – Dict containing all the Variables needed by the optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get state dict information from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (via the minimize function), the state_dict is empty.

Args: None

Returns

A dict containing all the Variables used by the optimizer

Return type

state_dict (dict)

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

LarsMomentum

paddle.fluid.optimizer.LarsMomentum

alias of paddle.fluid.optimizer.LarsMomentumOptimizer

LarsMomentumOptimizer

class paddle.fluid.optimizer.LarsMomentumOptimizer(learning_rate, momentum, lars_coeff=0.001, lars_weight_decay=0.0005, regularization=None, name=None)[source]

Momentum optimizer with LARS support

The update equations are as follows:

\[ \begin{align}\begin{aligned}& local\_learning\_rate = learning\_rate * lars\_coeff * \ \frac{||param||}{||gradient|| + lars\_weight\_decay * ||param||}\\& velocity = mu * velocity + local\_learning\_rate * (gradient + lars\_weight\_decay * param)\\& param = param - velocity\end{aligned}\end{align} \]
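
A minimal NumPy sketch of one update following the equations above (mu denotes the momentum factor; the values are made up for illustration):

import numpy as np

learning_rate, mu = 0.001, 0.9
lars_coeff, lars_weight_decay = 0.001, 0.0005
param = np.array([1.0, 2.0], dtype=np.float32)
velocity = np.zeros_like(param)
gradient = np.array([0.1, -0.3], dtype=np.float32)

local_learning_rate = learning_rate * lars_coeff * np.linalg.norm(param) / (
    np.linalg.norm(gradient) + lars_weight_decay * np.linalg.norm(param))
velocity = mu * velocity + local_learning_rate * (gradient + lars_weight_decay * param)
param = param - velocity
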
Parameters
  • learning_rate (float|Variable) – The learning rate used to update parameters. Can be a float value or a Variable with one float value as a data element.

  • momentum (float) – Momentum factor.

  • lars_coeff (float) – Defines how much we trust the layer to change its weights.

  • lars_weight_decay (float) – Weight decay coefficient for decaying using LARS.

  • regularization – A Regularizer, such as L2DecayRegularizer. Optional, default is None.

  • name (str, optional) – This parameter is used by developers to print debugging information. For details, please refer to api_guide_Name. Default is None.

Examples

import paddle.fluid as fluid
import numpy as np

np_inp = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
inp = fluid.layers.data(
    name="inp", shape=[2, 2], append_batch_size=False)
out = fluid.layers.fc(inp, size=3)
out = fluid.layers.reduce_sum(out)
optimizer = fluid.optimizer.LarsMomentumOptimizer(learning_rate=0.001, momentum=0.9)
optimizer.minimize(out)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.run(
    feed={"inp": np_inp},
    fetch_list=[out.name])
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, doing auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – The gradient clipping strategy. It is not needed in static graph mode; currently, this argument only supports gradient clipping in dygraph mode. In the future, this argument may be adjusted. The default value is None.

Returns

tuple (optimize_ops, params_grads): a list of operators appended by minimize and a list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed accordingly.

Parameters

state_dict (dict) – Dict contains all the Variable needed by optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get the state dict from the optimizer. It contains all the variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (that is, minimize has never been invoked), the state dict is empty.

Parameters

None

Returns

A dict containing all the variables used by the optimizer.

Return type

dict

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

ModelAverage

class paddle.fluid.optimizer.ModelAverage(average_window_rate, min_average_window=10000, max_average_window=10000, regularization=None, name=None)[source]

The ModelAverage optimizer accumulates Parameter values over a continuous range of recent training steps. The accumulated historical range is controlled by the average_window_rate argument. The averaged Parameters are used for prediction, which usually improves prediction accuracy.

The average of the Parameters within the sliding window is accumulated and saved in a temporary variable. It can be applied to the current model's Parameters by calling the apply() method, and the current model Parameters can be restored by calling the restore() method.

The window size for calculating the average is determined by average_window_rate, min_average_window, max_average_window and the current Parameter update times (num_updates).

When the cumulative count (num_accumulates) exceeds the window threshold (average_window), the accumulated Parameter temporary variable is set to 0.0. The following pseudo-code helps to understand the role of these arguments:

if num_accumulates >= min_average_window and num_accumulates >= min(max_average_window, num_updates * average_window_rate):
    num_accumulates = 0

In the conditional statement above, num_accumulates is the current accumulation count, which can be thought of as the length of the accumulation window. The window must be at least as long as min_average_window, and cannot exceed max_average_window or num_updates * average_window_rate, where num_updates is the current number of Parameter updates and average_window_rate is a coefficient used to compute the window length.
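
As a minimal, illustrative sketch (assuming plain Python scalars; the names mirror the description above rather than the optimizer's internals), the reset condition can be written as:

def should_reset(num_accumulates, num_updates, average_window_rate,
                 min_average_window, max_average_window):
    # The effective window is at least min_average_window and at most
    # max_average_window or num_updates * average_window_rate.
    average_window = min(max_average_window, num_updates * average_window_rate)
    return (num_accumulates >= min_average_window
            and num_accumulates >= average_window)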

Parameters
  • average_window_rate (float) – The ratio used to compute the window length relative to the number of Parameter updates.

  • min_average_window (int, optional) – The minimum size of the average window length. The default value is 10000.

  • max_average_window (int, optional) – The maximum size of the average window length. The default value is 10000.

  • regularization (WeightDecayRegularizer, optional) – A Regularizer, such as L2DecayRegularizer. The default value is None.

  • name (str, optional) – Normally there is no need for user to set this property. For more information, please refer to api_guide_Name. The default value is None.

Examples

import paddle.fluid as fluid
import numpy

# First create the Executor.
place = fluid.CPUPlace()  # fluid.CUDAPlace(0)
exe = fluid.Executor(place)

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    # build net
    data = fluid.data(name='X', shape=[None, 1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
    optimizer.minimize(loss)

    # build ModelAverage optimizer
    model_average = fluid.optimizer.ModelAverage(0.15,
                                                 min_average_window=10000,
                                                 max_average_window=12500)

    exe.run(startup_program)
    for i in range(12500):
        x = numpy.random.random(size=(10, 1)).astype('float32')
        outs = exe.run(program=train_program,
                       feed={'X': x},
                       fetch_list=[loss.name])

    # apply ModelAverage
    with model_average.apply(exe):
        x = numpy.random.random(size=(10, 1)).astype('float32')
        exe.run(program=train_program,
                feed={'X': x},
                fetch_list=[loss.name])
apply(executor, need_restore=True)

Apply the average of the accumulated Parameters to the parameters of the current model.

Parameters
  • executor (fluid.Executor) – The current network executor.

  • need_restore (bool) – Restore flag. If set to True, the Parameters of the network will be restored to their original values after the context exits; if set to False, they will not be restored. The default value is True.

Examples

import paddle.fluid as fluid
import numpy

# First create the Executor.
place = fluid.CPUPlace()  # fluid.CUDAPlace(0)
exe = fluid.Executor(place)

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    # build net
    data = fluid.data(name='X', shape=[None, 1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
    optimizer.minimize(loss)

    # build ModelAverage optimizer
    model_average = fluid.optimizer.ModelAverage(0.15,
                                                min_average_window=10000,
                                                max_average_window=12500)

    exe.run(startup_program)
    for i in range(12500):
        x = numpy.random.random(size=(10, 1)).astype('float32')
        outs = exe.run(program=train_program,
                    feed={'X': x},
                    fetch_list=[loss.name])

    # apply ModelAverage
    with model_average.apply(exe):
        x = numpy.random.random(size=(10, 1)).astype('float32')
        exe.run(program=train_program,
                feed={'X': x},
                fetch_list=[loss.name])
restore(executor)

Restore the Parameter values of the current model.

Parameters

executor (fluid.Executor) – The current network executor.

Examples

import paddle.fluid as fluid
import numpy

# First create the Executor.
place = fluid.CPUPlace()  # fluid.CUDAPlace(0)
exe = fluid.Executor(place)

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    # build net
    data = fluid.data(name='X', shape=[None, 1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
    optimizer.minimize(loss)

    # build ModelAverage optimizer
    model_average = fluid.optimizer.ModelAverage(0.15,
                                                min_average_window=10000,
                                                max_average_window=12500)

    exe.run(startup_program)
    for i in range(12500):
        x = numpy.random.random(size=(10, 1)).astype('float32')
        outs = exe.run(program=train_program,
                    feed={'X': x},
                    fetch_list=[loss.name])

    # apply ModelAverage
    with model_average.apply(exe, False):
        x = numpy.random.random(size=(10, 1)).astype('float32')
        exe.run(program=train_program,
                feed={'X': x},
                fetch_list=[loss.name])

    # restore Parameters
    model_average.restore(exe)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, do auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – The gradient clipping strategy. It is not needed in static graph mode; currently, this argument only supports gradient clipping in dygraph mode. In the future, this argument may be adjusted. The default value is None.

Returns

tuple (optimize_ops, params_grads): a list of operators appended by minimize and a list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed accordingly.

Parameters

state_dict (dict) – Dict contains all the Variable needed by optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get the state dict from the optimizer. It contains all the variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (that is, minimize has never been invoked), the state dict is empty.

Parameters

None

Returns

A dict containing all the variables used by the optimizer.

Return type

dict

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

Momentum

paddle.fluid.optimizer.Momentum

alias of paddle.fluid.optimizer.MomentumOptimizer

MomentumOptimizer

class paddle.fluid.optimizer.MomentumOptimizer(learning_rate, momentum, use_nesterov=False, regularization=None, name=None)[source]

Simple Momentum optimizer with velocity state.

This optimizer has a flag for Nesterov Momentum.

The update equations are as follows:

\[ \begin{align}\begin{aligned}& velocity = mu * velocity + gradient\\& if (use\_nesterov):\\&\quad param = param - (gradient + mu * velocity) * learning\_rate\\& else:\\&\quad param = param - learning\_rate * velocity\end{aligned}\end{align} \]
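
A minimal NumPy sketch of this rule, with the Nesterov variant selected by a use_nesterov flag; the names and default values are illustrative, not the operator's internal implementation.

import numpy as np

def momentum_step(param, grad, velocity, lr=0.001, mu=0.9, use_nesterov=False):
    # Accumulate velocity, then apply either the Nesterov or the plain update.
    velocity = mu * velocity + grad
    if use_nesterov:
        param = param - (grad + mu * velocity) * lr
    else:
        param = param - lr * velocity
    return param, velocity
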
Parameters
  • learning_rate (float|Variable) – The learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.

  • momentum (float) – Momentum factor

  • use_nesterov (bool, optional) – Enables Nesterov momentum. Default is False.

  • regularization – A Regularizer, such as L2DecayRegularizer. Optional, default is None.

  • name (str, optional) – This parameter is used by developers to print debugging information. For details, please refer to api_guide_Name. Default is None.

Examples

import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    moment_optimizer = fluid.optimizer.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
    moment_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, do auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – The gradient clipping strategy. It is not needed in static graph mode; currently, this argument only supports gradient clipping in dygraph mode. In the future, this argument may be adjusted. The default value is None.

Returns

tuple (optimize_ops, params_grads): a list of operators appended by minimize and a list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed accordingly.

Parameters

state_dict (dict) – Dict contains all the Variable needed by optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get the state dict from the optimizer. It contains all the variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (that is, minimize has never been invoked), the state dict is empty.

Parameters

None

Returns

A dict containing all the variables used by the optimizer.

Return type

dict

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

PipelineOptimizer

class paddle.fluid.optimizer.PipelineOptimizer(optimizer, cut_list=None, place_list=None, concurrency_list=None, queue_size=30, sync_steps=1, start_cpu_core_id=0)[source]

Pipeline Optimizer

Train in pipeline mode. The program will be split by cut_list.

If the length of cut_list is k, then the whole program (including the backward part) will be split into 2*k-1 sections.

So the lengths of place_list and concurrency_list must also be 2*k-1, as in the sketch below.
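
For instance, a quick sanity check of this rule (with placeholder strings standing in for the Variable lists used in a real cut_list):

cut_list = [["emb_x", "emb_y"], ["loss"]]   # k = 2 cut points
num_sections = 2 * len(cut_list) - 1        # forward/backward split -> 3 sections
assert num_sections == 3                    # expected length of place_list and concurrency_list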

Note: Although the asynchronous mode is used in pipeline training for speed, the final performance depends heavily on the training progress of each pipeline. We will try the synchronous mode in the future.

Parameters
  • optimizer (Optimizer) – The based optimizer, such as SGD.

  • cut_list (list of Variable list) – The cut variable of the main_program.

  • place_list (list of Place) – The places where each section will run.

  • concurrency_list (list of int) – The concurrency degree.

  • queue_size (int) – Each section consumes scopes from its in-scope queue and produces scopes to its out-scope queue; this parameter specifies the scope queue size. [Optional. Default: 30].

  • sync_steps (int) – The synchronization steps between different cards. [Optional. Default: 1].

  • start_cpu_core_id (int) – Specifies the first CPU core id. [Optional. Default: 0].

Examples

import paddle.fluid as fluid
import paddle.fluid.layers as layers

x = fluid.layers.data(name='x', shape=[1], dtype='int64', lod_level=0)
y = fluid.layers.data(name='y', shape=[1], dtype='int64', lod_level=0)
emb_x = layers.embedding(input=x, param_attr=fluid.ParamAttr(name="embx"), size=[10,2], is_sparse=False)
emb_y = layers.embedding(input=y, param_attr=fluid.ParamAttr(name="emby",learning_rate=0.9), size=[10,2], is_sparse=False)
concat = layers.concat([emb_x, emb_y], axis=1)
fc = layers.fc(input=concat, name="fc", size=1, num_flatten_dims=1, bias_attr=False)
loss = layers.reduce_mean(fc)
optimizer = fluid.optimizer.SGD(learning_rate=0.5)
optimizer = fluid.optimizer.PipelineOptimizer(optimizer,
        cut_list=[[emb_x, emb_y], [loss]],
        place_list=[fluid.CPUPlace(), fluid.CUDAPlace(0), fluid.CPUPlace()],
        concurrency_list=[1, 1, 4],
        queue_size=2,
        sync_steps=1,
        )
optimizer.minimize(loss)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
filelist = [] # you should set your own filelist, e.g. filelist = ["dataA.txt"]
dataset = fluid.DatasetFactory().create_dataset("FileInstantDataset")
dataset.set_use_var([x,y])
batch_size = 32  # set your own batch size
dataset.set_batch_size(batch_size)
dataset.set_filelist(filelist)
exe.train_from_dataset(
            fluid.default_main_program(),
            dataset,
            thread=2,
            debug=False,
            fetch_list=[],
            fetch_info=[],
            print_period=1)

RMSPropOptimizer

class paddle.fluid.optimizer.RMSPropOptimizer(learning_rate, rho=0.95, epsilon=1e-06, momentum=0.0, centered=False, regularization=None, name=None)[source]

Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning rate method. The original slides proposed RMSProp: Slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf .

The original equation is as follows:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\w & = w - \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{aligned}\end{align} \]

The first equation calculates the moving average of the squared gradient for each weight. The gradient is then divided by \(\sqrt{r(w,t) + \epsilon}\).

In some cases, adding a momentum term \(\beta\) is beneficial. In our implementation, Nesterov momentum is used:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align} \]

if centered is True:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\g(w, t) & = \rho g(w, t-1) + (1 - \rho)\nabla Q_{i}(w)\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align} \]

where \(\rho\) is a hyperparameter and typical values are 0.9, 0.95, and so on; \(\beta\) is the momentum term; \(\epsilon\) is a smoothing term to avoid division by zero, usually set somewhere in the range from 1e-4 to 1e-8.
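
A minimal NumPy sketch of the centered variant above; the function name and the default values simply mirror the symbols in the equations and are not the operator's internal implementation.

import numpy as np

def centered_rmsprop_step(w, grad, r, g, v, eta=0.01, rho=0.95,
                          beta=0.0, epsilon=1e-06):
    r = rho * r + (1.0 - rho) * grad * grad   # moving average of the squared gradient
    g = rho * g + (1.0 - rho) * grad          # moving average of the gradient (centering term)
    v = beta * v + eta / np.sqrt(r - g * g + epsilon) * grad
    w = w - v
    return w, r, g, v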

Parameters
  • learning_rate (float) – Global learning rate.

  • rho (float) – \(\rho\) in the equation, default is 0.95.

  • epsilon (float) – \(\epsilon\) in the equation is a smoothing term to avoid division by zero, default is 1e-6.

  • momentum (float) – \(\beta\) in equation is the momentum term, default is 0.0.

  • centered (bool) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.

  • regularization – A Regularizer, such as L2DecayRegularizer. Optional, default is None.

  • name (str, optional) – This parameter is used by developers to print debugging information. For details, please refer to api_guide_Name. Default is None.

Raises

ValueError – If any of learning_rate, rho, epsilon, or momentum is None.

Examples

import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    rms_optimizer = fluid.optimizer.RMSProp(learning_rate=0.1)
    rms_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, do auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – The gradient clipping strategy. It is not needed in static graph mode; currently, this argument only supports gradient clipping in dygraph mode. In the future, this argument may be adjusted. The default value is None.

Returns

tuple (optimize_ops, params_grads): a list of operators appended by minimize and a list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed accordingly.

Parameters

state_dict (dict) – Dict contains all the Variable needed by optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get the state dict from the optimizer. It contains all the variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (that is, minimize has never been invoked), the state dict is empty.

Parameters

None

Returns

A dict containing all the variables used by the optimizer.

Return type

dict

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()

SGD

paddle.fluid.optimizer.SGD

alias of paddle.fluid.optimizer.SGDOptimizer

SGDOptimizer

class paddle.fluid.optimizer.SGDOptimizer(learning_rate, regularization=None, name=None)[source]

Optimizer of the stochastic gradient descent algorithm.

\[param\_out = param - learning\_rate * grad\]
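
For reference, this rule is a single line of code; the sketch below is purely illustrative and not the operator's internal implementation.

def sgd_step(param, grad, lr=0.001):
    # Plain stochastic gradient descent update on an array-like parameter.
    return param - lr * grad
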
Parameters
  • learning_rate (float|Variable) – The learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.

  • regularization – A Regularizer, such as L2DecayRegularizer. Optional, default is None.

  • name (str, optional) – This parameter is used by developers to print debugging information. For details, please refer to api_guide_Name. Default is None.

Examples

import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
    sgd_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters

params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program) – startup_program for initializing parameters in parameter_list.

  • params_grads (list) – list of (param, grad) pair to do optimization.

Returns

A list of operators appended to the current program.

Return type

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

The first part of minimize, do auto-diff to append backward operations for the current program.

Parameters
  • loss (Variable) – loss variable to run optimizations.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • callbacks (list, optional) – list of callable objects to run when appending backward operator for one parameter. The default value is None.

Returns

A list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

list

Examples

See examples in apply_gradients.

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

Parameters
  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, at this time default_startup_program will be used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, at this time all parameters will be updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – The gradient clipping strategy. It is not needed in static graph mode; currently, this argument only supports gradient clipping in dygraph mode. In the future, this argument may be adjusted. The default value is None.

Returns

tuple (optimize_ops, params_grads): a list of operators appended by minimize and a list of (param, grad) variable pairs; param is the Parameter and grad is the gradient value corresponding to the parameter.

Return type

tuple

Examples

Please refer to the example of current Optimizer.

set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed accordingly.

Parameters

state_dict (dict) – Dict contains all the Variable needed by optimizer

Returns

None

Examples

import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)
state_dict()

Get the state dict from the optimizer. It contains all the variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (that is, minimize has never been invoked), the state dict is empty.

Parameters

None

Returns

A dict containing all the variables used by the optimizer.

Return type

dict

Examples

import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()