AdamaxOptimizer
class paddle.fluid.optimizer.AdamaxOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, parameter_list=None, regularization=None, grad_clip=None, name=None)

The Adamax optimizer is implemented based on the Adamax optimization described in Section 7 of the Adam paper. The Adamax algorithm is a variant of the Adam algorithm based on the infinity norm, which makes the learning-rate update rule more stable and simpler.
The update rule for parameter param_out with gradient grad is:

\[
\begin{aligned}
t &= t + 1\\
moment\_out &= \beta_1 * moment + (1 - \beta_1) * grad\\
inf\_norm\_out &= \max(\beta_2 * inf\_norm + \epsilon, |grad|)\\
learning\_rate &= \frac{learning\_rate}{1 - \beta_1^t}\\
param\_out &= param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}
\end{aligned}
\]

Related paper: Adam: A Method for Stochastic Optimization
The original paper does not include an epsilon attribute; it is added here for numerical stability to prevent division-by-zero errors.
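As an illustration of the update rule above, here is a minimal NumPy sketch of a single Adamax step. It is not Paddle's internal implementation; the function name and the returned-state convention are assumptions made for clarity, and the hyperparameter defaults mirror the constructor signature.

    import numpy as np

    def adamax_step(param, grad, moment, inf_norm, t,
                    learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08):
        # Illustrative sketch of the Adamax update rule above, not Paddle's kernel.
        t = t + 1
        moment = beta1 * moment + (1 - beta1) * grad
        inf_norm = np.maximum(beta2 * inf_norm + epsilon, np.abs(grad))
        lr_t = learning_rate / (1 - beta1 ** t)  # bias correction for the first moment
        param = param - lr_t * moment / inf_norm
        return param, moment, inf_norm, t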
- Parameters
    - learning_rate (float|Variable, optional) – The learning rate used to update Parameter. It can be a float value or a Variable with a float type. The default value is 0.001.
    - beta1 (float, optional) – The exponential decay rate for the 1st moment estimates. The default value is 0.9.
    - beta2 (float, optional) – The exponential decay rate for the 2nd moment estimates. The default value is 0.999.
    - epsilon (float, optional) – A small float value for numerical stability. The default value is 1e-08.
    - parameter_list (list, optional) – List of Variable names to update to minimize loss. This parameter is required in dygraph mode. The default value is None in static mode, in which case all parameters will be updated.
    - regularization (WeightDecayRegularizer, optional) – The regularization strategy. There are two methods: L1Decay and L2Decay. If a parameter has already set a regularizer using ParamAttr, the regularization setting here in the optimizer is ignored for that parameter; otherwise, the setting here takes effect. Default None, meaning there is no regularization.
    - grad_clip (GradientClipBase, optional) – Gradient clipping strategy; an instance of a class derived from GradientClipBase. There are three clipping strategies (GradientClipByGlobalNorm, GradientClipByNorm, GradientClipByValue). Default None, meaning there is no gradient clipping. A construction sketch for regularization and grad_clip follows this list.
    - name (str, optional) – Normally there is no need for the user to set this property. For more information, please refer to Name. The default value is None.
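The sketch below shows one assumed way to construct the optional regularization and gradient-clipping strategies named above and pass them to the constructor; the clip_norm and regularization_coeff values are arbitrary placeholders.

    import paddle.fluid as fluid

    # L2 weight decay and global-norm gradient clipping, as named above.
    reg = fluid.regularizer.L2Decay(regularization_coeff=1e-4)   # placeholder coefficient
    clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)    # placeholder clip norm

    adamax = fluid.optimizer.AdamaxOptimizer(
        learning_rate=0.001,
        beta1=0.9,
        beta2=0.999,
        epsilon=1e-08,
        regularization=reg,
        grad_clip=clip)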
- Notes:
Currently, AdamaxOptimizer doesn’t support sparse parameter optimization.
Examples
    import paddle.fluid as fluid
    import numpy

    # First create the Executor.
    place = fluid.CPUPlace()  # fluid.CUDAPlace(0)
    exe = fluid.Executor(place)

    train_program = fluid.Program()
    startup_program = fluid.Program()
    with fluid.program_guard(train_program, startup_program):
        data = fluid.data(name='X', shape=[None, 1], dtype='float32')
        hidden = fluid.layers.fc(input=data, size=10)
        loss = fluid.layers.mean(hidden)
        adam = fluid.optimizer.AdamaxOptimizer(learning_rate=0.2)
        adam.minimize(loss)

    # Run the startup program once and only once.
    exe.run(startup_program)

    x = numpy.random.random(size=(10, 1)).astype('float32')
    outs = exe.run(program=train_program,
                   feed={'X': x},
                   fetch_list=[loss.name])
clear_gradients()

Clear the gradients of all optimized parameters for the model.
- Returns
None
Examples
    import paddle.fluid as fluid
    import numpy as np

    with fluid.dygraph.guard():
        value = np.arange(26).reshape(2, 13).astype("float32")
        a = fluid.dygraph.to_variable(value)
        linear = fluid.Linear(13, 5, dtype="float32")
        # This can be any optimizer supported by dygraph.
        adam = fluid.optimizer.Adam(learning_rate=0.01,
                                    parameter_list=linear.parameters())
        out = linear(a)
        out.backward()
        adam.minimize(out)
        adam.clear_gradients()
current_step_lr()

Note: This API is ONLY available in dygraph mode.

Get the learning rate of the current step. When LearningRateDecay is not used, the return value is the same for every step; otherwise, the learning rate of the current step is returned.
- Returns
The learning rate of the current step.
- Return type
float
Examples
    import paddle.fluid as fluid
    import numpy as np

    # example1: LearningRateDecay is not used, return value is all the same
    with fluid.dygraph.guard():
        emb = fluid.dygraph.Embedding([10, 10])
        adam = fluid.optimizer.Adam(0.001, parameter_list=emb.parameters())
        lr = adam.current_step_lr()
        print(lr)  # 0.001

    # example2: PiecewiseDecay is used, return the step learning rate
    with fluid.dygraph.guard():
        inp = np.random.uniform(-0.1, 0.1, [10, 10]).astype("float32")
        linear = fluid.dygraph.nn.Linear(10, 10)
        inp = fluid.dygraph.to_variable(inp)
        out = linear(inp)
        loss = fluid.layers.reduce_mean(out)
        bd = [2, 4, 6, 8]
        value = [0.2, 0.4, 0.6, 0.8, 1.0]
        adam = fluid.optimizer.Adam(fluid.dygraph.PiecewiseDecay(bd, value, 0),
                                    parameter_list=linear.parameters())

        # first step: learning rate is 0.2
        np.allclose(adam.current_step_lr(), 0.2, rtol=1e-06, atol=0.0)  # True

        # learning rate for different steps
        ret = [0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0]
        for i in range(12):
            adam.minimize(loss)
            lr = adam.current_step_lr()
            np.allclose(lr, ret[i], rtol=1e-06, atol=0.0)  # True
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None)

Add operations to minimize loss by updating parameter_list.

- Parameters
    - loss (Variable) – A Variable containing the value to minimize.
    - startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, in which case default_startup_program will be used.
    - parameter_list (list, optional) – List of Variable or Variable.name to update to minimize loss. The default value is None, in which case all parameters will be updated.
    - no_grad_set (set, optional) – Set of Variable or Variable.name that don't need to be updated. The default value is None.
- Returns
    tuple (optimize_ops, params_grads): a list of operators appended by minimize and a list of (param, grad) variable pairs, where param is a Parameter and grad is the gradient value corresponding to the parameter. The returned tuple can be passed to fetch_list in Executor.run() to indicate program pruning; if so, the program will be pruned by feed and fetch_list before running. See details in Executor.
- Return type
    tuple
Examples
Please refer to the example of the current optimizer above.
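In addition, a minimal static-graph sketch of capturing the return value of minimize() is shown below; the toy network is an assumption that mirrors the class example above.

    import paddle.fluid as fluid

    train_program = fluid.Program()
    startup_program = fluid.Program()
    with fluid.program_guard(train_program, startup_program):
        data = fluid.data(name='X', shape=[None, 1], dtype='float32')
        hidden = fluid.layers.fc(input=data, size=10)
        loss = fluid.layers.mean(hidden)
        adamax = fluid.optimizer.AdamaxOptimizer(learning_rate=0.2)
        optimize_ops, params_grads = adamax.minimize(loss)

    # params_grads is a list of (param, grad) variable pairs, one per updated Parameter.
    for param, grad in params_grads:
        print(param.name, grad.name)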
set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed accordingly.
- Parameters
state_dict (dict) – Dict containing all the Variables needed by the optimizer
- Returns
None
Examples
    import paddle.fluid as fluid

    with fluid.dygraph.guard():
        emb = fluid.dygraph.Embedding([10, 10])

        state_dict = emb.state_dict()
        fluid.save_dygraph(state_dict, "paddle_dy")

        adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000),
                                    parameter_list=emb.parameters())
        state_dict = adam.state_dict()
        fluid.save_dygraph(state_dict, "paddle_dy")

        para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

        adam.set_dict(opti_state_dict)
state_dict()

Get state dict information from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, this includes beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be included in the state dict. If the optimizer has never been called (via the minimize function), the state_dict is empty.

- Parameters
    None
- Returns
    dict containing all the Variables used by the optimizer
- Return type
    state_dict (dict)
Examples
    import paddle.fluid as fluid

    with fluid.dygraph.guard():
        emb = fluid.dygraph.Embedding([10, 10])
        adam = fluid.optimizer.Adam(0.001, parameter_list=emb.parameters())
        state_dict = adam.state_dict()