class paddle.fluid.optimizer.DGCMomentumOptimizer(learning_rate, momentum, rampup_begin_step, rampup_step=1, sparsity=[0.999], use_nesterov=False, local_grad_clip_norm=None, num_trainers=None, regularization=None, name=None)

DGC (Deep Gradient Compression) Momentum Optimizer. The original paper is available at https://arxiv.org/abs/1712.01887

DGC reduces the communication bandwidth by sending only the important gradients (sparse update): only gradients larger than a threshold are transmitted.

To avoid losing information, DGC accumulates the rest of the gradients locally.

Eventually, these gradients become large enough to be transmitted.

Thus, DGC sends the large gradients immediately but eventually sends all of the gradients over time.
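The accumulate-then-send behavior described above can be sketched in plain Python. This is a minimal illustration, not Paddle's implementation: the function name sparsify_with_accumulation and the list-based tensors are hypothetical, and the threshold is chosen implicitly by keeping the largest-magnitude (1 - sparsity) fraction of the accumulated values.

```python
def sparsify_with_accumulation(residual, grad, sparsity):
    """Return (sparse_update, new_residual) for one step.

    residual and grad are equal-length lists of floats (stand-ins for tensors).
    """
    # Add the new gradient to what was held back in earlier steps.
    accumulated = [r + g for r, g in zip(residual, grad)]
    # Number of elements to send: the (1 - sparsity) fraction, at least one.
    k = max(1, round(len(accumulated) * (1.0 - sparsity)))
    # Indices of the k largest-magnitude accumulated values.
    top = sorted(range(len(accumulated)),
                 key=lambda i: abs(accumulated[i]), reverse=True)[:k]
    sparse_update = {i: accumulated[i] for i in top}
    # Sent positions are cleared; everything else stays in the local residual.
    new_residual = [0.0 if i in sparse_update else v
                    for i, v in enumerate(accumulated)]
    return sparse_update, new_residual


residual = [0.0, 0.0, 0.0, 0.0]
grad = [1.0, 1.5, 0.25, 0.0]  # the same illustrative gradient at every step
sent = []
for step in range(3):
    update, residual = sparsify_with_accumulation(residual, grad, sparsity=0.75)
    sent.append(update)
```

In this run, index 1 is sent first, index 0 is sent one step later with its accumulated value, and the smaller entries remain in residual until they grow large enough.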

To avoid losing accuracy, DGC applies momentum correction and local gradient clipping on top of gradient sparsification.

DGC also uses momentum factor masking and warmup training to overcome the staleness problem caused by reduced communication.
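Momentum correction and momentum factor masking can be sketched as follows, based on the description in the DGC paper. This is a simplified single-worker illustration in plain Python: the function name dgc_momentum_step, the buffer names, and the fixed k are assumptions for the example and do not reflect Paddle's internal code.

```python
def dgc_momentum_step(velocity, residual, grad, momentum, k):
    """One local step; returns (sparse_update, velocity, residual)."""
    # Momentum correction: accumulate the velocity, not the raw gradient.
    velocity = [momentum * u + g for u, g in zip(velocity, grad)]
    residual = [v + u for v, u in zip(residual, velocity)]
    # Select the k largest-magnitude accumulated values for transmission.
    top = set(sorted(range(len(residual)),
                     key=lambda i: abs(residual[i]), reverse=True)[:k])
    sparse_update = {i: residual[i] for i in top}
    # Momentum factor masking: clear both buffers at the transmitted positions
    # so stale momentum does not carry over.
    velocity = [0.0 if i in top else u for i, u in enumerate(velocity)]
    residual = [0.0 if i in top else v for i, v in enumerate(residual)]
    return sparse_update, velocity, residual


velocity = [0.0, 0.0, 0.0]
residual = [0.0, 0.0, 0.0]
grad = [1.0, 4.0, 2.0]  # illustrative gradient, reused for both steps
first, velocity, residual = dgc_momentum_step(velocity, residual, grad, 0.5, k=1)
second, velocity, residual = dgc_momentum_step(velocity, residual, grad, 0.5, k=1)
```

The second update carries the value 5.0 for index 2, which is the sum of the two velocity values a dense momentum update would have applied at that position.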

This optimizer will do two things:

  1. Compress the gradient by selecting the top-k most important values from the gradient tensor and using only those for allreduce, which reduces network bandwidth.

  2. Call momentum to optimize the cost.

  • learning_rate (float|Variable) – The learning rate used to update parameters. It can be a float value or a Variable with one float value as a data element.

  • momentum (float) – Momentum factor.

  • rampup_begin_step (int) – The step at which gradient compression begins.

  • rampup_step (int) – The number of steps in the sparsity warm-up period. Default is 1. For example, if sparsity is [0.75, 0.9375, 0.984375, 0.996, 0.999] and rampup_step is 100, each value is used for 20 steps: 0.75 for steps 0~19, 0.9375 for steps 20~39, and so on. Once the end of the sparsity list is reached, the last value (0.999) is used for all remaining steps.

  • sparsity (list[float]) – The sparsity values used to select the most important elements of the gradient tensor; the fraction transmitted is (1 - current sparsity). Default is [0.999]. For example, if sparsity is [0.99, 0.999], the top 1% and then the top 0.1% of elements are transmitted.

  • use_nesterov (bool) – Enables Nesterov momentum. True means use Nesterov. Default is False.

  • local_grad_clip_norm (float, optional) – The norm value for local gradient clipping. Default is None, which means no clipping is applied.

  • num_trainers (int, optional) – The number of training nodes. Optional, default is None.

  • regularization (WeightDecayRegularizer, optional) – A Regularizer, such as L2DecayRegularizer. Optional, default is None.

  • name (str, optional) – This parameter is used by developers to print debugging information. For details, please refer to Name. Default is None.
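The interaction between rampup_begin_step, rampup_step, and sparsity can be sketched as a small lookup function. This is an illustration under stated assumptions rather than Paddle's code: the helper name current_sparsity is hypothetical, and it assumes the warm-up steps are counted from rampup_begin_step.

```python
def current_sparsity(step, rampup_begin_step, rampup_step, sparsity):
    """Return the sparsity used at a global step, or None before compression starts."""
    if step < rampup_begin_step:
        return None  # gradients are still sent uncompressed
    elapsed = step - rampup_begin_step
    # Each sparsity value covers rampup_step / len(sparsity) steps;
    # after the warm-up, the last value is used.
    index = min(elapsed * len(sparsity) // rampup_step, len(sparsity) - 1)
    return sparsity[index]


schedule = [0.75, 0.9375, 0.984375, 0.996, 0.999]
warmup = [current_sparsity(s, 0, 100, schedule) for s in (0, 19, 20, 39, 100)]
# Fraction of gradient elements transmitted at the first stage.
first_stage_ratio = 1 - current_sparsity(0, 0, 100, schedule)
```

With the values from the rampup_step example, steps 0 and 19 map to 0.75, steps 20 and 39 map to 0.9375, and step 100 maps to 0.999.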


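import paddle.fluid as fluid
# learning_rate, momentum, and rampup_begin_step are required; the values here are illustrative.
optimizer = fluid.optimizer.DGCMomentumOptimizer(
            learning_rate=0.0001, momentum=0.9, rampup_begin_step=1252,
            sparsity=[0.999, 0.999])

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)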

Add operations to minimize loss by updating parameter_list.

  • loss (Variable) – A Variable containing the value to minimize.

  • startup_program (Program, optional) – Program for initializing parameters in parameter_list. The default value is None, in which case default_startup_program is used.

  • parameter_list (list, optional) – List of Variable names to update to minimize loss. The default value is None, in which case all parameters are updated.

  • no_grad_set (set, optional) – Set of Variable objects that don’t need to be updated. The default value is None.

  • grad_clip (GradClipBase, optional) – Gradient clipping strategy. Static graph mode does not need this argument; currently it only supports gradient clipping in dygraph mode. This argument may be adjusted in the future. The default value is None.


Returns

tuple (optimize_ops, params_grads): a list of operators appended by minimize and a list of (param, grad) variable pairs, where param is a Parameter and grad is the gradient value corresponding to that parameter.

Return type

tuple

Please refer to the example of current Optimizer.


set_dict(state_dict)

Load the optimizer state dict. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step will be changed.


  • state_dict (dict) – A dict containing all the Variables needed by the optimizer.




import paddle.fluid as fluid

with fluid.dygraph.guard():
    emb = fluid.dygraph.Embedding("emb", [10, 10])

    # Save the layer parameters.
    state_dict = emb.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    # Save the optimizer state; the path must match the one passed to load_dygraph.
    adam = fluid.optimizer.Adam(learning_rate=fluid.layers.noam_decay(100, 10000))
    state_dict = adam.state_dict()
    fluid.save_dygraph(state_dict, "paddle_dy")

    # Load both state dicts and restore the optimizer state.
    para_state_dict, opti_state_dict = fluid.load_dygraph("paddle_dy")

    adam.set_dict(opti_state_dict)

state_dict()

Get the state dict from the optimizer. It contains all the Variables used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, etc. If LearningRateDecay has been used, global_step is included in the state dict. If the optimizer has never been called (via the minimize function), the state dict is empty.

Returns

A dict containing all the Variables used by the optimizer.

Return type

state_dict (dict)


import paddle.fluid as fluid
adam = fluid.optimizer.Adam(0.001)
state_dict = adam.state_dict()