Strategy

class paddle.distributed.Strategy(config=None) [source]

The Strategy object is used to configure the parallelization and optimization strategies for static graph training. It currently supports configuring sharding, fused_passes, gradient_merge and pipeline; more strategies will be supported in the future.

sharding is used to configure the sharding of the optimizer states, which saves GPU memory.

fused_passes is used to configure the fusion of computation in the model.

gradient_merge is used to configure the gradient merge strategy in training.

pipeline is used to configure the pipeline parallelism strategy.

Parameters

config (dict|None, optional) – The user-defined configurations. If config is None, the default configurations are used. If it is a dict, the items inside the dict are used to set the corresponding configurations, and the rest keep their default values. A dict-based sketch follows the example below.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.sharding.enable = True
>>> strategy.sharding.stage = 2
>>> strategy.sharding.degree = 2

>>> strategy.gradient_merge.enable = True
>>> strategy.gradient_merge.k_steps = 2
>>> strategy.gradient_merge.avg = False

>>> strategy.pipeline.enable = True
>>> strategy.pipeline.schedule_mode = "1F1B" # default is "1F1B"
>>> strategy.pipeline.micro_batch_size = 2
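
As an alternative to setting attributes one by one, the same settings can be passed through the config argument. The following is a minimal sketch assuming the dict uses nested keys that mirror the attribute names shown above; the exact accepted key format is not specified here.

>>> # Sketch only: nested keys assumed to mirror the attribute names.
>>> config = {
...     "sharding": {"enable": True, "stage": 2, "degree": 2},
...     "gradient_merge": {"enable": True, "k_steps": 2, "avg": False},
... }
>>> strategy = dist.Strategy(config)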
property sharding [source]

sharding is used to configure the sharding of the optimizer states, containing the following configs:

enable (bool): whether to enable sharding. Default: False.

stage (int): can be set to 1, 2 or 3. 1 shards the optimizer states only, 2 shards the optimizer states and gradients, and 3 shards the optimizer states, gradients and parameters. Default: 1.

degree (int): the number of shards to split into. Default: 8.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.sharding.enable = True
>>> strategy.sharding.stage = 2
>>> strategy.sharding.degree = 2
property gradient_merge

gradient_merge is used to configure the gradient merge strategy in training, containing the following configs:

enable (bool): whether to enable gradient merge. Default: False.

k_steps (int): the number of steps for merging gradients. Default: 1.

avg (bool): whether to average the gradients of each step. Default: True.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.gradient_merge.enable = True
>>> strategy.gradient_merge.k_steps = 2
>>> strategy.gradient_merge.avg = True
property fused_passes

fused_passes is used to configure the fusion of computation in the model, containing the following configs:

enable (bool): whether to enable fused passes. Default: False.

gemm_epilogue (bool): whether to fuse matmul and add computation in the Linear layer. Default: False.

dropout_add (bool): whether to fuse dropout and add computation. Default: False.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.fused_passes.enable = True
>>> strategy.fused_passes.gemm_epilogue = True
>>> strategy.fused_passes.dropout_add = True
property pipeline

pipeline is used to configure the pipeline parallelism, containing the following configs:

enable (bool): whether to enable pipeline parallelism. Default: False.

schedule_mode (str): the scheduling mode of pipeline parallelism. Default: “1F1B”.

micro_batch_size (int): the size of each micro-batch inside a mini-batch. Default: 1.

accumulate_steps (int): the number of steps for gradient accumulation. Default: 1.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.pipeline.enable = True
>>> strategy.pipeline.micro_batch_size = 2
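
The schedule_mode and accumulate_steps configs listed above can be set the same way; the values below are illustrative only, not recommended settings.

>>> strategy.pipeline.schedule_mode = "1F1B"
>>> strategy.pipeline.accumulate_steps = 4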
property amp

amp is used to configure automatic mixed precision (AMP) training, containing the following configs:

enable (bool): whether to enable AMP. Default: False.

dtype (str): the data type of AMP. Default: “float16”.

level (str): the level of AMP. Default: “O1”.

init_loss_scaling (float): the initial value of loss scaling. Default: 32768.0.

incr_every_n_steps (int): the number of steps for increasing loss scaling. Default: 1000.

decr_every_n_nan_or_inf (int): the number of steps for decreasing loss scaling. Default: 2.

incr_ratio (float): the ratio for increasing loss scaling. Default: 2.0.

decr_ratio (float): the ratio for decreasing loss scaling. Default: 2.0.

use_dynamic_loss_scaling (bool): whether to use dynamic loss scaling. Default: False.

custom_white_list (list): the list of names for which AMP will be applied. Default: [].

custom_black_list (list): the list of names for which AMP will not be applied. Default: [].

custom_black_varnames (list): the list of names for which AMP will not be applied. Default: [].

use_fp16_guard (bool): whether to use fp16 guard. Default: False.

use_bf16_guard (bool): whether to use bf16 guard. Default: False.

use_master_grad (bool): whether to use master grad. Default: False.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.amp.enable = True
>>> strategy.amp.dtype = "float16"
>>> strategy.amp.level = "O2"
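
The loss-scaling related configs listed above can be adjusted the same way; the values below are illustrative, not recommended settings, and the op name in custom_black_list is only an example.

>>> strategy.amp.use_dynamic_loss_scaling = True
>>> strategy.amp.init_loss_scaling = 32768.0
>>> strategy.amp.custom_black_list = ["reduce_sum"]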