Strategy

class paddle.distributed.Strategy(config=None)

The Strategy object is used to configure the parallelization and optimization strategies for static graph training. It currently supports configuring sharding, fused_passes, gradient_merge and pipeline. More strategies will be supported in the future.

sharding is used to configure the sharding of the optimizer states, for saving GPU memory.

fused_passes is used to configure the fusion of the computation in the model.

gradient_merge is used to configure the gradient merge strategy in training.

pipeline is used to configure the pipeline parallelism strategy.

Parameters
  • config (dict|None, optional) – If config is None, the default configurations will be set. If it is a dict, the items inside the dict will be used to set the configurations, while the others remain the default values.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.sharding.enable = True
>>> strategy.sharding.stage = 2
>>> strategy.sharding.degree = 2

>>> strategy.gradient_merge.enable = True
>>> strategy.gradient_merge.k_steps = 2
>>> strategy.gradient_merge.avg = False

>>> strategy.pipeline.enable = True
>>> strategy.pipeline.schedule_mode = "1F1B" # default is "1F1B"
>>> strategy.pipeline.micro_batch_size = 2
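
The same settings can also be passed as a dict at construction time, per the config parameter described above. A minimal sketch, assuming the nested keys mirror the attribute paths in this example (the exact key layout is an assumption, not verified against the implementation):

>>> # Hedged sketch: dict-based configuration; nested key names are
>>> # assumed to mirror the attribute paths shown above.
>>> config = {"sharding": {"enable": True, "stage": 2, "degree": 2}}
>>> strategy_from_dict = dist.Strategy(config)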
property sharding

sharding is used to configure the sharding of the optimizer states, containing the following configs:

enable (bool): whether to enable sharding. Default: False.

stage (int): can be set to 1, 2 or 3. Stage 1 segments the optimizer states, stage 2 segments the optimizer states and gradients, and stage 3 segments the optimizer states, gradients and parameters. Default: 1.

degree (int): the number of segmentation pieces. Default: 8.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.sharding.enable = True
>>> strategy.sharding.stage = 2
>>> strategy.sharding.degree = 2
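
For reference, a plain-Python restatement of what each stage segments across the degree devices, summarizing the description above (illustrative only, not Paddle code):

>>> # stage -> components segmented across devices (from the docs above)
>>> stage_shards = {
...     1: ["optimizer states"],
...     2: ["optimizer states", "gradients"],
...     3: ["optimizer states", "gradients", "parameters"],
... }
>>> stage_shards[2]
['optimizer states', 'gradients']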
property gradient_merge

gradient_merge is used to configure the gradient merge strategy in training, containing the following configs:

enable (bool): whether to enable gradient merge. Default: False.

k_steps (int): the number of steps for merging gradients. Default: 1.

avg (bool): whether to average the gradients of each step. Default: True.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.gradient_merge.enable = True
>>> strategy.gradient_merge.k_steps = 2
>>> strategy.gradient_merge.avg = True
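
For intuition, a plain-Python sketch of the merge semantics, not Paddle's implementation: with k_steps=2, two per-step gradients are combined before a single optimizer update, and averaged when avg=True.

>>> # Illustrative semantics only; Paddle performs this inside the pass.
>>> g_step1, g_step2 = 0.3, 0.5
>>> merged = (g_step1 + g_step2) / 2  # avg=True; avg=False would sum them
>>> merged
0.4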
property fused_passes

fused_passes is used to configure the fusion of the computation in the model, containing the following configs:

enable (bool): whether to enable fused passes. Default: False.

gemm_epilogue (bool): whether to fuse matmul and add computation in the Linear layer. Default: False.

dropout_add (bool): whether to fuse dropout and add computation. Default: False.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.fused_passes.enable = True
>>> strategy.fused_passes.gemm_epilogue = True
>>> strategy.fused_passes.dropout_add = True
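
What gemm_epilogue targets, conceptually: the matmul and bias add inside a Linear layer, which the pass emits as a single kernel. A short unfused reference in eager mode (illustrative; the fusion itself happens during static-graph compilation):

>>> # Unfused reference: Linear computes matmul(x, W) + b as two ops;
>>> # with gemm_epilogue enabled, the static graph fuses them into one.
>>> x = paddle.randn([4, 8])
>>> linear = paddle.nn.Linear(8, 16)
>>> y = linear(x)
>>> y.shape
[4, 16]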
property pipeline

pipeline is used to configure the pipeline parallelism in training, containing the following configs:

enable (bool): whether to enable pipeline parallelism. Default: False.

schedule_mode (str): the scheduling mode of pipeline parallelism. Default: "1F1B".

micro_batch_size (int): the size of each micro-batch inside a mini-batch. Default: 1.

accumulate_steps (int): the number of gradient accumulation steps. Default: 1.

Examples

>>> import paddle
>>> import paddle.distributed as dist

>>> strategy = dist.Strategy()

>>> strategy.pipeline.enable = True
>>> strategy.pipeline.micro_batch_size = 2
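
The remaining pipeline configs listed above can be set the same way; here schedule_mode uses the documented default and accumulate_steps an illustrative value:

>>> strategy.pipeline.schedule_mode = "1F1B"  # documented default
>>> strategy.pipeline.accumulate_steps = 4    # illustrative value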