paddle.distributed.sharding.group_sharded_parallel(model, optimizer, level, scaler=None, group=None, offload=False, sync_buffers=False, buffer_max_size=8388608, segment_size=1048576, sync_comm=False) [source]

Use group_sharded_parallel to apply group sharded configuration to the model, optimizer and GradScaler. level takes one of three string options, ‘os’, ‘os_g’ and ‘p_g_os’, corresponding to three usage scenarios: optimizer state segmentation, optimizer state + gradient segmentation, and parameter + gradient + optimizer state segmentation. In practice, optimizer state + gradient segmentation is a further optimization of optimizer state segmentation, so ‘os_g’ can be used wherever optimizer state segmentation alone would be used.
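To make the three levels concrete, here is a minimal, framework-free sketch of ZeRO-style sharding. It is an illustration only, not Paddle's implementation: the helper names (`shard_indices`, `sharded_state`) and the round-robin assignment are assumptions for exposition; Paddle's actual partitioning differs in detail.

```python
# Illustration only: which tensors a rank stores locally at each sharding
# level. This is NOT Paddle's internal code; names and the round-robin
# partitioning scheme are hypothetical.

def shard_indices(num_params, num_ranks, rank):
    """Round-robin assignment of parameter indices to one rank."""
    return [i for i in range(num_params) if i % num_ranks == rank]

def sharded_state(num_params, num_ranks, rank, level):
    """Return which tensor indices this rank keeps locally at each level."""
    owned = shard_indices(num_params, num_ranks, rank)
    state = {"optimizer_state": owned}        # 'os': optimizer state only
    if level in ("os_g", "p_g_os"):
        state["gradients"] = owned            # 'os_g': also gradient shards
    if level == "p_g_os":
        state["parameters"] = owned           # 'p_g_os': also parameter shards
    return state

# Example: 4 parameters sharded over 2 ranks at level 'os_g'.
print(sharded_state(4, num_ranks=2, rank=0, level="os_g"))
# {'optimizer_state': [0, 2], 'gradients': [0, 2]}
```

At every level each rank holds only its own slice of the optimizer state; deeper levels additionally shard gradients and, finally, the parameters themselves, trading communication for memory.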

  • model (Layer) – The layer to be wrapped with group_sharded_parallel.

  • optimizer (Optimizer) – The optimizer to be wrapped with group_sharded_parallel.

  • level (str) – The sharding level. One of os, os_g, or p_g_os.

  • scaler (GradScaler, optional) – If AMP is used, you need to pass GradScaler. Defaults to None, indicating that GradScaler is not used.

  • group (Group, optional) – The group instance. Defaults to None, indicating that the default environment group is used.

  • offload (bool, optional) – Whether to offload parameters and optimizer state to CPU memory. Defaults to False, which means that offload is not used.

  • sync_buffers (bool, optional) – Whether to broadcast model buffers. It is generally used when there are registered model buffers. Defaults to False, indicating that model buffers are not broadcast.

  • buffer_max_size (int, optional) – The maximum size of the buffer used to fuse gradients in os_g. The larger the size, the more GPU memory will be used. Defaults to 2**23 (8 MB).

  • segment_size (int, optional) – The smallest size of a parameter segment to be sharded in p_g_os. Defaults to 2**20, meaning the minimum segment size is 2**20.

  • sync_comm (bool, optional) – Whether to use synchronous communication; only used in p_g_os. Defaults to False, indicating that asynchronous communication is used.
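To illustrate what a byte budget like buffer_max_size means, here is a simplified greedy bucketing sketch. The helper `bucket_gradients` is hypothetical: Paddle's internal gradient-fusion logic is more involved, but the idea of packing gradients into fixed-size buckets is the same.

```python
# Hypothetical illustration of gradient bucketing under a byte budget.
# NOT Paddle's internal fusion code; shown only to explain buffer_max_size.

def bucket_gradients(grad_sizes, buffer_max_size=2**23):
    """Greedily pack gradient sizes (in bytes) into buckets no larger than
    buffer_max_size; a gradient larger than the budget gets its own bucket."""
    buckets, current, used = [], [], 0
    for size in grad_sizes:
        # Flush the current bucket if adding this gradient would overflow it.
        if current and used + size > buffer_max_size:
            buckets.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        buckets.append(current)
    return buckets

# Three 4 MB gradients with an 8 MB budget -> two buckets: [4+4 MB], [4 MB].
print(bucket_gradients([4 * 2**20] * 3, buffer_max_size=2**23))
```

A larger buffer_max_size fuses more gradients per communication call (fewer, larger transfers) at the cost of more GPU memory held by the buffer.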


Returns

model: A group sharded wrapper for the given model. optimizer: A group sharded wrapper for the given optimizer. scaler: A group sharded wrapper for the given scaler.

Return type

tuple

# required: distributed
import paddle
from paddle.distributed import fleet
from paddle.distributed.sharding import group_sharded_parallel

fleet.init(is_collective=True)
group = paddle.distributed.new_group([0, 1])
model = paddle.nn.Linear(1000, 1000)

clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)
optimizer = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters(), weight_decay=0.00001, grad_clip=clip)
scaler = paddle.amp.GradScaler(init_loss_scaling=32768)

# wrap sharding model, optimizer and scaler
model, optimizer, scaler = group_sharded_parallel(model, optimizer, "p_g_os", scaler=scaler)

img, label = data  # data comes from the user's dataloader
label.stop_gradient = True
img.stop_gradient = True

out = model(img)
loss = paddle.nn.functional.cross_entropy(input=out, label=label)

loss.backward()
optimizer.step()
optimizer.clear_grad()