group_sharded_parallel

paddle.distributed.sharding.group_sharded_parallel(model, optimizer, level, scaler=None, group=None, offload=False, sync_buffers=False, buffer_max_size=8388608, segment_size=1048576, sync_comm=False, dp_group=None, exclude_layer=None) [source]

group_sharded_parallel applies group sharded configuration to the given model, optimizer and GradScaler. level takes one of three string options, ‘os’, ‘os_g’ and ‘p_g_os’, corresponding to three usage scenarios: optimizer state sharding, optimizer state + gradient sharding, and parameter + gradient + optimizer state sharding. In practice, optimizer state + gradient sharding is a further optimization of optimizer state sharding, so ‘os_g’ can also be used where plain optimizer state sharding is wanted.
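
A minimal sketch of the three level choices (assuming model, optimizer and scaler have already been constructed as in the example below):

>>> from paddle.distributed.sharding import group_sharded_parallel

>>> # 'os'     : shard optimizer states only
>>> # 'os_g'   : shard optimizer states and gradients
>>> # 'p_g_os' : shard parameters, gradients and optimizer states
>>> model, optimizer, scaler = group_sharded_parallel(model, optimizer, level="os_g", scaler=scaler)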

Parameters
  • model (Layer) – The layer to be wrapped with group_sharded_parallel.

  • optimizer (Optimizer) – The optimizer to be wrapped with group_sharded_parallel.

  • level (str) – The sharding level; one of ‘os’, ‘os_g’ or ‘p_g_os’.

  • scaler (GradScaler, optional) – If AMP is used, you need to pass GradScaler. Defaults to None, indicating that GradScaler is not used.

  • group (Group, optional) – The group instance. Defaults to None, indicating that the default environment group is used.

  • offload (bool, optional) – Whether to enable CPU offload to save GPU memory. Defaults to False, which means offload is not used.

  • sync_buffers (bool, optional) – Whether to broadcast model buffers across ranks. It is generally needed when the model has registered buffers. Defaults to False, indicating that model buffers are not synchronized.

  • buffer_max_size (int, optional) – The maximum size of the buffer used to fuse gradients in ‘os_g’. The larger the buffer, the more GPU memory is used. Defaults to 2**23 (8388608).

  • segment_size (int, optional) – The smallest size of a parameter to be sharded in ‘p_g_os’. Defaults to 2**20 (1048576).

  • sync_comm (bool, optional) – Whether to use synchronous communication; only used in ‘p_g_os’. Defaults to False, indicating that asynchronous communication is used.

  • dp_group (Group, optional) – The data parallel communication group, which allows combining stage2 (‘os_g’) or stage3 (‘p_g_os’) sharding with data parallel hybrid communication. Defaults to None.

  • exclude_layer (list, optional) – Layers to exclude from slicing in sharding stage3 (‘p_g_os’). Each entry must be either a layer class name or a layer’s id, for example exclude_layer=["GroupNorm", id(model.gpt.linear)]. Defaults to None; see the sketch after this parameter list.

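For the stage3-specific options, the hedged sketch below shows how offload, sync_comm and exclude_layer might be combined at the ‘p_g_os’ level. model.norm is a hypothetical sub-layer used only for illustration, and the dp_group note assumes a hybrid fleet setup:

>>> # sketch: stage3 sharding with CPU offload, synchronous communication,
>>> # and GroupNorm layers plus one specific sub-layer excluded from slicing
>>> model, optimizer, _ = group_sharded_parallel(
...     model,
...     optimizer,
...     level="p_g_os",
...     offload=True,
...     sync_comm=True,
...     exclude_layer=["GroupNorm", id(model.norm)],  # model.norm is hypothetical
... )

>>> # when combining sharding with data parallelism, a data parallel group can be
>>> # passed as dp_group, e.g. one obtained from fleet's hybrid communicate group
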
Returns

  • model – The given model wrapped for group sharding.

  • optimizer – The given optimizer wrapped for group sharding.

  • scaler – The given scaler wrapped for group sharding.

Return type

tuple of (model, optimizer, scaler)

Examples

>>> import paddle
>>> from paddle.nn import Linear
>>> from paddle.distributed import fleet
>>> from paddle.distributed.sharding import group_sharded_parallel

>>> fleet.init(is_collective=True)
>>> group = paddle.distributed.new_group([0, 1])
>>> model = Linear(1000, 1000)

>>> clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)
>>> optimizer = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters(), weight_decay=0.00001, grad_clip=clip)
>>> scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

>>> # wrap sharding model, optimizer and scaler
>>> model, optimizer, scaler = group_sharded_parallel(model, optimizer, "p_g_os", scaler=scaler, group=group)

>>> # a dummy batch standing in for real training data
>>> img = paddle.randn([64, 1000])
>>> label = paddle.randint(low=0, high=1000, shape=[64, 1])
>>> label.stop_gradient = True
>>> img.stop_gradient = True

>>> out = model(img)
>>> loss = paddle.nn.functional.cross_entropy(input=out, label=label)

>>> loss.backward()
>>> optimizer.step()
>>> optimizer.clear_grad()
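
When AMP is enabled, the wrapped scaler returned above is typically used together with paddle.amp.auto_cast rather than calling backward on the unscaled loss. A minimal sketch continuing the example, following the standard paddle.amp pattern (adapt it to your own training loop):

>>> with paddle.amp.auto_cast():
...     out = model(img)
...     loss = paddle.nn.functional.cross_entropy(input=out, label=label)

>>> scaler.scale(loss).backward()
>>> scaler.step(optimizer)
>>> scaler.update()
>>> optimizer.clear_grad()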