class paddle.fluid.clip.GradientClipByNorm(clip_norm, need_clip=None)[source]

Limit the L2 norm of a multi-dimensional Tensor $X$ to clip_norm .

• If the L2 norm of $X$ is greater than clip_norm , $X$ will be rescaled so that its L2 norm equals clip_norm .

• If the L2 norm of $X$ is less than or equal to clip_norm , $X$ is left unchanged.

The multi-dimensional Tensor $X$ is not passed to this class directly; it stands for the gradient of each parameter in the Program . If need_clip is not None, only the gradients it selects are clipped.

Gradient clipping takes effect after being set in the optimizer ; see the optimizer documentation (for example: SGDOptimizer).

The clipping formula is:

$$Out = \begin{cases} X, & \text{if } norm(X) \leq clip\_norm \\ \dfrac{clip\_norm \cdot X}{norm(X)}, & \text{if } norm(X) > clip\_norm \end{cases}$$

where $norm(X)$ denotes the L2 norm of $X$:

$norm(X) = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{\frac{1}{2}}$
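The clipping formula above can be sketched in plain NumPy (a standalone illustration of the math, not part of the Paddle API):

```python
import numpy as np

def clip_by_norm(x, clip_norm):
    # L2 norm over all elements of x
    norm = np.sqrt(np.sum(np.square(x)))
    if norm <= clip_norm:
        # norm(X) <= clip_norm: return X unchanged
        return x
    # norm(X) > clip_norm: rescale so the L2 norm equals clip_norm
    return clip_norm * x / norm

x = np.array([3.0, 4.0])                 # norm(x) = 5
y = clip_by_norm(x, clip_norm=1.0)       # -> [0.6, 0.8], norm(y) = 1.0
z = clip_by_norm(np.array([0.3, 0.4]),   # norm = 0.5 <= 1.0,
                 clip_norm=1.0)          # so z is unchanged
```

Note that the direction of $X$ is preserved; only its magnitude is scaled down.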
Parameters
• clip_norm (float) – The maximum norm value.

• need_clip (function, optional) – A function that accepts a Parameter and returns a bool (True: the gradient of this Parameter needs to be clipped; False: it does not). Default: None, in which case the gradients of all parameters in the network are clipped.

Examples

# use for Static mode
import paddle.fluid as fluid
import numpy as np

main_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(
        main_program=main_prog, startup_program=startup_prog):
    image = fluid.data(
        name='x', shape=[-1, 2], dtype='float32')
    predict = fluid.layers.fc(input=image, size=3, act='relu')  # Trainable parameters: fc_0.w_0, fc_0.b_0
    loss = fluid.layers.mean(predict)

    # Clip all parameters in network:
    clip = fluid.clip.GradientClipByNorm(clip_norm=1.0)

    # Clip a part of parameters in network: (e.g. fc_0.w_0)
    # pass a function (filter_func) to need_clip; filter_func receives a Parameter and returns a bool
    # def filter_func(Parameter):
    #     # Parameters can be filtered by Parameter.name (the name can be set in fluid.ParamAttr;
    #     # the default names here are fc_0.w_0 and fc_0.b_0)
    #     return Parameter.name == "fc_0.w_0"
    # clip = fluid.clip.GradientClipByNorm(clip_norm=1.0, need_clip=filter_func)

    sgd_optimizer = fluid.optimizer.SGDOptimizer(
        learning_rate=0.1, grad_clip=clip)
    sgd_optimizer.minimize(loss)

place = fluid.CPUPlace()
exe = fluid.Executor(place)
x = np.random.uniform(-100, 100, (10, 2)).astype('float32')
exe.run(startup_prog)
out = exe.run(main_prog, feed={'x': x}, fetch_list=[loss])

# use for Dygraph mode
import paddle.fluid as fluid
import numpy as np

with fluid.dygraph.guard():
    linear = fluid.dygraph.Linear(10, 10)  # Trainable parameters: linear_0.w_0, linear_0.b_0
    inputs = np.random.uniform(0, 1, [32, 10]).astype('float32')
    out = linear(fluid.dygraph.to_variable(inputs))
    loss = fluid.layers.reduce_mean(out)
    loss.backward()

    # Clip all parameters in network:
    clip = fluid.clip.GradientClipByNorm(clip_norm=1.0)

    # Clip a part of parameters in network: (e.g. linear_0.w_0)
    # pass a function (filter_func) to need_clip; filter_func receives a ParamBase and returns a bool
    # def filter_func(ParamBase):
    #     # Parameters can be filtered by ParamBase.name (the name can be set in fluid.ParamAttr;
    #     # the default names here are linear_0.w_0 and linear_0.b_0)
    #     return ParamBase.name == "linear_0.w_0"
    #     # Note: linear.weight and linear.bias return the weight and bias of dygraph.Linear,
    #     # so the filter can equivalently be written as:
    #     # return ParamBase.name == linear.weight.name
    # clip = fluid.clip.GradientClipByNorm(clip_norm=1.0, need_clip=filter_func)

    sgd_optimizer = fluid.optimizer.SGD(
        learning_rate=0.1, parameter_list=linear.parameters(), grad_clip=clip)
    sgd_optimizer.minimize(loss)