TensorCheckerConfig

class paddle.amp.debugging.TensorCheckerConfig(enable: bool, debug_mode: DebugMode = DebugMode.CHECK_NAN_INF_AND_ABORT, output_dir: str | None = None, checked_op_list: Sequence[str] | None = None, skipped_op_list: Sequence[str] | None = None, debug_step: Sequence[int] | None = None, stack_height_limit: int = 1)

This class collects the configuration for checking NaN and Inf values in the tensors of a module or operator. It takes the following arguments:

Parameters

enable (bool) – Whether to enable the detection of NaN and Inf values in tensors. The signature above gives this argument no default, so it must be passed explicitly; False means that these tools will not be used.
debug_mode (DebugMode, optional) – Determines the type of debugging to be used, for example DebugMode.CHECK_NAN_INF or DebugMode.CHECK_NAN_INF_AND_ABORT. Default is DebugMode.CHECK_NAN_INF_AND_ABORT.
output_dir (string|None, optional) – The path to store collected data. If this parameter is set to None, the data will be printed to the terminal. Default is None.
checked_op_list (list|tuple|None, optional) – Specifies a list of operators that need to be checked during program execution. For example, checked_op_list=['elementwise_add', 'conv2d'] indicates that the output results of elementwise_add and conv2d should be checked for nan/inf during program execution. Default is None.
skipped_op_list (list|tuple|None, optional) – Specifies a list of operators that do not need to be checked during program execution. For example, skipped_op_list=['elementwise_add', 'conv2d'] indicates that the output results of elementwise_add and conv2d should not be checked for nan/inf during program execution. Default is None.
debug_step (list|tuple|None, optional) – A list or tuple used primarily for nan/inf checking during model training. For example, debug_step=[1,5] indicates that nan/inf checking should only be performed on model training iterations 1 to 5. Default is None.
stack_height_limit (int, optional) – The maximum depth of the call stack printed at the error location. Currently, only enabling or disabling call stack printing is supported: set stack_height_limit to 1 to print the corresponding C++ call stack when NaN is detected in a GPU kernel, or to 0 to disable it. Default is 1.

Examples

>>> import paddle

>>> checker_config = paddle.amp.debugging.TensorCheckerConfig(enable=True, debug_mode=paddle.amp.debugging.DebugMode.CHECK_NAN_INF)
>>> paddle.amp.debugging.enable_tensor_checker(checker_config)

>>> x = paddle.to_tensor([1, 0, 3], place=paddle.CPUPlace(), dtype='float32', stop_gradient=False)
>>> y = paddle.to_tensor([0.2, 0, 0.5], place=paddle.CPUPlace(), dtype='float32')
>>> res = paddle.pow(x, y)
>>> paddle.autograd.backward(res, retain_graph=True)
>>> paddle.amp.debugging.disable_tensor_checker()

>>> # [PRECISION] [ERROR] in [device=cpu, op=elementwise_pow_grad, tensor=, dtype=fp32], numel=3, num_nan=1, num_inf=0, num_zero=0, max=2.886751e-01, min=2.000000e-01, mean=-nan

>>> # With DebugMode.CHECK_NAN_INF_AND_ABORT and stack_height_limit=1, the call stack is also printed:
>>> # Traceback (most recent call last):
>>> #     res = paddle.pow(x, y)
>>> #   File "/usr/local/lib/python3.8/dist-packages/paddle/tensor/math.py", line 447, in pow
>>> #     return _C_ops.elementwise_pow(x, y)
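
To see where the single NaN in the report above may come from, the gradient of pow with respect to x can be reproduced by hand. The sketch below uses NumPy and the analytic formula d/dx x^y = y * x^(y - 1). It assumes that the elementwise_pow_grad tensor being checked is this gradient; Paddle's kernel may compute it differently.

```python
import numpy as np

# Same inputs as in the example above.
x = np.array([1.0, 0.0, 3.0], dtype=np.float32)
y = np.array([0.2, 0.0, 0.5], dtype=np.float32)

# Analytic gradient of x**y with respect to x: y * x**(y - 1).
# At x = 0, y = 0 this evaluates 0 * 0**(-1) = 0 * inf, which is NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    grad_x = y * np.power(x, y - 1.0)

# Summary statistics of the kind the checker reports.
num_nan = int(np.isnan(grad_x).sum())
num_inf = int(np.isinf(grad_x).sum())
finite = grad_x[np.isfinite(grad_x)]

print(f"num_nan={num_nan}, num_inf={num_inf}, "
      f"max={finite.max():.6e}, min={finite.min():.6e}")
```

Under that assumption, the NaN count and the finite maximum and minimum agree with the logged line: the middle element is the only NaN, and the other two gradients are 0.2 and roughly 0.2887.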