margin_cross_entropy

paddle.nn.functional.margin_cross_entropy(logits, label, margin1=1.0, margin2=0.5, margin3=0.0, scale=64.0, group=None, return_softmax=False, reduction='mean') [source]
\[L=-\frac{1}{N}\sum^N_{i=1}\log\frac{e^{s(\cos(m_{1}\theta_{y_i}+m_{2})-m_{3})}}{e^{s(\cos(m_{1}\theta_{y_i}+m_{2})-m_{3})}+\sum^n_{j=1,j\neq y_i} e^{s\cos\theta_{j}}}\]

where \(\theta_{y_i}\) is the angle between the feature \(x\) and the representation of class \(y_i\). For details of the ArcFace loss, refer to https://arxiv.org/abs/1801.07698.
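In effect, the target-class cosine is replaced by \(\cos(m_1\theta_{y_i}+m_2)-m_3\), every logit is scaled by \(s\), and an ordinary softmax cross entropy is taken. Below is a minimal NumPy sketch of that computation for intuition only; the helper name and the clipping are assumptions, not part of the Paddle API.

>>> import numpy as np
>>> def margin_softmax_ce(cosine, label, m1=1.0, m2=0.5, m3=0.0, s=64.0):
...     # angles between the normalized features and the class weights
...     theta = np.arccos(np.clip(cosine, -1.0, 1.0))
...     target = cosine.copy()
...     idx = np.arange(cosine.shape[0])
...     # apply the combined margin to the ground-truth class only
...     target[idx, label] = np.cos(m1 * theta[idx, label] + m2) - m3
...     logits = s * target
...     logits -= logits.max(axis=1, keepdims=True)   # numerical stability
...     log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
...     return -log_prob[idx, label]                  # per-sample loss, shape [N]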

Hint

This API supports single-GPU and multi-GPU execution, and does not support CPU. For data parallel mode, set group=False. For model parallel mode, set group=None for the global default group, or pass a group instance returned by paddle.distributed.new_group. Note that logits.shape[-1] can be different on each rank.
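For reference, the three group configurations map onto the call roughly as sketched below; the tensor names and ranks are placeholders, not taken from a real run.

>>> import paddle.distributed as dist
>>> # data parallel, no cross-rank communication:
>>> #     paddle.nn.functional.margin_cross_entropy(full_logits, label, group=False)
>>> # model parallel over the global default group:
>>> #     paddle.nn.functional.margin_cross_entropy(shard_logits, label, group=None)
>>> # model parallel over an explicit group (ranks are illustrative):
>>> #     mp_group = dist.new_group(ranks=[0, 1])
>>> #     paddle.nn.functional.margin_cross_entropy(shard_logits, label, group=mp_group)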

Parameters
  • logits (Tensor) – shape [N, local_num_classes], the output of the normalized X multiplied by the normalized W. The logits are the shard logits when using model parallelism.

  • label (Tensor) – shape [N] or shape [N, 1], the ground truth label.

  • margin1 (float, optional) – m1 of margin loss, default value is 1.0.

  • margin2 (float, optional) – m2 of margin loss, default value is 0.5.

  • margin3 (float, optional) – m3 of margin loss, default value is 0.0.

  • scale (float, optional) – s of margin loss, default value is 64.0.

  • group (Group, optional) – The group instance returned by paddle.distributed.new_group, or None for the global default group, or False for data parallel (no communication across ranks). Default is None.

  • return_softmax (bool, optional) – Whether to return the softmax probability. Default value is False.

  • reduction (str, optional) – The candidates are 'none' | 'mean' | 'sum'. If reduction is 'mean', return the average of the loss; if reduction is 'sum', return the sum of the loss; if reduction is 'none', no reduction will be applied. Default value is 'mean'.

Returns

Tensor|tuple[Tensor, Tensor]. Return the cross entropy loss if return_softmax is False, otherwise return the tuple (loss, softmax). softmax is the shard softmax when using model parallelism; otherwise softmax has the same shape as the input logits. If reduction is 'none' (or None), the shape of loss is [N, 1]; otherwise the shape is [].
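As a quick sketch of the return shapes only (uniform pretend cosines in [-1, 1], GPU assumed; the printed shapes restate the description above rather than a recorded run):

>>> import paddle
>>> paddle.device.set_device('gpu')
>>> logits = paddle.uniform([2, 4], dtype='float64', min=-1.0, max=1.0)
>>> label = paddle.randint(low=0, high=4, shape=[2], dtype='int64')
>>> loss_none = paddle.nn.functional.margin_cross_entropy(logits, label, reduction='none')
>>> loss_mean = paddle.nn.functional.margin_cross_entropy(logits, label, reduction='mean')
>>> print(loss_none.shape)   # [2, 1] -> one loss per sample
>>> print(loss_mean.shape)   # []     -> scalar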

Examples

>>> 
>>> import paddle
>>> paddle.seed(2023)
>>> paddle.device.set_device('gpu')
>>> m1 = 1.0
>>> m2 = 0.5
>>> m3 = 0.0
>>> s = 64.0
>>> batch_size = 2
>>> feature_length = 4
>>> num_classes = 4

>>> label = paddle.randint(low=0, high=num_classes, shape=[batch_size], dtype='int64')

>>> X = paddle.randn(
...     shape=[batch_size, feature_length],
...     dtype='float64')
>>> X_l2 = paddle.sqrt(paddle.sum(paddle.square(X), axis=1, keepdim=True))
>>> X = paddle.divide(X, X_l2)

>>> W = paddle.randn(
...     shape=[feature_length, num_classes],
...     dtype='float64')
>>> W_l2 = paddle.sqrt(paddle.sum(paddle.square(W), axis=0, keepdim=True))
>>> W = paddle.divide(W, W_l2)

>>> logits = paddle.matmul(X, W)
>>> loss, softmax = paddle.nn.functional.margin_cross_entropy(
...     logits, label, margin1=m1, margin2=m2, margin3=m3, scale=s, return_softmax=True, reduction=None)
>>> print(logits)
Tensor(shape=[2, 4], dtype=float64, place=Place(gpu:0), stop_gradient=True,
       [[-0.59561850,  0.32797505,  0.80279214,  0.00144975],
        [-0.16265212,  0.84155098,  0.62008629,  0.79126072]])
>>> print(label)
Tensor(shape=[2], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [1, 0])
>>> print(loss)
Tensor(shape=[2, 1], dtype=float64, place=Place(gpu:0), stop_gradient=True,
       [[61.94391901],
        [93.30853839]])
>>> print(softmax)
Tensor(shape=[2, 4], dtype=float64, place=Place(gpu:0), stop_gradient=True,
       [[0.00000000, 0.00000000, 1.        , 0.00000000],
        [0.00000000, 0.96152676, 0.00000067, 0.03847257]])
>>> 
>>> # Multi GPU, test_margin_cross_entropy.py
>>> import paddle
>>> import paddle.distributed as dist
>>> paddle.seed(2023)
>>> strategy = dist.fleet.DistributedStrategy()
>>> dist.fleet.init(is_collective=True, strategy=strategy)
>>> rank_id = dist.get_rank()
>>> m1 = 1.0
>>> m2 = 0.5
>>> m3 = 0.0
>>> s = 64.0
>>> batch_size = 2
>>> feature_length = 4
>>> num_class_per_card = [4, 8]
>>> num_classes = paddle.sum(paddle.to_tensor(num_class_per_card))

>>> label = paddle.randint(low=0, high=num_classes.item(), shape=[batch_size], dtype='int64')
>>> label_list = []
>>> dist.all_gather(label_list, label)
>>> label = paddle.concat(label_list, axis=0)

>>> X = paddle.randn(
...     shape=[batch_size, feature_length],
...     dtype='float64')
>>> X_list = []
>>> dist.all_gather(X_list, X)
>>> X = paddle.concat(X_list, axis=0)
>>> X_l2 = paddle.sqrt(paddle.sum(paddle.square(X), axis=1, keepdim=True))
>>> X = paddle.divide(X, X_l2)

>>> W = paddle.randn(
...     shape=[feature_length, num_class_per_card[rank_id]],
...     dtype='float64')
>>> W_l2 = paddle.sqrt(paddle.sum(paddle.square(W), axis=0, keepdim=True))
>>> W = paddle.divide(W, W_l2)

>>> logits = paddle.matmul(X, W)
>>> loss, softmax = paddle.nn.functional.margin_cross_entropy(
...     logits, label, margin1=m1, margin2=m2, margin3=m3, scale=s, return_softmax=True, reduction=None)
>>> print(logits)
>>> print(label)
>>> print(loss)
>>> print(softmax)

>>> # python -m paddle.distributed.launch --gpus=0,1 --log_dir log test_margin_cross_entropy.py
>>> # cat log/workerlog.0
>>> # Tensor(shape=[4, 4], dtype=float64, place=Place(gpu:0), stop_gradient=True,
>>> #        [[-0.59561850,  0.32797505,  0.80279214,  0.00144975],
>>> #         [-0.16265212,  0.84155098,  0.62008629,  0.79126072],
>>> #         [-0.59561850,  0.32797505,  0.80279214,  0.00144975],
>>> #         [-0.16265212,  0.84155098,  0.62008629,  0.79126072]])
>>> # Tensor(shape=[4], dtype=int64, place=Place(gpu:0), stop_gradient=True,
>>> #        [5, 4, 5, 4])
>>> # Tensor(shape=[4, 1], dtype=float64, place=Place(gpu:0), stop_gradient=True,
>>> #        [[104.27437027],
>>> #         [113.40243782],
>>> #         [104.27437027],
>>> #         [113.40243782]])
>>> # Tensor(shape=[4, 4], dtype=float64, place=Place(gpu:0), stop_gradient=True,
>>> #        [[0.00000000, 0.00000000, 0.01210039, 0.00000000],
>>> #         [0.00000000, 0.96152674, 0.00000067, 0.03847257],
>>> #         [0.00000000, 0.00000000, 0.01210039, 0.00000000],
>>> #         [0.00000000, 0.96152674, 0.00000067, 0.03847257]])
>>> # cat log/workerlog.1
>>> # Tensor(shape=[4, 8], dtype=float64, place=Place(gpu:1), stop_gradient=True,
>>> #        [[-0.34913275, -0.35180883, -0.53976657, -0.75234331,  0.70534995,
>>> #           0.87157838,  0.31064437,  0.19537700],
>>> #         [-0.63941012, -0.05631600, -0.02561853,  0.09363013,  0.56571130,
>>> #           0.13611246,  0.08849565,  0.39219619],
>>> #         [-0.34913275, -0.35180883, -0.53976657, -0.75234331,  0.70534995,
>>> #           0.87157838,  0.31064437,  0.19537700],
>>> #         [-0.63941012, -0.05631600, -0.02561853,  0.09363013,  0.56571130,
>>> #           0.13611246,  0.08849565,  0.39219619]])
>>> # Tensor(shape=[4], dtype=int64, place=Place(gpu:1), stop_gradient=True,
>>> #        [5, 4, 5, 4])
>>> # Tensor(shape=[4, 1], dtype=float64, place=Place(gpu:1), stop_gradient=True,
>>> #        [[104.27437027],
>>> #         [113.40243782],
>>> #         [104.27437027],
>>> #         [113.40243782]])
>>> # Tensor(shape=[4, 8], dtype=float64, place=Place(gpu:1), stop_gradient=True,
>>> #        [[0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00002368, 0.98787593,
>>> #          0.00000000, 0.00000000],
>>> #         [0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000002, 0.00000000,
>>> #          0.00000000, 0.00000000],
>>> #         [0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00002368, 0.98787593,
>>> #          0.00000000, 0.00000000],
>>> #         [0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000002, 0.00000000,
>>> #          0.00000000, 0.00000000]])