paddle.vision.ops.yolo_box(x, img_size, anchors, class_num, conf_thresh, downsample_ratio, clip_bbox=True, name=None, scale_x_y=1.0, iou_aware=False, iou_aware_factor=0.5) [source]

This operator generates YOLO detection boxes from the output of a YOLOv3 network.

The output of the previous network has shape [N, C, H, W], where H and W must be equal; together they specify the grid size. Each grid point predicts a fixed number of boxes; this number, denoted S below, is given by the number of anchors. In the second (channel) dimension, C must equal S * (5 + class_num) if iou_aware is false, and S * (6 + class_num) otherwise. class_num is the number of object categories in the source dataset (e.g. 80 for the COCO dataset). Besides the 4 box location coordinates x, y, w, h, the channel dimension also stores the confidence score and the one-hot class key of each anchor box.
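The channel-layout rule above can be sketched as a small helper (illustrative only, not part of the Paddle API): each of the S anchors stores [x, y, w, h, confidence, class_0, ..., class_{class_num-1}], plus one extra IoU channel when iou_aware is enabled.

```python
# Illustrative check of the channel layout described above.
def expected_channels(num_anchors, class_num, iou_aware=False):
    # Per anchor: 4 coordinates + 1 confidence (+ 1 iou channel) + class scores.
    per_anchor = (6 if iou_aware else 5) + class_num
    return num_anchors * per_anchor

# e.g. 3 anchors and 80 COCO classes:
assert expected_channels(3, 80) == 255                   # 3 * (5 + 80)
assert expected_channels(3, 80, iou_aware=True) == 258   # 3 * (6 + 80)
```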

Assume the 4 location coordinates are \(t_x, t_y, t_w, t_h\); the box predictions are computed as follows:

$$ b_x = \sigma(t_x) + c_x $$ $$ b_y = \sigma(t_y) + c_y $$ $$ b_w = p_w e^{t_w} $$ $$ b_h = p_h e^{t_h} $$

In the equations above, \((c_x, c_y)\) is the top-left corner of the current grid cell and \((p_w, p_h)\) is the anchor size.
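A minimal NumPy sketch of the decoding equations above, for a single prediction (a hypothetical helper, not the Paddle kernel); coordinates are in grid units:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # (cx, cy): top-left corner of the grid cell; (pw, ph): anchor size.
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# At tx = ty = tw = th = 0 the box centers half a cell past (cx, cy)
# and keeps the anchor's width and height:
print(decode_box(0.0, 0.0, 0.0, 0.0, 3, 4, 2, 5))  # (3.5, 4.5, 2.0, 5.0)
```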

The logistic regression value of the 5th channel of each anchor's predictions represents the confidence score of that prediction box, and the logistic regression values of the last class_num channels represent the classification scores. Boxes with confidence scores below conf_thresh are ignored; each box's final score is the product of its confidence score and its classification score.

$$ score_{pred} = score_{conf} * score_{class} $$
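The scoring rule above can be sketched in plain Python (illustrative, not the Paddle implementation): below-threshold boxes are dropped, otherwise every class score is scaled by the confidence.

```python
def final_scores(conf, class_probs, conf_thresh):
    # Boxes whose confidence falls below conf_thresh are ignored.
    if conf < conf_thresh:
        return [0.0] * len(class_probs)
    # score_pred = score_conf * score_class, per class.
    return [conf * p for p in class_probs]

print(final_scores(0.8, [0.5, 0.25], conf_thresh=0.01))   # ~[0.4, 0.2]
print(final_scores(0.005, [0.5, 0.25], conf_thresh=0.01)) # [0.0, 0.0]
```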

Parameters

  • x (Tensor) – The input tensor of the yolo_box operator, a 4-D tensor with shape [N, C, H, W]. The second (channel) dimension stores the box locations, confidence score, and classification one-hot keys of each anchor box. Generally, x should be the output of a YOLOv3 network. The data type is float32 or float64.

  • img_size (Tensor) – The image size tensor of the yolo_box operator, a 2-D tensor with shape [N, 2]. It holds the height and width of each input image, used for resizing the output boxes to the input image scale. The data type is int32.

  • anchors (list|tuple) – The anchor widths and heights, given as a flat list that is parsed pair by pair.

  • class_num (int) – The number of classes.

  • conf_thresh (float) – The confidence scores threshold of detection boxes. Boxes with confidence scores under threshold should be ignored.

  • downsample_ratio (int) – The downsample ratio from the network input to the yolo_box operator input, so 32, 16, and 8 should be set for the first, second, and third yolo_box layer, respectively.

  • clip_bbox (bool, optional) – Whether to clip output bounding boxes to the img_size boundary. Default True.

  • name (str, optional) – The default value is None. Normally there is no need for user to set this property. For more information, please refer to Name.

  • scale_x_y (float, optional) – Scale factor applied to the center point of decoded bounding boxes. Default 1.0.

  • iou_aware (bool, optional) – Whether to use IoU-aware confidence. Default False.

  • iou_aware_factor (float, optional) – The IoU-aware factor. Default 0.5.


Returns

boxes, a 3-D tensor with shape [N, M, 4] holding the coordinates of the boxes, and scores, a 3-D tensor with shape [N, M, class_num] holding the classification scores of the boxes.

Return type

Tensor


Examples

>>> import paddle

>>> x = paddle.rand([2, 14, 8, 8]).astype('float32')
>>> img_size = paddle.ones((2, 2)).astype('int32')
>>> boxes, scores = paddle.vision.ops.yolo_box(x,
...                                             img_size=img_size,
...                                             anchors=[10, 13, 16, 30],
...                                             class_num=2,
...                                             conf_thresh=0.01,
...                                             downsample_ratio=8,
...                                             clip_bbox=True,
...                                             scale_x_y=1.)
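The example's output shapes can be worked out without running Paddle, assuming the common YOLOv3 convention that M = H * W * S for S anchor pairs (a hypothetical helper, for illustration only):

```python
# Illustrative shape check for the example above, assuming M = H * W * S.
def yolo_box_output_shapes(n, c, h, w, num_anchor_values, class_num):
    s = num_anchor_values // 2          # anchors are given as (w, h) pairs
    assert c == s * (5 + class_num)     # channel layout without iou_aware
    m = h * w * s
    return (n, m, 4), (n, m, class_num)

# The example uses x of shape [2, 14, 8, 8], 2 anchor pairs, class_num=2:
boxes_shape, scores_shape = yolo_box_output_shapes(2, 14, 8, 8, 4, 2)
print(boxes_shape, scores_shape)  # (2, 128, 4) (2, 128, 2)
```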