Image Detection

PaddlePaddle Fluid implements several operators specific to image detection tasks. This article introduces the related APIs, grouped by model type.

General operations

Many common operations in image detection act on bounding boxes, including:

  • Encoding and decoding of bounding boxes: Convert a box between its encoded (offset) form and its decoded (coordinate) form. For example, the training phase encodes the ground-truth box relative to the prior box to obtain the training target value. For API Reference, please refer to api_fluid_layers_box_coder

  • Comparing and matching two sets of bounding boxes:

    • iou_similarity: Calculate the IoU (intersection over union) values between two sets of boxes. For API Reference, please refer to api_fluid_layers_iou_similarity

    • bipartite_match: Use a greedy bipartite matching algorithm to find, for each column of the distance matrix, the row with the largest distance. For API Reference, please refer to api_fluid_layers_bipartite_match

  • Getting classification and regression target values (target_assign) based on the bounding boxes and labels: Get the target values and the corresponding weights from the matched indices and the negative indices. For API Reference, please refer to api_fluid_layers_target_assign
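
The operations above can be illustrated with a short plain-Python sketch. It is not the Fluid implementation: the helper names (iou_matrix, greedy_bipartite_match, encode_center_size, decode_center_size) are hypothetical, boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, and the encoding is a simplified center-size scheme without the variance terms that the real box_coder operator supports.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def iou_matrix(rows, cols):
    """Pairwise IoU: one matrix row per box in `rows`, one column per box in `cols`."""
    return [[iou(r, c) for c in cols] for r in rows]

def greedy_bipartite_match(dist):
    """Repeatedly take the largest remaining positive entry and pair its row
    with its column. Returns, for each column, the matched row index or -1."""
    n_rows = len(dist)
    n_cols = len(dist[0]) if dist else 0
    col_to_row = [-1] * n_cols
    used_rows = set()
    for _ in range(min(n_rows, n_cols)):
        best = None
        for i in range(n_rows):
            if i in used_rows:
                continue
            for j in range(n_cols):
                if col_to_row[j] != -1:
                    continue
                if dist[i][j] > 0 and (best is None or dist[i][j] > best[0]):
                    best = (dist[i][j], i, j)
        if best is None:  # no positive entries left
            break
        _, i, j = best
        col_to_row[j] = i
        used_rows.add(i)
    return col_to_row

def encode_center_size(prior, gt):
    """Encode a ground-truth box as offsets relative to a prior box (no variances)."""
    pw, ph = prior[2] - prior[0], prior[3] - prior[1]
    pcx, pcy = prior[0] + pw / 2, prior[1] + ph / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gcx, gcy = gt[0] + gw / 2, gt[1] + gh / 2
    return ((gcx - pcx) / pw, (gcy - pcy) / ph, math.log(gw / pw), math.log(gh / ph))

def decode_center_size(prior, target):
    """Invert encode_center_size: recover (xmin, ymin, xmax, ymax) from offsets."""
    pw, ph = prior[2] - prior[0], prior[3] - prior[1]
    pcx, pcy = prior[0] + pw / 2, prior[1] + ph / 2
    cx, cy = pcx + target[0] * pw, pcy + target[1] * ph
    w, h = pw * math.exp(target[2]), ph * math.exp(target[3])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Rows are ground-truth boxes, columns are prior boxes.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
priors = [(1, 1, 11, 11), (19, 19, 29, 29), (50, 50, 60, 60)]
similarity = iou_matrix(ground_truth, priors)
matches = greedy_bipartite_match(similarity)
```

Here matches[j] holds the index of the ground-truth box assigned to prior j, or -1 when the prior is unmatched. Those indices are the kind of input that a target-assignment step would consume.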

Faster RCNN

Faster RCNN is a typical two-stage object detector. Compared with traditional region proposal methods, the RPN (Region Proposal Network) in Faster RCNN greatly improves proposal efficiency by sharing convolution layer parameters, and it produces high-quality region proposals. The RPN compares the input anchors with the ground-truth boxes to generate initial candidate regions, and assigns classification and regression target values to the candidate boxes. The following four APIs are required:

  • rpn_target_assign: Assign the RPN classification and regression target values to each anchor, based on the anchors and the ground-truth boxes. For API Reference, please refer to api_fluid_layers_rpn_target_assign

  • anchor_generator: Generate a series of anchors for each location. For API Reference, please refer to api_fluid_layers_anchor_generator

  • generate_proposal_labels: Get the classification and regression target values of the RCNN part from the ground-truth boxes and the candidate boxes produced by generate_proposals. For API Reference, please refer to api_fluid_layers_generate_proposal_labels

  • generate_proposals: Decode the boxes output by the RPN and select new region proposals. For API Reference, please refer to api_fluid_layers_generate_proposals
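
As a rough illustration of the first two steps, the following plain-Python sketch generates anchors on a feature map grid and labels them by IoU against the ground-truth boxes. It is not the Fluid implementation. The function names are hypothetical, the 0.7 and 0.3 thresholds are the values commonly used for RPN training and are assumed here, and the extra rule that keeps the best anchor for each ground-truth box as a positive is omitted for brevity.

```python
import math

def generate_anchors(feat_h, feat_w, stride, sizes, ratios):
    """One anchor per (location, size, ratio), centered on each feature-map cell.
    `ratio` is height / width; every anchor of a given size has area size ** 2."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx = (x + 0.5) * stride  # cell center in input-image pixels
            cy = (y + 0.5) * stride
            for size in sizes:
                for ratio in ratios:
                    w = size / math.sqrt(ratio)
                    h = size * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (foreground), 0 (background) or -1 (ignored) for each anchor."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)  # neither clearly positive nor clearly negative
    return labels

# A 2x2 feature map with stride 16 and a single square 16-pixel anchor per cell.
anchors = generate_anchors(2, 2, 16, sizes=[16], ratios=[1.0])
labels = label_anchors(anchors, gt_boxes=[(0, 0, 16, 16)])
```

Anchors labeled -1 fall between the two thresholds and would typically be excluded from the RPN loss.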

SSD

SSD, short for Single Shot MultiBox Detector, is a single-stage object detection algorithm that combines fast detection speed with high detection accuracy. Unlike two-stage detection methods, a single-stage detector does not generate region proposals. Instead, it directly regresses the bounding boxes and classification probabilities of the targets from the feature maps. The SSD network computes the loss and performs prediction on feature maps at six scales. SSD requires the following five APIs:

  • prior_box: Generate a series of candidate boxes for each input position based on different parameters. For API Reference, please refer to api_fluid_layers_prior_box

  • multi_box_head: Get the predicted position and confidence for each prior box. For API Reference, please refer to api_fluid_layers_multi_box_head

  • detection_output: Decode the predicted boxes relative to the prior boxes and obtain the detection result through multi-class NMS. For API Reference, please refer to api_fluid_layers_detection_output

  • ssd_loss: Calculate the loss from the predicted position offsets and confidences, the prior box positions, and the ground-truth box positions and labels. For API Reference, please refer to api_fluid_layers_ssd_loss

  • detection_map: Evaluate the SSD network model using mAP (mean average precision). For API Reference, please refer to api_fluid_layers_detection_map
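
The multi-class NMS step used when producing the detection output can be sketched in plain Python as follows. This is an illustrative simplification, not the Fluid operator: the function names, the default thresholds, and the convention that class 0 is the background are all assumptions, and the boxes are assumed to be already decoded to (xmin, ymin, xmax, ymax).

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45, score_threshold=0.01):
    """Greedy NMS for one class: visit boxes by descending score and keep a box
    only if it does not overlap an already kept box by more than the threshold."""
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

def multiclass_nms(boxes, class_scores, background=0, keep_top_k=200):
    """class_scores[c][i] is the score of box i for class c.
    Returns (class, score, box) tuples sorted by descending score."""
    results = []
    for cls, scores in enumerate(class_scores):
        if cls == background:
            continue
        for i in nms(boxes, scores):
            results.append((cls, scores[i], boxes[i]))
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:keep_top_k]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
class_scores = [
    [0.1, 0.1, 0.1],  # class 0: background, skipped
    [0.9, 0.8, 0.0],  # class 1: boxes 0 and 1 overlap heavily
    [0.0, 0.0, 0.7],  # class 2
]
detections = multiclass_nms(boxes, class_scores)
```

In this example the second box is suppressed for class 1 because it overlaps the higher-scoring first box, so two detections remain.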

OCR

Scene text recognition converts image information into a sequence of characters under difficult conditions such as complex backgrounds, low resolution, diverse fonts, and randomly distributed text. It can be considered a special translation process: translating image input into natural language output. The OCR task needs to apply transformations to irregular bounding boxes, which requires the following two APIs:

  • roi_perspective_transform: Make a perspective transformation on the input RoI. For API Reference, please refer to api_fluid_layers_roi_perspective_transform

  • polygon_box_transform: Coordinate transformation of the irregular bounding box. For API Reference, please refer to api_fluid_layers_polygon_box_transform