
[intensive reading] object detection series (10) FPN: introducing multi-scale with feature pyramid

2022-05-15 07:34:28 chaibubble

Object detection series:

  • Object detection series (1) R-CNN: the pioneering work of CNN-based object detection
  • Object detection series (2) SPP-Net: sharing the convolution computation
  • Object detection series (3) Fast R-CNN: end-to-end training made pleasant
  • Object detection series (4) Faster R-CNN: Fast R-CNN with RPN
  • Object detection series (5) YOLO: another way into object detection
  • Object detection series (6) SSD: balancing efficiency and accuracy
  • Object detection series (7) R-FCN: position-sensitive Faster R-CNN
  • Object detection series (8) YOLOv2: better, faster, stronger
  • Object detection series (9) YOLOv3: drawing on many schools to form its own
  • Object detection series (10) FPN: introducing multi-scale with feature pyramids
  • Object detection series (11) RetinaNet: the pinnacle of one-stage detectors
  • Object detection series (12) CornerNet: the beginning of anchor-free
  • Object detection series (13) CenterNet: no anchor, no NMS
  • Object detection series (14) FCOS: treating object detection like image segmentation

Object detection extended series:

  • Object detection extended series (1) Selective Search: the selective search algorithm
  • Object detection extended series (2) OHEM: online hard example mining
  • Object detection extended series (3) Differences between the loss functions of Faster R-CNN, YOLO, SSD, YOLOv2 and YOLOv3

Preface: introducing multi-scale with feature pyramids

The SSD algorithm proved that multi-layer branches are effective for object detection. Before that, two-stage detectors had been optimized and improved over several generations, but none of them used a multi-scale method. FPN finally brought multi-scale into two-stage detection and, building on SSD's multi-layer branching, proposed the feature pyramid network. The FPN paper is "Feature Pyramid Networks for Object Detection".

FPN principle

Design concept

The concept of a feature pyramid has existed for a long time. For example, back when object detection still relied on sliding windows, feature pyramids were already used to cope with changes in object scale.

The figure above shows four pyramid-related structures:

  • Figure (a) is an image pyramid. The pyramid is built simply by rescaling the image, and features are then extracted from each scale separately. The features at different scales are independent of each other and the process is very time-consuming. Of course, it has nothing whatsoever to do with CNNs;
  • Figure (b) is a conventional convolutional network. Each level of the pyramid is no longer an image but an extracted feature map, and the prediction is made only on the last feature map. YOLO and Faster R-CNN both work this way, using a single layer of information;
  • Figure (c) has the same forward pass as figure (b). The difference is that predictions are pulled from several layers, which introduces the notion of multi-scale. SSD is an example;
  • Figure (d) shares the bottom-up pass with the previous two, but adds a reverse top-down pass, with lateral connections between the two passes. The final predictions are made on the feature maps of the second pass. The authors call this structure a feature pyramid network. The details follow.

Advantages of feature pyramid

SSD was the first to bring a multi-layer feature-map branching structure into object detection, and at first glance the feature pyramid looks much the same as SSD's multi-layer branches. So what is better about the feature pyramid?

The figure above shows the SSD structure. First of all, the branches SSD selects are not shallow enough. Its shallowest branch is taken from the conv4-3 layer of VGG-16, which already sits in the penultimate convolution block of VGG-16, and most of SSD's 6 branches are pulled from the extra layers that SSD appends after VGG-16. Earlier feature maps have been shown to be useful for small object detection, but SSD does not use them. In that case, can an SSD-like structure simply pull branches from the earlier feature maps? Doing it directly does not work well, because the earlier feature maps carry too little semantic information. We therefore need a structure that can make use of the earlier feature maps, and that structure is the feature pyramid. Compared with SSD's multi-layer branches, the feature pyramid differs in two ways:

  • Layer-by-layer upsampling from top to bottom
  • Lateral $1\times1$ convolution connections

The upsampling is applied layer by layer starting from the top-level feature map (the paper uses plain nearest-neighbor 2× upsampling; a deconvolution layer would play the same role). Upsampling restores the spatial size of the top-level feature map and carries along the semantic information extracted at the top, which suppresses the background and places foreground objects back at their corresponding positions. This is also why deconvolution plays such an important role in image segmentation. But this kind of restoration inevitably loses information: if we shrink an image by a factor of 2 and then enlarge it again, it is bound to come back blurred. Small objects that have already vanished from the top-level feature map because of downsampling will not be brought back by upsampling. The feature maps of the bottom-up path are therefore needed as well, so FPN fuses the two. The $1\times1$ convolution matches the channel counts so that the two feature maps can be added element-wise at corresponding positions. It looks like this:

After fusion and before the prediction outputs, each level goes through one more convolution that does not downsample (a $3\times3$ convolution in the paper), which reduces the aliasing caused by merging the upsampled map with the original feature map.

This structure lets FPN combine high-resolution feature maps carrying low-level semantics with low-resolution feature maps carrying high-level semantics.
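The fusion of one pyramid level can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the paper's implementation: feature maps are nested lists of shape [C][H][W], upsampling is nearest-neighbor, and the helper names (`conv1x1`, `upsample2x`, `merge_level`) and the example weights are made up for the illustration. The smoothing convolution applied after the merge is left out.

```python
def conv1x1(fmap, weight):
    """1x1 convolution: mixes channels independently at every pixel.

    fmap has shape [C_in][H][W], weight has shape [C_out][C_in].
    """
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weight[o][i] * fmap[i][y][x] for i in range(c_in))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weight))]


def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a [C][H][W] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def merge_level(top_down, bottom_up, lateral_weight):
    """Upsample the coarser top-down map and add the 1x1-projected bottom-up map."""
    up = upsample2x(top_down)
    lateral = conv1x1(bottom_up, lateral_weight)
    return [[[up[c][y][x] + lateral[c][y][x] for x in range(len(up[0][0]))]
             for y in range(len(up[0]))]
            for c in range(len(up))]


# Hypothetical example: a 1-channel 1x1 top-level map and a 2-channel 2x2 bottom-up map.
top = [[[10.0]]]
bottom = [[[1.0, 2.0], [3.0, 4.0]],
          [[2.0, 2.0], [2.0, 2.0]]]
weight = [[1.0, 0.5]]  # projects 2 channels down to 1 so the maps can be added
merged = merge_level(top, bottom, weight)
print(merged)  # [[[12.0, 13.0], [14.0, 15.0]]]
```

In the example the top-level value 10.0 is spread over all four positions, while the per-position detail comes entirely from the bottom-up map, which is the division of labor described above.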

Speaking of which, if you rotate the feature pyramid figure by 180 degrees, so that bottom-up and top-down swap places, you will find that this is clearly a UNet. ╮( ̄▽  ̄)╭ But this also shows one thing: deconvolution and lateral connections, which work in semantic segmentation, work in object detection too.

Applying FPN to RPN

To apply FPN to RPN for region proposals, first recall the traditional RPN shown in the figure below, which has a single feature map. On this feature map RPN uses $3\times3\times256$ convolution kernels, 256 of them in total, so each sliding position outputs a $1\times1\times256$ feature, the 256-d vector in the figure below. This feature then feeds two branches:

  • The first branch (reg layer) convolves the 256-d feature with $4k$ kernels of size $1\times1\times256$ and outputs $4k$ numbers, where 4 is the number of parameters of one proposal box, namely (x, y, w, h);
  • The second branch (cls layer) uses $2k$ kernels of size $1\times1\times256$ and outputs $2k$ numbers, where 2 corresponds to the binary question of whether the region contains an object, namely (object, non-object).
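As a quick sanity check on the sizes above, the following sketch tabulates the output shapes and weight counts of the RPN head for a feature map of a given size. It assumes the $3\times3$ convolution is padded so that height and width are preserved and that the input also has 256 channels; the function name `rpn_head_shapes` and the 14×14 example are hypothetical, and biases are ignored.

```python
def rpn_head_shapes(feat_h, feat_w, k, channels=256):
    """Return (output shapes, weight counts) of the RPN head, shapes as (H, W, C)."""
    shapes = {
        "shared_3x3": (feat_h, feat_w, channels),  # one 256-d vector per position
        "reg": (feat_h, feat_w, 4 * k),            # (x, y, w, h) for each of k anchors
        "cls": (feat_h, feat_w, 2 * k),            # (object, non-object) for each anchor
    }
    weights = {
        "shared_3x3": 3 * 3 * channels * channels,  # 256 kernels of 3x3x256
        "reg": 4 * k * channels,                    # 4k kernels of 1x1x256
        "cls": 2 * k * channels,                    # 2k kernels of 1x1x256
    }
    return shapes, weights


shapes, weights = rpn_head_shapes(14, 14, k=9)
print(shapes["reg"], shapes["cls"])  # (14, 14, 36) (14, 14, 18)
```

With k = 9 anchors per position, the reg branch therefore produces 36 numbers and the cls branch 18 numbers at every location.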

To add FPN to RPN, all that is needed is more levels. The backbone of RPN in FPN is ResNet, whose network structure is shown below:

ResNet comes in depths from 18 to 152 layers, but regardless of depth the downsampling positions and factors are the same: downsampling happens in the first convolution and pooling layers and then by a factor of 2 at each subsequent residual group. The outputs of conv2_x, conv3_x, conv4_x and conv5_x are defined as $\{C_{2},C_{3},C_{4},C_{5}\}$, with total downsampling factors of 4, 8, 16 and 32. The FPN pyramid has one level for each of these four stages: the top level comes from the output of $C_{5}$ and the bottom level from the output of $C_{2}$.

With the hierarchy in place, what remains is choosing the anchors. Faster R-CNN uses 9 anchors, three aspect ratios times three scales, all on the same feature map. FPN instead adopts a strategy similar to SSD: different scales are assigned to different levels rather than all sitting on one level. The anchor areas in FPN are therefore $32^2, 64^2, 128^2, 256^2, 512^2$, the aspect ratios are still 1:1, 2:1 and 1:2, and so FPN has 15 kinds of anchors in total. But there are only four feature-map levels, so why are there five scales? Because one more stride-2 step is applied on top of the highest level (the paper calls the result $P_{6}$, a stride-2 subsampling of $P_{5}$), which doubles the receptive field again and gives the $512^2$ size.
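The assignment of scales to levels can be written out explicitly. The sketch below lists the stride and anchor side length of each level and derives the 15 anchor shapes. The level names P2 to P6 follow the paper's naming, and the convention that a ratio means height divided by width is an assumption of this example.

```python
import math

# level name -> (stride relative to the input, anchor side length)
LEVELS = {
    "P2": (4, 32),
    "P3": (8, 64),
    "P4": (16, 128),
    "P5": (32, 256),
    "P6": (64, 512),  # extra stride-2 level on top
}
RATIOS = (0.5, 1.0, 2.0)  # assumed to mean height / width


def anchor_shapes(side, ratios=RATIOS):
    """Return (width, height) pairs whose area is side**2, one per aspect ratio."""
    area = side * side
    shapes = []
    for r in ratios:
        w = math.sqrt(area / r)
        h = w * r
        shapes.append((w, h))
    return shapes


all_anchors = {name: anchor_shapes(side) for name, (_stride, side) in LEVELS.items()}
total = sum(len(shapes) for shapes in all_anchors.values())
print(total)  # 15
```

Each level keeps only 3 anchors of a single scale, so the 15 shapes come from 5 levels times 3 ratios rather than from stacking every scale on one feature map.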

Every FPN level has 3 anchors per location, and there are 4 levels plus the extra stride-2 level on top. Assuming a 448 input, the feature maps have sizes $112^2, 56^2, 28^2, 14^2, 7^2$, which already gives about 50K anchors. The input images used by FPN are larger than this, though: the largest anchor is 512, and the experiments resize the shorter side to 800, so FPN has more than 200K anchors. That is an order of magnitude more than other detectors: YOLOv3 has 10647, SSD has 8732, YOLOv2 has 845, and YOLO has only 49 class predictions and 98 predicted boxes in total. Fortunately FPN is a two-stage structure, so the number of boxes hardly matters, since they are simply filtered by a threshold. The screening strategy is the same as in Faster R-CNN, and in the end 200 to 1000 region proposals remain.
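The 50K figure can be reproduced with a short calculation. This is a minimal sketch assuming strides of 4, 8, 16, 32 and 64 for the five levels and feature-map sizes rounded up; the function name `count_anchors` and the 1333 longer side used in the second call are assumptions for illustration, not values taken from the text.

```python
import math


def count_anchors(height, width, strides=(4, 8, 16, 32, 64), per_location=3):
    """Total number of anchors over all pyramid levels for one input size."""
    total = 0
    for stride in strides:
        feat_h = math.ceil(height / stride)
        feat_w = math.ceil(width / stride)
        total += feat_h * feat_w * per_location
    return total


print(count_anchors(448, 448))          # 50127
print(count_anchors(800, 1333) > 200_000)  # True
```

For the 448 input the per-level position counts are 12544, 3136, 784, 196 and 49, which sum to 16709 positions and 50127 anchors.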

Applying FPN to Fast R-CNN

Note that this is about Fast R-CNN, not Faster R-CNN. Although RPN + Fast R-CNN = Faster R-CNN, here we treat region proposal and classification/regression prediction as two independent parts. In Fast R-CNN, every ROI that survives screening is mapped onto the last feature map. In FPN, however, there is more than one feature map, so which level should an ROI be mapped to? The paper gives a strategy that computes the level from the size of the ROI:


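$$k = \left\lfloor k_0 + \log_2\left(\sqrt{wh}/224\right) \right\rfloor$$

Here $w$ and $h$ are the width and height of the ROI, and $k_0$ is a default value of 4. Suppose the ROI is $224^2$; then $k$ equals 4, that is, the ROI is mapped to the level-4 feature map.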
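The level-assignment rule from the paper, $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$, is easy to try out on a few ROI sizes. The sketch below is a hedged illustration: clamping the result to levels 2 through 5 is an assumption added here so that very small or very large ROIs still land on an existing level, and the function name `roi_level` is hypothetical.

```python
import math


def roi_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pyramid level for an ROI of width w and height h (in input pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))  # clamping is an assumption of this sketch


print(roi_level(224, 224))  # 4
print(roi_level(112, 112))  # 3
print(roi_level(448, 448))  # 5
print(roi_level(10, 10))    # 2 (clamped)
```

Halving the ROI side length moves it one level down to a finer feature map, and doubling it moves it one level up.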

But here is a question: when this is applied to Faster R-CNN, aren't the region proposals generated from different levels in the first place? Even after screening, each one could simply be matched back to the level it came from, so does it need to be this complicated? The paper does not explain this.

With the mapping problem solved, only the final category prediction and bbox regression remain. As introduced in the R-FCN article, when Faster R-CNN uses ResNet as its backbone, the convolution features shared with RPN are not taken at the last layer but one residual group earlier, so that the sub-network that follows is not too weak. FPN does not do this, for two reasons:

  • The last residual group is already taken up by RPN;
  • A heavy sub-network would make the model slow.

So the final sub-network in FPN is very light: ROI pooling fixes the output at $7\times7$, and after the fully connected layers (two 1024-d layers in the paper) the categories and bbox regression are predicted directly. This design more or less ignores the existence of R-FCN: what R-FCN considered the main problem is not a problem at all in FPN. ╮( ̄▽  ̄)╭

FPN performance evaluation

First, the figure above compares RPN across 6 structures. (a) takes a branch only from conv4, (b) takes a branch only from conv5, and (c) is the complete FPN structure. The gain from FPN is obvious, and FPN has more than 200k anchors. (d) is an SSD-like structure that is bottom-up only, (e) has no lateral connections and relies on upsampling alone, and (f) has no multi-level output: everything is predicted from the last (finest) level, whose feature map is larger, so it has more anchors, reaching 750k. None of them does as well as (c).

The figure above compares Fast R-CNN across the 6 structures. The output of the Selective Search (SS) algorithm is replaced with the proposals output by FPN, the proposals are then held fixed, and the 6 structures are compared. (c) is still the best.

This comparison shows the feature pyramid on Faster R-CNN, that is, the combination of RPN and Fast R-CNN.

Finally, there is a comprehensive experiment that applies FPN to Faster R-CNN with ResNet-101 as the backbone. On COCO, AP50 is 59.1 and AP is 36.2. For a two-stage detector the accuracy is decent, but it is inevitably slow.

Copyright notice
Author: chaibubble. Please include the original link when reprinting, thank you.
