[intensive reading] object detection series (10) FPN: introducing multi-scale with feature pyramid
2022-05-15 07:34:28【chaibubble】
Object detection series:
- (1) R-CNN: the pioneering work of CNN-based object detection
- (2) SPP-Net: sharing convolutional computation
- (3) Fast R-CNN: happy end-to-end training
- (4) Faster R-CNN: Fast R-CNN with an RPN
- (5) YOLO: another way to do object detection
- (6) SSD: balancing efficiency and accuracy
- (7) R-FCN: a position-sensitive Faster R-CNN
- (8) YOLOv2: better, faster, stronger
- (9) YOLOv3: drawing on many ideas to form its own
- (10) FPN: introducing multi-scale with the feature pyramid
- (11) RetinaNet: the pinnacle of one-stage detectors
- (12) CornerNet: the beginning of anchor-free detection
- (13) CenterNet: no anchor, no NMS
- (14) FCOS: object detection handled like image segmentation
Object detection extended series:
- (1) Selective Search: the selective search algorithm
- (2) OHEM: online hard example mining
- (3) The differences among the loss functions of Faster R-CNN, YOLO, SSD, YOLOv2 and YOLOv3
Preface: introducing multi-scale with the feature pyramid
The SSD algorithm demonstrated the effectiveness of multi-layer prediction branches for object detection. Before that, two-stage detectors had been optimized and improved over several generations, but none of them used a multi-scale approach. FPN finally brought multi-scale into the two-stage pipeline: building on SSD's multi-layer branching idea, it proposed the feature pyramid network. The FPN paper is "Feature Pyramid Networks for Object Detection".
FPN principle
Design concept
The concept of a feature pyramid has existed for a long time. For example, back when object detection still relied on sliding windows, image pyramids were already used to cope with changes in object scale.
The figure above shows four pyramid-related designs:
- Figure (a) is an image pyramid: the pyramid is built simply by rescaling the image, and features are then extracted from each scale independently. This is very time-consuming, and of course it has nothing to do with CNNs;
- Figure (b) is the conventional convolutional pipeline: each level of the pyramid is no longer an image but an extracted feature map, and prediction is made only on the last level. YOLO and Faster R-CNN work this way, using only a single level of information;
- Figure (c) shares the same forward pass as (b), but at prediction time multiple branches are pulled out from different levels, which introduces the notion of multi-scale, e.g. SSD;
- Figure (d) has the same bottom-up pass as the previous two, but adds a reverse, top-down pass, with lateral connections between the two passes. The final predictions are made on the feature maps of the top-down pass. The authors call this structure the feature pyramid network. Let's look at it in detail.
Advantages of the feature pyramid
SSD was the first to bring a multi-layer feature-map branching structure into object detection, and on the surface the feature pyramid is the same kind of multi-layer branching as SSD's. So what does the feature pyramid actually add?
The figure above is the SSD architecture. First, SSD's shallowest branch is not shallow enough: its earliest branch is conv4_3 of VGG-16, which is already near the end of VGG-16's convolutional stack, and SSD's remaining branches are pulled from the extra layers appended after VGG-16. Earlier, higher-resolution feature maps have been shown to be useful for small-object detection, but SSD does not use them. So why not simply pull branches from earlier feature maps in an SSD-like structure? Doing so directly does not work well, because earlier feature maps carry too little semantic information. What is needed is a structure that can exploit the earlier, high-resolution feature maps, and that structure is the feature pyramid. Compared with SSD's multi-layer branches, the feature pyramid differs in two ways:
- a top-down, level-by-level upsampling path;
- lateral 1×1 convolution connections.
The top-down path upsamples level by level starting from the topmost feature map (the paper uses nearest-neighbor upsampling; a deconvolution would serve the same purpose). Upsampling restores the spatial size of the top-level feature map and propagates the semantic information extracted at the top back down: that information suppresses the background and places foreground objects at their corresponding positions, which is also why such decoders play an important role in image segmentation. But this kind of restoration inevitably loses information. It is like shrinking an image by a factor of 2 and then scaling it back: the result is blurred. Small objects that have already vanished from the top-level feature map due to downsampling will not reappear through upsampling. Therefore the upsampled maps must be combined with the corresponding feature maps from the bottom-up path, and FPN fuses the two: the lateral 1×1 convolution adjusts the channel count of the bottom-up feature map so that the two maps can be added element-wise. It looks like this:
After fusion, and before the prediction output, each level also passes through a 3×3 convolution without downsampling, to reduce the aliasing caused by merging the upsampled map with the original feature map.
With this structure, FPN combines high-resolution feature maps carrying low-level semantics with low-resolution feature maps carrying high-level semantics.
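The fusion described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code; the ResNet-style channel counts (256/512/1024/2048) and the 256 output channels are assumptions of the sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Minimal FPN neck: lateral 1x1 convs, top-down upsampling, 3x3 smoothing."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # lateral 1x1 convs bring every backbone map to a common channel count
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convs after fusion reduce the aliasing of the upsampled maps
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats = [C2, C3, C4, C5], finest first
        p = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(p) - 1, 0, -1):  # top-down pass with element-wise add
            p[i - 1] = p[i - 1] + F.interpolate(
                p[i], size=p[i - 1].shape[-2:], mode="nearest")
        return [s(m) for s, m in zip(self.smooth, p)]  # [P2, P3, P4, P5]

# toy check with ResNet-style channels and a 224x224-like input
feats = [torch.randn(1, c, s, s)
         for c, s in zip((256, 512, 1024, 2048), (56, 28, 14, 7))]
pyramid = FPN()(feats)
```

Every output level has the same channel count, which is what lets a single prediction head be shared across all levels later.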
Speaking of which, if you rotate the feature pyramid figure 180 degrees, turning bottom-up into top-down, you will find that it is clearly a U-Net. ╮( ̄▽ ̄)╭ But this also proves a point: upsampling decoders and lateral connections work in semantic segmentation, and they work in object detection too.
FPN applied to RPN
To apply FPN to RPN for region proposal, first recall the conventional RPN, shown in the figure below. It has only a single feature map; RPN slides 256 shared 3×3×256 convolution kernels over it, so each sliding position yields a 1×1×256 feature (the 256-d in the figure), which then feeds two branches:
- the first branch (the reg layer) convolves the 256-d feature with 4k 1×1×256 kernels, outputting 4k numbers, where 4 is the number of parameters of a proposal box, i.e. (x, y, w, h);
- the second branch (the cls layer) convolves it with 2k 1×1×256 kernels, outputting 2k numbers, where 2 corresponds to the binary object / non-object classification.
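A minimal PyTorch sketch of this head, assuming k = 3 anchors per position (the per-level count FPN uses); in FPN the same head is shared across all pyramid levels:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Shared 3x3 conv followed by the two 1x1 branches described above."""
    def __init__(self, in_channels=256, k=3):  # k anchors per sliding position
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # object / non-object
        self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)  # (x, y, w, h) per anchor

    def forward(self, x):
        h = torch.relu(self.shared(x))
        return self.cls(h), self.reg(h)

head = RPNHead()
scores, deltas = head(torch.randn(1, 256, 50, 50))  # one pyramid level
```

With FPN, this same module is simply applied to each level of the pyramid in turn.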
To combine FPN with RPN, it is enough to attach this head to every pyramid level. In FPN, the RPN backbone is ResNet; below is the ResNet architecture:
ResNet ranges from 18 to 152 layers, but regardless of depth, the downsampling positions and factors are the same: the first convolution performs one downsampling, and each subsequent residual group downsamples by 2 again. The outputs after conv2_x, conv3_x, conv4_x and conv5_x are denoted {C2, C3, C4, C5}, with downsampling factors of 4, 8, 16 and 32 respectively, and these four levels form the pyramid: the top is the output after C5 and the bottom is the output of C2. With the hierarchy in place, what remains is how to choose anchors. Faster R-CNN uses 9 kinds of anchors, three aspect ratios at three scales, all on a single feature map. FPN instead adopts a strategy similar to SSD's: different scales are assigned to different levels rather than stacked on one level. FPN's anchor areas are 32^2, 64^2, 128^2, 256^2, 512^2, and the aspect ratios are still 1:1, 2:1, 1:2, so FPN uses 15 kinds of anchors in total. But there is a question: the pyramid has only four levels, so why are there 5 scales? Because one extra level is added on top by a stride-2 subsampling of the top-level feature map, which doubles the receptive field once more and yields the 512^2 scale.
Each FPN level has 3 anchors, over 4 pyramid levels plus the extra stride-2 level on top. Suppose the input is 448×448: the per-level feature maps are 112^2, 56^2, 28^2, 14^2, 7^2, which already gives about 50k anchors. But FPN's inputs are larger than this: the largest anchor covers 512 pixels, and the paper resizes the shorter side to 800, so FPN ends up with over 200k anchors. That is a whole order of magnitude above the one-stage detectors: YOLOv3 has 10,647, SSD has 8,732, YOLOv2 has 845, and the original YOLO has only 49 cell-level class predictions and 98 predicted boxes in total. Fortunately, FPN is a two-stage structure, so the number of boxes hardly matters: they are simply filtered by score threshold, with the same screening strategy as Faster R-CNN, leaving 200-1000 region proposals in the end.
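The counting above can be reproduced in a couple of lines (the 800×1333 input is an assumed typical COCO shape for an 800-shorter-side resize, not a number from the paper):

```python
import math

# 3 anchors per location on five pyramid levels with strides 4, 8, 16, 32, 64
def count_anchors(h, w, strides=(4, 8, 16, 32, 64), per_loc=3):
    return sum(math.ceil(h / s) * math.ceil(w / s) * per_loc for s in strides)

print(count_anchors(448, 448))    # about 50k for the 448x448 toy case
print(count_anchors(800, 1333))   # over 200k for an 800-shorter-side image
```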
FPN applied to Fast R-CNN
Note that this section applies FPN to Fast R-CNN, not Faster R-CNN. Although RPN + Fast R-CNN = Faster R-CNN, here region proposal and the classification/regression prediction are treated as two independent parts. In Fast R-CNN, every ROI (region proposal box) that survives screening is mapped onto the single last feature map; but FPN has more than one feature map, so which level should an ROI be mapped to? The paper gives a strategy that computes the target level k from the ROI's size:

k = ⌊k_0 + log_2(√(wh) / 224)⌋

Here k_0 is a default value of 4 and 224 is the canonical ImageNet pre-training size. If the ROI area is 224^2, then k = 4 and the ROI is assigned to the P_4 feature map; larger ROIs map to coarser levels and smaller ROIs to finer ones.
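The rule can be written down directly; the clamp to [2, 5] matches the pyramid levels the Fast R-CNN head actually sees:

```python
import math

# k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to the available levels P2..P5
def roi_level(w, h, k0=4, k_min=2, k_max=5):
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

print(roi_level(224, 224))  # -> 4, the canonical case
print(roi_level(112, 112))  # -> 3, a smaller ROI goes to a finer level
print(roi_level(448, 448))  # -> 5, a larger ROI goes to a coarser level
```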
But here is a question: if this were applied to Faster R-CNN, the region proposals are generated from specific pyramid levels in the first place, and even after screening each proposal could be traced back to its original level — does the mapping need to be this complicated? The paper does not explain this. With the mapping problem solved, all that remains is predicting the final categories and regressing the bbox. In the R-FCN article we mentioned that when Faster R-CNN takes ResNet as its backbone, the convolution shared with the RPN stops one residual group before the last, so that the per-ROI sub-network is not too weak. FPN does not do this, for two reasons:
- the last residual group is already taken up by the RPN;
- a heavy per-ROI sub-network would make the model too slow.
So FPN's per-ROI sub-network is very light: ROI pooling fixes the output to 7×7, and after two fully connected layers the categories and the bbox regression are predicted directly. This operation essentially ignores the existence of R-FCN: the problem R-FCN worried about is simply not a problem in FPN. ╮( ̄▽ ̄)╭
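This light per-ROI head can be sketched as follows. The 1024-d hidden layers follow the paper; num_classes = 81 (80 COCO classes plus background) and the class-specific box regression are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Per-ROI head: flatten the 7x7 pooled feature, two fc layers, then predict."""
    def __init__(self, in_channels=256, num_classes=81, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(in_channels * 7 * 7, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.cls = nn.Linear(hidden, num_classes)      # class scores
        self.box = nn.Linear(hidden, 4 * num_classes)  # per-class box deltas

    def forward(self, rois):  # rois: (N, 256, 7, 7) pooled features
        h = torch.relu(self.fc1(rois.flatten(1)))
        h = torch.relu(self.fc2(h))
        return self.cls(h), self.box(h)

head = FastRCNNHead()
cls_scores, box_deltas = head(torch.randn(8, 256, 7, 7))  # 8 ROIs
```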
FPN Performance evaluation
First, the figure above compares 6 RPN structures: (a) takes a branch only from conv4, (b) only from conv5, and (c) is the full FPN structure. FPN's improvement is obvious, and its region proposals exceed 200k. (d) is an SSD-like, bottom-up-only structure; (e) has no lateral connections and relies only on upsampling; (f) has no multi-level output and predicts everything from the single finest feature map — since that map is large, the anchors are more numerous, reaching 750k. None of them matches (c).
The figure above compares 6 Fast R-CNN structures: the Selective Search output is replaced by FPN's proposals, which are then held fixed so that only the feature side varies across the 6 structures; (c) again gives the best result.
This comparison is of the feature pyramid inside Faster R-CNN, i.e. the combination of RPN and Fast R-CNN.
Finally, the comprehensive experiment: applying FPN to Faster R-CNN with a ResNet-101 backbone reaches AP50 = 59.1 and AP = 36.2 on COCO. For a two-stage detector the accuracy is respectable; the inevitable cost is speed.
copyright notice
author[chaibubble]. Please keep the original link when reprinting, thank you.
https://en.chowdera.com/2022/131/202205102135065490.html