# [Intensive Reading] Object Detection Series (XI) RetinaNet: The Pinnacle of One-Stage Detectors

2022-05-15 07:34:46

# Brief introduction: the pinnacle of one-stage detectors

Before RetinaNet, the common pattern in object detection was that two-stage methods, such as the classic Faster R-CNN, R-FCN and FPN, were more accurate but slower, while one-stage methods, such as the classic YOLOv2, YOLOv3 and SSD, were faster but less accurate. This is the natural consequence of the different ideas behind the two families. RetinaNet improved on this situation to a certain extent: it lets a one-stage method reach higher accuracy than two-stage methods while taking less time. The RetinaNet paper is "Focal Loss for Dense Object Detection".

# RetinaNet principle

## Design concept

Two-stage methods complete detection in two steps: first they generate region proposals, then they classify each proposal and refine it by regression. One-stage methods perform classification and bbox regression in a single step. This difference is what produces the characteristics described above. The speed gap is easy to understand: a two-stage method has two steps, and the second-stage subnetwork has to run repeatedly, once per proposal, so it is inevitably slower.

But what is the essential reason that two-stage methods are accurate while one-stage methods fall behind? It is the severe imbalance between positive and negative samples caused by the anchor boxes. To give a sense of how unbalanced things are: YOLOv2 has 845 anchors and SSD has 8732. More anchors provide more hypotheses, which matters a great deal for the model's recall (especially on small objects), but an ordinary natural image does not contain anywhere near that many objects, so the downstream task is bound to suffer from sample imbalance. And because the method is one-stage, the conflict between the benefit of more anchors and the more severe sample imbalance cannot be resolved structurally.

Then why are two-stage methods unaffected? They have just as many anchors, or even more: FPN has about 200k, which is not even the same order of magnitude as SSD's roughly 8.7k, let alone YOLOv2's 845. The reason is that, with a two-stage structure, the region proposals can be filtered however one likes. FPN removes the severe positive/negative imbalance in three ways:

• FPN treats anchors whose IoU with the ground truth is above 0.7 as positive samples and those whose IoU is below 0.3 as negative samples, which widens the gap between positives and negatives;
• The final output of FPN's RPN is limited to between 1000 and 2000 proposals, which controls the number of samples;
• When FPN assembles each training minibatch, the ratio of positive to negative samples is 1:3.
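The first and third measures above can be sketched in a few lines. This is a minimal illustration, not FPN's actual implementation: the function names `assign_labels` and `sample_minibatch`, the batch size and the toy IoU values are assumptions made for the example; only the 0.7 / 0.3 thresholds and the 1:3 ratio come from the text.

```python
import random

def assign_labels(max_ious, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor by its best IoU with any ground-truth box.

    Returns 1 (positive), 0 (negative) or -1 (ignored, between thresholds).
    """
    labels = []
    for iou in max_ious:
        if iou > pos_thr:
            labels.append(1)
        elif iou < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    return labels

def sample_minibatch(labels, batch_size=256, pos_fraction=0.25, seed=0):
    """Sample anchor indices; positives fill at most pos_fraction of the batch."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    num_pos = min(len(pos), int(batch_size * pos_fraction))
    num_neg = min(len(neg), batch_size - num_pos)
    return rng.sample(pos, num_pos), rng.sample(neg, num_neg)

# Toy example: 10 good anchors, 5 ambiguous ones, 1000 background anchors.
ious = [0.8] * 10 + [0.5] * 5 + [0.05] * 1000
labels = assign_labels(ious)
pos_idx, neg_idx = sample_minibatch(labels, batch_size=32)
print(len(pos_idx), len(neg_idx))  # 8 positives and 24 negatives, i.e. 1:3
```

A one-stage detector has no such sampling step, which is why the imbalance has to be handled somewhere else.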

Precisely because a one-stage method has no way to re-screen the samples, the only option is to improve something else to reduce the impact. That something is the loss function, because sample imbalance ultimately shows up in the loss and the optimization. RetinaNet therefore proposes focal loss, which addresses the problem that, when positive and negative sample regions are extremely unbalanced, the detection loss is easily dominated by the large number of negative samples.
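To see why a flood of negatives matters, here is a rough numerical sketch using ordinary cross-entropy. The counts and probabilities are made up for illustration (8700 easy background anchors predicted at p = 0.01 and 10 hard positives predicted at p = 0.1); they are not measurements from the paper.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for one sample; p is the predicted P(y = 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

# Hypothetical image: many easy negatives, a handful of hard positives.
num_easy_neg, p_easy_neg = 8700, 0.01   # background anchors, confidently rejected
num_hard_pos, p_hard_pos = 10, 0.1      # objects the model still gets wrong

neg_total = num_easy_neg * cross_entropy(p_easy_neg, 0)
pos_total = num_hard_pos * cross_entropy(p_hard_pos, 1)

print(f"easy negatives: {neg_total:.1f}")
print(f"hard positives: {pos_total:.1f}")
```

Each easy negative contributes very little, yet in this toy setting their sum (roughly 87) still outweighs the hard positives (roughly 23).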

## Focal loss

Focal loss is an improved cross-entropy loss. For binary classification, with label $y\in\{0,1\}$ and predicted positive-class probability $p$, the ordinary cross-entropy loss looks like this:

$$CE(p,y)=-y\log(p)-(1-y)\log(1-p)$$

Put another way:

$$CE(p,y)=\begin{cases}-\log(p) & \text{if } y=1\\ -\log(1-p) & \text{otherwise}\end{cases}$$

Define $p_{t}$ as:

$$p_{t}=\begin{cases}p & \text{if } y=1\\ 1-p & \text{otherwise}\end{cases}$$

Then $CE(p,y)=CE(p_{t})=-\log(p_{t})$. The usual way to balance the cross-entropy loss is to introduce a factor $\alpha\in[0,1]$ for positive examples, so that the corresponding factor for negative examples is $1-\alpha$. One point needs a little care: the paper defines $\alpha_{t}$ in the same way as it defines $p_{t}$, that is:

$$\alpha_{t}=\begin{cases}\alpha & \text{if } y=1\\ 1-\alpha & \text{otherwise}\end{cases}$$

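With this $\alpha_{t}$, the balanced cross-entropy loss can be written as:

$$CE(p_{t})=-\alpha_{t}\log(p_{t})$$

Expanded, it is actually:

$$CE(p,y)=\begin{cases}-\alpha\log(p) & \text{if } y=1\\ -(1-\alpha)\log(1-p) & \text{otherwise}\end{cases}$$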
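As a minimal sketch, $p_{t}$, $\alpha_{t}$ and the balanced cross-entropy translate directly into plain Python. The function names are chosen for this example, and labels are assumed to be 0 or 1.

```python
import math

def p_t(p, y):
    """Probability the model assigns to the true class."""
    return p if y == 1 else 1.0 - p

def alpha_t(alpha, y):
    """Class weight: alpha for positives, 1 - alpha for negatives."""
    return alpha if y == 1 else 1.0 - alpha

def balanced_ce(p, y, alpha=0.5):
    """Alpha-balanced cross-entropy: -alpha_t * log(p_t)."""
    return -alpha_t(alpha, y) * math.log(p_t(p, y))

# With alpha = 0.75, a positive is weighted three times as heavily as a
# negative that is predicted equally well (p_t = 0.8 in both cases).
print(balanced_ce(0.8, 1, alpha=0.75))
print(balanced_ce(0.2, 0, alpha=0.75))
```

Note that the weight depends only on the class, not on how well the sample is predicted, which is the limitation discussed next.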

However, $\alpha$ is a fixed coefficient: it cannot tell which samples are hard and which are easy. So focal loss improves on balanced cross-entropy by introducing the factor $(1-p_{t})^{\gamma}$, where $\gamma$ is a hyperparameter. The expression for focal loss is therefore:

$$FL(p_{t})=-(1-p_{t})^{\gamma}\log(p_{t})$$

So why is this form better than balanced cross-entropy? Suppose $\gamma$ is 1, so that the coefficient becomes $1-p_{t}$. Expanding $FL(p_{t})$ in the same way as above gives:

$$FL(p,y)=\begin{cases}-(1-p)\log(p) & \text{if } y=1\\ -p\log(1-p) & \text{otherwise}\end{cases}$$

Because a softmax (or sigmoid) is applied before the cross-entropy, $p$ lies between 0 and 1, so the factor is always positive and does not change the sign of the original loss. Here is how it distinguishes hard samples from easy ones, where $p$ is the predicted probability of the positive class:

• For a positive example that the model finds easy, $p$ approaches 1 and $1-p$ approaches 0, so the loss shrinks; for a hard one, the loss grows.
• For a negative example that the model finds easy, $p$ approaches 0, so the loss shrinks; for a hard one, the loss grows.
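As a quick worked example with $\gamma=1$ (the probabilities are chosen purely for illustration, and the natural logarithm is assumed), compare an easy positive predicted at $p=0.9$ with a hard positive predicted at $p=0.1$:

$$\begin{aligned}
\text{easy } (p=0.9):&\quad CE=-\ln 0.9\approx 0.105, &\quad FL&=(1-0.9)\cdot 0.105\approx 0.011\\
\text{hard } (p=0.1):&\quad CE=-\ln 0.1\approx 2.303, &\quad FL&=(1-0.1)\cdot 2.303\approx 2.072
\end{aligned}$$

Under plain cross-entropy the hard sample's loss is about 22 times the easy one's; with the extra factor it is about 197 times.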

This makes the loss reflect how hard a sample is for the model. In addition, focal loss has the hyperparameter $\gamma$, which turns the factor into a power. Because the base is always less than 1, the power widens the originally linear gap in $1-p_{t}$. For example, if $1-p_{t}$ is 0.1 and 0.9, then with $\gamma=2$ these become 0.01 and 0.81: a 9-fold difference becomes an 81-fold one. This way of suppressing easy samples and emphasizing hard ones is in fact the same idea as the IoU>0.7 and IoU<0.3 thresholds. With this factor, hard and easy samples are distinguished, but the problem of too many negative samples does not seem to be solved, so focal loss finally adds the weighting from balanced cross-entropy back in. The form of focal loss used in the experiments is:

$$FL(p_{t})=-\alpha_{t}(1-p_{t})^{\gamma}\log(p_{t})$$
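The final form, $-\alpha_{t}(1-p_{t})^{\gamma}\log(p_{t})$, can be sketched in plain Python as follows. This is a per-sample illustration with labels 0 or 1, not the paper's implementation; the function name `focal_loss` is chosen for this example, and the defaults $\alpha=0.25$, $\gamma=2$ are the values this article reports as working best.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# The modulating factor alone: 1 - p_t of 0.1 vs 0.9 with gamma = 2.
easy_factor = 0.1 ** 2
hard_factor = 0.9 ** 2
print(hard_factor / easy_factor)  # about 81, up from 9 when gamma = 1

# With gamma = 0 the focal loss reduces to alpha-balanced cross-entropy,
# so the two printed values agree.
print(focal_loss(0.9, 1, alpha=0.25, gamma=0.0), -0.25 * math.log(0.9))
```

Setting `gamma=0.0` is a convenient sanity check when experimenting, since it should reproduce the balanced cross-entropy exactly.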

The paper's experiments show how $\alpha$ and $\gamma$ were chosen. In (a), for the balanced cross-entropy loss, $\alpha=0.75$ works best, which agrees with the analysis above: $\alpha>0.5$ suppresses negative samples. With focal loss, however, $\alpha=0.25$ and $\gamma=2$ work best, possibly because introducing $(1-p_{t})^{\gamma}$ affects the choice of $\alpha$. Focal loss is the most important part of RetinaNet. The network structure, the anchors and the rest of the loss had all been used before RetinaNet, so we only mention them briefly.

## Network structure

RetinaNet's network structure is really just FPN, except that FPN is used to build a one-stage structure instead of a two-stage one, so the second stage is omitted. In the YOLO article, we discussed the difference between RPN and YOLO: once an RPN no longer only classifies whether something is an object but also decides which class the object belongs to, that RPN can carry out the whole detection task by itself. RetinaNet uses this idea. It is equivalent to discarding the Fast R-CNN part of FPN and changing FPN's RPN to predict categories directly. RetinaNet therefore also has multiple subnetworks, corresponding to the levels of the feature pyramid. I won't go into further detail here.

## Anchor Box

RetinaNet's strategy for choosing anchor boxes is similar to FPN's. There are 5 feature maps at different scales, with anchor areas ranging from $32^{2}$ to $512^{2}$, and each level has three aspect ratios, so FPN has 15 kinds of anchors. RetinaNet adds one more factor on top of this: each level's feature map gets three scales, $\{2^{0},2^{\frac{1}{3}},2^{\frac{2}{3}}\}$ times the base scale of that level, so RetinaNet ends up with 45 kinds of anchors. Thanks to focal loss, the choice of anchors becomes much freer: there is no need to worry about having too many. ╮(￣▽ ￣)╭
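The count of 45 can be checked with a short sketch. The 32 to 512 range and the three scale factors come from the text; the intermediate base sizes (doubling per level) and the concrete aspect ratios 1:2, 1:1 and 2:1 follow the RetinaNet paper, while the function name and the width/height convention (ratio = height / width, area preserved) are assumptions of this example.

```python
import math

BASE_SIZES = [32, 64, 128, 256, 512]            # one per pyramid level
ASPECT_RATIOS = [0.5, 1.0, 2.0]                 # height / width
SCALES = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]   # extra scales per level

def anchor_shapes():
    """Return (level_size, width, height) for every anchor shape."""
    shapes = []
    for size in BASE_SIZES:
        for scale in SCALES:
            side = size * scale               # side of the equivalent square
            for ratio in ASPECT_RATIOS:
                w = side / math.sqrt(ratio)   # keeps the area at side**2
                h = side * math.sqrt(ratio)
                shapes.append((size, w, h))
    return shapes

shapes = anchor_shapes()
print(len(shapes))                               # 5 levels * 3 scales * 3 ratios = 45
print(len([s for s in shapes if s[0] == 32]))    # 9 shapes on each level
```

In other words, every location on a given level carries 9 anchors, and the 45 comes from repeating that across the 5 levels.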

# RetinaNet Performance evaluation

Overall, with a ResNet-101 backbone and an input resolution of 800, RetinaNet's AP exceeds FPN's. Although it is slower than FPN, this is the first time a one-stage model has surpassed a two-stage model in AP when using the same input resolution and backbone. When the resolution is reduced to 500, RetinaNet still performs excellently.