
Looking back at ResNet - a key step in the history of deep learning

2022-08-06 08:58:11 Mr.zwX

I first studied the ResNet architecture in 2020, when I was just getting started with deep learning. I did not pay much attention to it at the time. Only later, when I began working on research projects in CV, NLP, time series and graphs, did I realize how far-reaching ResNet's impact on the entire field of deep learning has been.

What was wrong with neural networks before ResNet?

  • When a neural network has many layers and is very deep, it is prone to vanishing or exploding gradients, a fatal problem that prevents training from proceeding.
  • When a neural network has many layers and is very deep, the model may fail to learn genuinely useful information: the fit to the target gets worse and worse, drifting farther from the target.
    As the figure above shows: the left panel is an earlier deep learning model (such as VGG). As the number of layers grows, the model gradually drifts away from the target to be fitted, so the training result gets worse and can even be worse than with a much shallower network. The right panel is what ResNet aims for: as the number of layers grows, the model steadily approaches the target. Even if the improvement slows down later on, the deviation from the target never grows.
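The first problem in the list above can be observed directly. The sketch below is not from the original post: it assumes a toy plain stack of `nn.Linear` + `nn.Sigmoid` layers (the helper `input_grad_norm`, the width and the depths are made up for illustration) and measures how much gradient still reaches the input as the stack gets deeper.

```python
import torch
import torch.nn as nn

def input_grad_norm(depth, width=16, seed=0):
    """Norm of d(sum of outputs)/d(input) for a plain stack of `depth` layers."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.Sigmoid()]
    net = nn.Sequential(*layers)
    x = torch.rand(8, width, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

# Hypothetical depths, chosen only to make the trend visible
norms = {depth: input_grad_norm(depth) for depth in (2, 10, 30)}
for depth, norm in norms.items():
    print(f"depth={depth:2d}  input gradient norm={norm:.3e}")
```

In this toy setting the gradient norm should shrink by many orders of magnitude as the depth grows, which is the vanishing-gradient behaviour described above.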

$$H(x) = F(x) + x$$

Here, $H(x)$ is the observed output of each layer, $F(x)$ is one neural network layer, and $x$ is the input to that layer (called the identity). This residual connection is also called a skip connection or a shortcut.


Explaining ResNet's effectiveness from a functional point of view
ResNet's effectiveness can be explained intuitively: even if the layer $F(x)$ learns nothing (or even learns something harmful), the model still inherits the information in the input $x$. This is what the right panel of the first figure shows: each layer of the model is guaranteed to fully contain what the previous layers have learned. As a result, as the model gets deeper it does not drift away from the target; at the very least it keeps learning on top of what came before.
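A minimal sketch of this argument, assuming the branch $F$ is a single 3x3 convolution whose weights are set to zero to mimic a layer that "learned nothing" (the class name `TinyResidual` is invented here for illustration): the block then returns its input unchanged.

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """H(x) = F(x) + x with F a single 3x3 convolution (illustrative only)."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.f(x) + x

block = TinyResidual(channels=3)
# Mimic a branch that learned nothing: F(x) = 0 for every input
nn.init.zeros_(block.f.weight)
nn.init.zeros_(block.f.bias)

x = torch.rand(2, 3, 6, 6)
h = block(x)
print(torch.equal(h, x))  # checks that the input passes through unchanged
```

Without the `+ x` term the same zeroed layer would output all zeros, so everything computed by earlier layers would be lost.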

Explaining ResNet's effectiveness from a residual point of view
The functional explanation above is intuitive, but it is not how Kaiming He's paper explains it. Since $H(x) = F(x) + x$, we have $F(x) = H(x) - x$. Here $F(x)$ is the difference between the layer's observed output and its input, which we call the "residual".
The training objective therefore changes from fitting the original mapping to fitting the residual. Fitting the residual is beneficial: even if $F(x)$ fails to learn anything useful (or even learns something harmful), the model does not gradually move away from the target; in other words, the model's deviation does not grow. The skip connection also helps avoid the vanishing and exploding gradient problems during training.
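The gradient claim can be made concrete with a short derivation. This is a sketch that is not in the original post: it assumes a scalar loss $\mathcal{L}$ (a symbol introduced here) and ignores the ReLU that is applied after the addition.

```latex
H(x) = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + I,
\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial H}\left(\frac{\partial F}{\partial x} + I\right)
  = \frac{\partial \mathcal{L}}{\partial H}\,\frac{\partial F}{\partial x}
  + \frac{\partial \mathcal{L}}{\partial H}
```

The last term reaches $x$ through the shortcut without being multiplied by any weight of $F$, so some gradient still arrives even when $\partial F / \partial x$ is close to zero.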
(This idea is very similar to Boosting in ensemble learning, for example GBDT, the gradient boosting decision tree. Both essentially fit residuals, but there is a difference: GBDT fits residuals of the label, whereas ResNet fits residuals of the feature map.)

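The figure below shows the difference between a traditional convolutional neural network and ResNet:
Depending on how the convolutions are configured, there are two main kinds of ResNet block:

  1. ResNet blocks that keep the height and width unchanged; several of these are chained together (left of the figure)
  2. ResNet blocks that halve the height and width (stride=2). The number of channels then doubles, so a Conv1x1 with the same stride is introduced on the shortcut to transform the number of channels, so that the two branches can finally be added together (right of the figure)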
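In formula form, the two cases differ only in the shortcut. A sketch in the notation used earlier, where $W_s$ (a symbol introduced here) stands for the Conv1x1 applied to the shortcut:

```latex
\text{(1) same shape:}\quad H(x) = F(x) + x
\qquad
\text{(2) halved size, doubled channels:}\quad H(x) = F(x) + W_s\,x
```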

We design a network here: it starts with a conv7x7 convolution (1 input channel to 64 channels in the code below). Each shortcut module contains two conv layers, and each resnet block contains two shortcut modules. Apart from the conv7x7 block, the first resnet block does not transform the number of channels. In the later resnet blocks, only the first shortcut module needs to double the number of channels. In the modules that double the number of channels, the shortcut has to pass through a conv1x1 convolution to transform the number of channels.

import torch.nn as nn
from torch.nn import functional as F
import torch

# Define the shortcut module
class Residual(nn.Module):
    def __init__(self, input_channels, num_channel, use_conv1x1=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(
            in_channels=input_channels, out_channels=num_channel, kernel_size=3, stride=stride, padding=1
        )
        self.conv2 = nn.Conv2d(
            in_channels=num_channel, out_channels=num_channel, kernel_size=3, stride=1, padding=1
        )
        if use_conv1x1:
            # 1x1 convolution on the shortcut so that x matches y in channels and spatial size
            self.conv3 = nn.Conv2d(
                in_channels=input_channels, out_channels=num_channel, kernel_size=1, stride=stride
            )
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_features=num_channel)
        self.bn2 = nn.BatchNorm2d(num_features=num_channel)
        # BatchNorm has its own parameters, so unlike ReLU a single instance cannot be shared
        self.relu = nn.ReLU(inplace=True)  # in-place: no extra memory is allocated for the output

    def forward(self, x):
        y = self.conv1(x)
        y = self.bn1(y)
        y = self.relu(y)
        y = self.conv2(y)
        y = self.bn2(y)
        if self.conv3 is not None:
            x = self.conv3(x)
        y += x
        return F.relu(y)

## residual test
resblk1 = Residual(3, 3, use_conv1x1=False, stride=1)
x = torch.rand(4, 3, 6, 6)
y = resblk1(x)
print(y.shape)

# Typically the feature map's height and width are halved and the channels doubled
resblk2 = Residual(3, 6, use_conv1x1=True, stride=2)
x = torch.rand(4, 3, 6, 6)
y = resblk2(x)
print(y.shape)
## residual test

# Define a resnet block
def resnet_block(input_channels, num_channels, num_residuals, first_block=False):
    blks = []
    for i in range(num_residuals):
        # First module of a resnet block (it changes the channel count), and the block is not the first one
        if i == 0 and not first_block:
            blks.append(
                Residual(input_channels=input_channels, num_channel=num_channels, use_conv1x1=True, stride=2)
            )
        else:
            blks.append(Residual(input_channels=num_channels, num_channel=num_channels, use_conv1x1=False, stride=1))
    return blks

b1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=1),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2, first_block=False))
b4 = nn.Sequential(*resnet_block(128, 256, 2, first_block=False))
b5 = nn.Sequential(*resnet_block(256, 512, 2, first_block=False))
# The * here unpacks the list into separate arguments

net = nn.Sequential(
    b1, b2, b3, b4, b5,
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),  # (N, 512, 1, 1) -> (N, 512) before the linear layer
    nn.Linear(512, 10)
)

x = torch.rand((1, 1, 224, 224))
for i, layer in enumerate(net):
    x = layer(x)
    print('layer:', i, layer.__class__.__name__, 'output shape:', x.shape)

Output:

torch.Size([4, 3, 6, 6])
torch.Size([4, 6, 3, 3])
layer: 0 Sequential output shape: torch.Size([1, 64, 55, 55])
layer: 1 Sequential output shape: torch.Size([1, 64, 55, 55])
layer: 2 Sequential output shape: torch.Size([1, 128, 28, 28])
layer: 3 Sequential output shape: torch.Size([1, 256, 14, 14])
layer: 4 Sequential output shape: torch.Size([1, 512, 7, 7])
layer: 5 AdaptiveAvgPool2d output shape: torch.Size([1, 512, 1, 1])
layer: 6 Flatten output shape: torch.Size([1, 512])
layer: 7 Linear output shape: torch.Size([1, 10])
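
The spatial sizes printed above can be checked by hand with the usual output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. A small sketch in plain Python; the helper name `out_size` is made up for this illustration, and it assumes dilation 1 and floor rounding:

```python
def out_size(n, kernel, stride, padding):
    """Output height/width of a conv or pooling layer (dilation 1, floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

n = 224
n = out_size(n, kernel=7, stride=2, padding=1)   # conv7x7 in b1
n = out_size(n, kernel=3, stride=2, padding=1)   # max pooling in b1
sizes = [n]                                      # b1 (b2 keeps the same size)
for _ in range(3):                               # b3, b4, b5 each start with a stride-2 module
    n = out_size(n, kernel=3, stride=2, padding=1)
    sizes.append(n)
print(sizes)  # [55, 28, 14, 7]
```

The same formula explains the residual test: a 6x6 input through a stride-2 3x3 convolution with padding 1 gives (6 + 2 - 3) // 2 + 1 = 3.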

In addition to the comments in the code, keep the following in mind when programming:

  • F.relu() is a function call, generally used on the final output in the forward function; nn.ReLU() is a module, typically used when defining the network.
  • A cosine learning-rate schedule usually works better than a fixed learning rate.
  • Can accuracy on the test set be higher than on the training set? It actually can: if the training set uses heavy data augmentation, the test set may be more accurate, because the training set contains noise.
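
The first two notes can be sketched as follows. The snippet uses `torch.optim.lr_scheduler.CosineAnnealingLR` as one possible cosine schedule; the post does not say which scheduler was used, and the model, learning rate, epoch count and dummy loss here are placeholders.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# F.relu (function) and nn.ReLU (module) compute the same thing
x = torch.randn(4, 8)
same = torch.equal(F.relu(x), nn.ReLU()(x))
print(same)

# Cosine learning-rate schedule on a placeholder model
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

lrs = []
for epoch in range(epochs):
    lrs.append(optimizer.param_groups[0]["lr"])     # learning rate used in this epoch
    optimizer.zero_grad()
    loss = model(torch.randn(16, 8)).pow(2).mean()  # dummy loss, for illustration only
    loss.backward()
    optimizer.step()
    scheduler.step()                                # one scheduler step per epoch
print([round(lr, 4) for lr in lrs])
```

The recorded learning rates start at the initial value and decay along a half cosine towards zero over `T_max` epochs.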

Copyright notice
Author: Mr.zwX. Please include the original link when reprinting. Thank you.
