Network depth is a crucial factor in a model's performance, but simply stacking more layers leads to the problem of vanishing/exploding gradients. This problem has been largely addressed with normalized initialization and intermediate normalization layers. However, a further problem of degradation remains: as depth increases, accuracy saturates and then degrades rapidly, and this degradation is not caused by overfitting, since training error also increases. ResNets address the degradation problem by introducing residual connections.
Layers are reformulated as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. The reformulation is motivated by the observation that if more layers are added to a given model, the new model should have a training error no greater than the original model, since the added layers could simply be identity mappings. The degradation problem, however, suggests that solvers struggle to find such solutions. Residual connections improve optimization by making these identity mappings easy to represent. The authors hypothesize that it is easier for the solver to find the desired mapping with reference to an identity mapping than to learn it as an entirely new, unreferenced function.
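In the paper's notation, if H(x) denotes the desired underlying mapping of a few stacked layers, those layers are instead trained to fit the residual F(x) = H(x) - x, so a building block (with W_i the weights of the skipped layers) computes:

```latex
% Residual reformulation: the stacked layers learn F, not H
\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x},
\qquad
\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}
```

Recovering an identity mapping then only requires driving F toward zero, which is presumed easier for the solver than fitting H(x) = x directly with a stack of nonlinear layers.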
The new formulation adds "shortcut connections", but introduces neither extra parameters nor additional computational complexity, which is quite convenient. Shortcut connections typically skip two or three layers (or even more), but they show no improvement when only a single layer is skipped. The skipped layers can be either fully-connected or convolutional layers.
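As an illustration, below is a minimal sketch of such a building block in PyTorch. This is not the authors' original implementation; the class name and channel count are illustrative, and the sketch assumes the input and output have the same shape so the identity shortcut can be added directly.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers whose output F(x) is added to the identity shortcut x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                      # shortcut carries the input unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))   # F(x): the residual to be learned
        out = out + identity              # element-wise addition, no extra parameters
        return self.relu(out)

# Usage: the block maps a feature map to another of the same shape.
x = torch.randn(1, 64, 56, 56)
block = BasicResidualBlock(64)
y = block(x)  # y has shape (1, 64, 56, 56)
```

When the dimensions do change (e.g., when the number of feature maps doubles and the spatial size halves), the paper matches the shortcut to the new shape either by zero-padding the extra channels or by a 1x1 convolution (projection shortcut); the identity case sketched above is the one that adds no parameters.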
The authors support their claims with two types of networks: plain networks, whose architecture is inspired by VGG nets and which are purely sequential, and residual networks, which are the same plain networks with shortcut connections added. The proposed method is evaluated on the ImageNet 2012 classification dataset. Results show that the 34-layer ResNet is better than the 18-layer ResNet (by 2.8% top-1 error), while the 34-layer plain net has higher validation error than the shallower 18-layer plain net, indicating that residual learning addresses the degradation problem as network depth increases. In addition, the 34-layer ResNet reduces the top-1 error by 3.5% compared to its plain counterpart, confirming the effectiveness of residual learning on very deep models. Furthermore, the 18-layer plain and residual nets are comparably accurate, but the 18-layer ResNet converges faster, showing that residual connections also ease optimization.
It should be noted that an ensemble of ResNets, including 152-layer models, won the ImageNet competition in 2015. GoogLeNet and VGG, the architectures that took first and second place in the previous year's ImageNet challenge, have only 22 and 19 weight layers, respectively. This clearly shows how residual connections enabled a significant increase in the depth of convolutional neural networks.